The Handbook of Data Science and AI
The Handbook of Data Science and AI
2024
Katherine MUNRO
Stefan PAPP
2nd Edition
Zoltan TOTH
Wolfgang WEIDINGER
Danko NIKOLIC
THE HANDBOOK OF
DATA SCIENCE
AND AI
Generate Value from Data
with Machine Learning
and Data Analytics
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
The Handbook of
Data Science and AI
Generate Value from Data
with Machine Learning and
Data Analytics
2nd Edition
Distributed by:
Carl Hanser Verlag
Postfach 86 04 20, 81631 Munich, Germany
Fax: +49 (89) 98 48 09
www.hanserpublications.com
www.hanser-fachbuch.de
The use of general descriptive names, trademarks, etc., in this publication, even if the former are
not especially identified, is not to be taken as a sign that such names, as understood by the Trade
Marks and Merchandise Marks Act, may accordingly be used freely by anyone. While the advice
and information in this book are believed to be true and accurate at the date of going to press,
neither the authors nor the editors nor the publisher can accept any legal responsibility for any
errors or omissions that may be made. The publisher makes no warranty, express or implied,
with respect to the material contained herein.
The final determination of the suitability of any information for the use contemplated for a given
application remains the sole responsibility of the user.
All rights reserved. No part of this book may be reproduced or transmitted in any form or by any
means, electronic or mechanical, including photocopying or by any information storage and
retrieval system, without permission in writing from the publisher.
No part of the work may be used for the purposes of text and data mining without the written
consent of the publisher, in accordance with § 44b UrhG (German Copyright Law).
Table of Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . XXI
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . XXIII
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Stefan Papp
1.1 About this Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The Halford Group . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Alice Halford – Chairwoman . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.2 Analysts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.3 “CDO” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.4 Sales . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.5 IT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.6 Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.7 Production Leader . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.8 Customer Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2.9 HR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2.10 CEO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 In a Nutshell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Cloud Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Stefan Papp
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Cloud Essentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.1 XaaS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.2 Cloud Providers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.3 Native Cloud Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2.4 Cloud-native Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3 Infrastructure as a Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.1 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.2 Distributed Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3.3 Linux Essentials for Data Professionals . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.4 Infrastructure as Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4 Platform as a Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.4.1 Cloud Native PaaS Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.4.2 External Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.5 Software as a Service . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.6 In a Nutshell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4 Data Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Zoltan C. Toth and Sean McIntyre
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1.1 Maslow’s Hierarchy of Needs for Data . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.1.2 Data Architecture Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.1.3 The Structure of a Typical Data Architecture . . . . . . . . . . . . . . . . . . . 74
4.1.4 ETL (Extract, Transform, Load) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.1.5 ELT (Extract, Load, Transform) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.1.6 ETLT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.2 Data Ingestion and Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.2.1 Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.2.2 Traditional File Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.2.3 Modern File Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2.4 Which Storage Option to Choose? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3 Data Warehouses, Data Lakes, and Lakehouses . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3.1 Data Warehouses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3.2 Data Lakes and Cloud Data Platforms . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.4 Data Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.1 SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.4.2 Big Data & Apache Spark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.4.3 Cloud Data Platforms for Apache Spark . . . . . . . . . . . . . . . . . . . . . . . . 110
4.5 Workflow Orchestration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.5.1 Dagster and the Modern Data Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
Table of Contents VII
9 Mathematics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
Annalisa Cadonna
9.1 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
9.1.1 Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
9.1.2 Operations between Vectors and Matrices . . . . . . . . . . . . . . . . . . . . . . 247
9.1.3 Linear Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
9.1.4 Eigenvalues, Eigenvectors, and Eigendecomposition . . . . . . . . . . . . . 251
9.1.5 Other Matrix Decompositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
9.2 Calculus and Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
9.2.1 Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
9.2.2 Gradient and Hessian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
9.2.3 Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
9.2.4 Constrained Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
9.3 Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
9.3.1 Discrete and Continuous Random Variables . . . . . . . . . . . . . . . . . . . . 261
9.3.2 Expected Value, Variance, and Covariance . . . . . . . . . . . . . . . . . . . . . 264
9.3.3 Independence, Conditional Distributions, and Bayes’ Theorem . . . . 266
9.4 In a Nutshell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
X Table of Contents
27 Trustworthy AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 795
Rania Wazir
27.1 Legal and Soft-Law Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 796
27.1.1 Standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 798
27.1.2 Regulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 798
27.2 AI Stakeholders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 800
27.3 Fairness in AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
27.3.1 Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 802
27.3.2 Fairness Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 805
27.3.3 Mitigating Unwanted Bias in AI Systems . . . . . . . . . . . . . . . . . . . . . . . 808
27.4 Transparency of AI Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 809
27.4.1 Documenting the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 810
27.4.2 Documenting the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 811
27.4.3 Explainability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 812
27.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 814
27.6 In a Nutshell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 814
28 Epilogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 815
Stefan Papp
28.1 Halford 2.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 815
28.1.1 Environmental, Social and Governance . . . . . . . . . . . . . . . . . . . . . . . . 816
28.1.2 HR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 817
28.1.3 Customer Satisfaction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 818
28.1.4 Production . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 819
28.1.5 IT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 820
28.1.6 Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 822
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
XX Table of Contents
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 833
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
Preface
■ Acknowledgments
We, the authors, would like to take this opportunity to express our sincere gratitude to our
families and friends, who helped us to express our thoughts and insights in this book. With-
out their support and patience, this work would not have been possible.
A special thanks from all the authors goes to Katherine Munro, who contributed a lot to this
book and spent a tremendous amount of time and effort editing our manuscripts.
For my parents, who always said I could do anything. We never expected it would be a thing
like this.
Katherine Munro
I’d like to thank my wife and the Vienna Data Science Group for their continuous support
through my professional journey.
Zoltan C. Toth
Thinking about the people who supported me most, I want to thank my parents, who have
always believed in me, no matter what, and my partner Verena, who was very patient again
during the last months while I worked on this book.
In addition I’m very grateful for the support and motivation I got from the people I met
through the Vienna Data Science Group.
Wolfgang Weidinger
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
1
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
Introduction
Stefan Papp
tions may encounter when aiming to become more data-driven. By the end of this book, our
hypothetical company will also serve as a model of how a data-driven company could look.
In the chapters in between, we’ll address many of these challenges and provide practical
advice on how to tackle them.
Suppose you, as a reader, would rather not read prose about an invented company in order
to learn about such typical organizational challenges. In that case, we encourage you to skip
this chapter and start with one that fits your interests. As a holistic book on this field, the
authors discuss artificial intelligence, machine learning, generative AI, modeling, natural
language processing, computer vision, and other relevant areas. We cover engineering-
related topics such as data architecture and data pipelines, which are essential for getting
data-driven projects into production. Lastly, we also address critical social and legal issues
surrounding the use of data. Each author goes into a lot of detail for their specific field, so
there’s plenty for you to learn from.
We kindly ask readers to contact us directly to provide feedback on how we can do better to
achieve our ambitious goal of becoming the standard literature providing a holistic approach
to this field. If you feel some new content should be covered in one of the subsequent edi-
tions, you can find the authors on professional networks such as LinkedIn.
And with that said, let’s get started.
“I’m Alice Halford,” she said, “I am the granddaughter of Big Harry Halford, the founder of
this group. He built his empire from the ground up.”
Bob had read all the legends about the old Halford boss. Every article about him made it
clear he did not listen to many people. Instead, “Big Harry” was a proud, determined cap-
tain; one who set the course and demanded absolute obedience from his team. Business
magazines were yet to write much about Alice, as far as Bob knew. However, he had read one
article in preparation for this meeting. Alice was different from the grand old family patri-
arch, it had said. She had won the succession in a fierce battle against three ambitious
brothers, and been selected by the board as chairwoman, thanks to her big plans to transi-
tion the company into a modern enterprise that could meet the Zeitgeist of the 21st century.
“Although successful, today’s generation would call my granddad a dinosaur who just
wanted to leave enough footprints to let the next generation know he had been there,” Alice
said. “Especially in his last years, he was skeptical about changes. Many principal consul-
tants from respectable companies came with heads high to our offices, explaining that our
long-term existence would depend on becoming a data-driven company. However, my grand-
dad always had a saying: The moment a computer decides, instead of a founder who knows
their stuff and follows their gut, it’s over. All the once proud consultants and their sup
porters from within the company thought they could convince every executive to buy into
their ideas of a modern company, but ultimately, they walked out with their tails between
their legs.”
Alice smiled at Bob and continued, “my granddad’s retirement was long overdue, but, fi-
nally, his exotic Cuban cigars and his habit of drinking expensive whiskey forced him to end
his work life. I took over as a chairwoman of the board. I want to eliminate all the smells of
the last century. When I joined, I found parts of the company were highly toxic. My strategic
consultants advised me that every large organization has some organizational arrogance
and inefficiency. They also cautioned me to keep my expectations low. While many enthu
siasts claim that AI will change the world forever, every large organization is like a living
organism with many different subdivisions and characteristics. Changing a company’s
culture is a long process, and many companies face similar challenges. Ultimately, every
company is run by people, and nobody can change people over night. Some might be okay
with changes, a few may even want them to happen too fast, but most people will resist
changes in one way or another.
At the same time, I understand that we are running out of time. We learned that our main
competitors are ahead of us, and if we do not catch up, we will eventually go out of business.
Our current CEO has a background in Finance and, therefore, needs support from a data
strategist. Bob, you have been recommended as the most outstanding expert to transform a
company into a data-driven enterprise that disrupts traditional business models. You can
talk with everyone; you have all the freedom you need. After that, I am curious about your
ideas to change the company from the ground up.”
Bob nodded enthusiastically. “I love challenges. Your secretary already told me I shouldn’t
have any other appointments in the afternoon. Can you introduce me to your team? I would
love to learn more about how they work, and their requirements.”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
4 1 Introduction
“I thought you’d want to do that. First, you will meet David and Anna, the analysts. Then
you’ll meet Tom, the sales director. It would be best if you also talked with the IT manager,
Peter—” Alice stopped herself, sighed, and continued. “Lastly, I arranged a meeting for you
with our production leader, the complaints department, our Head of Security, and finally
with our HR. I will introduce our new CEO, who is flying in today to discuss details at dinner.
I booked a table in a good restaurant close by. But it makes sense if you first talk to all the
other stakeholders. I had my colleagues each arrange a one-on-one with you. You’re in for a
busy afternoon, Bob.”
1.2.2 Analysts
As Alice swept out of the room, a bespeckled man apparently in his mid-forties, and a
woman of about the same age, appeared in the doorway. It must have been the analysts,
David and Anna. When neither appeared willing to enter the room first, Bob beckoned them
inside. He was reminded of an empowerment seminar he’d attended some years ago: The
trainer had been hell bent on turning everyone in the workshop into strong leaders, but
warned that only the energetic would dominate the world. These analysts seemed to be the
exact opposite. David laughed nervously as he entered, and Anna kept her eyes lowered as
she headed to the nearest seat. Neither seemed too thrilled to be there; Bob didn’t even want
to imagine how they would have performed in that seminar’s “primal scream” test.
David and Anna sat down, and Bob tried to break the ice with questions about their work. It
took him a while, but finally, they started to talk.
“Well, we create reports for management,” David said. “We aim to keep things accurate, and
we try to hand in our reports on time. It’s become something of a reputation,” he added with
a weak chuckle.
Bob realized that if he was going to make them talk, he’d need to give his famous speech,
summarized as, “your job in this meeting is to talk about your problem. Mine is to listen.”
After all, he needed to transform Halford company into a data-driven company, and they
were ones working closest with the company’s data.
Bob finished his speech with gusto, but Anna merely shrugged. “The management wants to
know a lot, but our possibilities are limited.”
Bob tried his best to look both in the eyes, though Anna turned quickly away. “But what is
it that prevents you from doing your work without any limits?”
“Our biggest challenge is the batch process from hell,” David spoke up suddenly. “This
notorious daily job runs overnight and extracts all data from the operational databases. It is
hugely complex. I lost count of how often this job failed over time.”
Got them, Bob thought, nodding in encouragement.
“And nobody knows why this job fails,” Anna jumped in. “But when it does, we don’t know
if the data is accurate. So far, there has never been a problem if we handed in a report with
questionable figures. But that’s probably because most managers ignore the facts and fig-
ures we provide anyway.”
“Exactly!” David threw up his hands. Bob started to worry he had stirred up a hornet’s nest.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
1.2 The Halford Group 5
“When a job fails, it’s me who has to go to IT,” David said. “I just can’t hear anymore that
these nerds ran out of disk space and that some DevSecOps closed a firewall port again. All
I want is the data to create my reports. I also fight often with our security department. Some-
times, their processes are so strict that they come close to sabotaging innovation. Occasion-
ally, I get the impression they cut access to data sources on purpose to annoy us.”
“Often, we are asked if we want something more sophisticated,” Anna said, shaking her
head in frustration. “It is always the same pattern. A manager visits a seminar and comes to
us to ask us if we can ‘do AI’. If you ask me honestly, I would love to do something more
sophisticated, but we are afraid that the whole system will break apart if we change some-
thing. So, I am just happy if we can provide the management with the data from the day
before.”
Don’t get us wrong, ML and AI would be amazing. But our company must still master the
basics. I believe most of our managers have no clue what AI does and what we could do with
it. But will they admit it? Not a chance.”
Anna sat back in a huff. Bob did not need to ask them to know that both were applying at
other companies for jobs.
1.2.3 “CDO”
At lunch break, a skinny man in a black turtleneck sweater hurled into the office. He seemed
nervous, as if someone was chasing him. His eyes darted around the room, avoiding eye
contact. His whole body was fidgeting, and he could not keep his hands still.
“I am the CDO. My name is Cesario Antonio Ramirez Sanchez; call me Cesar,” he introduced
himself with a Spanish accent.
Bob was surprised that this meeting had not been announced. Meanwhile, his unexpected
visitor kept approaching a chair and moving away from it again as if he could not decide
whether to sit down or not.
“CDO? I have not seen this position in the org chart,” Bob answered calmly, “I have seen a
Cesario Antonio Rami …”
“No no no … It’s not my official title. It is what I am doing,” Cesar said dramatically. “I am
changing the company bottom up, you know? Like guerilla warfare. Without people like me,
this company would still be in the Stone Age, you see?”
“I am interested in everyone’s view,” Bob replied, “but I report to Alice, and I cannot partici
pate in any black ops work.”
“No, no, no …, everything is simple. Lots of imbeciles are running around in this company—”
Cesar raised his finger and took a sharp breath, nodded twice, and continued. “I know … HR
always tells me to be friendly with people and not to say bad words. But we have only data
warehouses in this company. Not even a data lake. Catastrófica! Its the 21st century, and
these dinosaurs work like in Latin America hace veinte años. Increíble!”
He took another breath, and then continued. “Let’s modernize! Everything! Start from zero.
So much to do. First, we must toss these old devices into the garbage, you know? And re-
place them with streaming-enabled PLCs. Then, modern edge computing services streams
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
6 1 Introduction
everything with Kafka to different data stores. All problems solved. And then we’ll have a
real-time analytics layer on top of a data mesh.”
Bob stared at his counterpart, who seemed unable to keep his eyes or his body still for more
than a moment. “I am sorry, I do not understand.”
“You are an expert, you have a Ph.D., no? You should understand: modern factory, IoT, Indus-
try 4.0, Factory of the Future.”
Bob decided not to answer. Instead, he kept his eyebrows raised as he waited for what Cesar
would say next.
“So much potential,” Cesar went on. “And all is wasted. Why is HR always talking about
people’s feelings? Everything is so easy. This old company needs to get modern. We don’t
need artists, we need people with brains. If I want art, I listen to Mariachi in Cancun. If
current people are imbeciles, hire new people. Smart people, with Ph.D. and experience. My
old bosses in Latin America, you cannot imagine, they would have fired everyone, including
HR. Let’s talk later; I’m in the IT department en la cava.”
Bob had no time to answer. Cesar left the room as fast as he had entered it.
1.2.4 Sales
A tall, slim, grey-haired man entered the room, took a place at the end of the table, leaned
back and presented to Bob a salesman grin for which Colgate would have paid millions.
“I am Tom Jenkins. My friends call me ‘the Avalanche’. That’s because if I take the phone,
nobody can stop me anymore. Back in the nineties, I made four sales in a single day. Can you
imagine this?”
I get it; you are a hero. Bob thought. Let’s turn it down a bit.
“My name is Bob. I am a consultant who has been hired to help this company become more
data-oriented.”
Tom’s winning smile vanished when Bob mentioned ‘data.’
“I have heard too much of the data talk,” Tom said. “No analysis can beat gut feeling and
experience. Don’t get me wrong. I love accurate data about my sales records, but you should
trust an experienced man to make his own decisions. No computer will ever tell me which
potential client I should call. When I sit at my desk, I know which baby will fly.”
“With all due respect. I can show you a lot of examples of how an evidence-based approach
has helped clients to make more revenue.”
“Did you hear yourself just now?” Tom answered, “Evidence-based. You do not win sales
with brainy talks. You need to work on people’s emotions and relationships. No computer
will ever do better sales than a salesman with a winning smile. I’ll give you an example: One
day, our sales data showed that we sold fewer products in our prime region. Some data ana-
lysts told me something about demographic changes. What a nonsense!
So, I went out and talked to the people. I know my folks up there. They are all great people.
All amazing guys! Very smart and very hands-on. I love this. We had some steaks and beers,
then I pitched our new product line. Guess who was salesman of the month after that?
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
1.2 The Halford Group 7
No computer needs to tell me how to approach my clients. So, as long as we get the sales
reports right and we can calculate the commission, all is good. It is the salesman, not the
computer, who closes a deal.”
With that, The Avalanche was on his feet. He invited Bob to a fantastic restaurant—“I know
the owner and trust me, he makes the best steaks you’ll ever taste!”—and was gone.
1.2.5 IT
Ten minutes past the planned meeting start time, Bob was still waiting for the team member
he had heard most about upfront: the IT leader, Peter. His name had been mentioned by
various people multiple times, but whenever Bob had asked to know more about him, people
were reluctant to answer, or simply sighed and told him, “you’ll see.”
Finally, Peter stormed into the room, breathless and sweating. “This trip from my office in
the cellar to this floor is a nightmare,” he said between gasps. “You meet so many people in
the elevator who want something. I am constantly under so much stress, you cannot imag-
ine! Here, I brought us some sandwiches. I have a little side business in gastronomy. You
need a hobby like this to survive here. Without a hobby in this business, you go mad.”
Peter was a squat, red-faced man, who’d been with Halford since he was a lot younger, and
had a lot more hair. He sank a little too comfortably in his chair, with the confidence of a
man who’d been around so long, he was practically part of the furniture.
He doesn’t lack confidence, that’s for sure, Bob thought. I wonder how many dirty secrets this
man has learned over the years that only he knows.
“Okay, let’s talk about IT then,” Peter sighed after Bob turned down the sandwiches. “My
colleagues from the board and the executives still don’t get what it is they’re asking of me
daily. When they invite me to meetings, I often do not show up anymore. We are a huge
company, but nobody wants to invest in IT. I am understaffed; we hardly manage to keep the
company running. Want to go for a cigarette?”
“No, thank you,” Bob said, but Peter was already crumpled pack from his trouser pocket. He
rambled all the way to the smoker’s chamber, bouncing around from one topic to another.
Bob learned everything about Peter, from his favorite food over his private home to his
hernia, which was apparently only getting worse. Once Peter got first cigarette into his
mouth, he went back to the topic Bob was really interested in.
“The suits want things without knowing the implications. On the one hand, they want
everything to be secure, but then again, they want modern data solutions. Often, they ask
me for one thing one day, and then the very next, they prioritize something else. To be blunt,
I had my share of talks with these external consultants. If I allowed them to do what they
asked me to do, I could immediately put all our data on a file server and invite hackers to
download it with the same result. To keep things working, you need to firewall the whole
company,” Peter stubbed out his cigarette, and reached for another.
Bob leaped at the chance to interject. “Can you tell me more about your IT department? I was
looking for some documentation of the IT landscape. I have not found much information on
your internal file shares. Which cloud provider are you currently using?”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
8 1 Introduction
Peter laughed and then started coughing. Tears in his eyes, he answered. “I told you, I’m
understaffed. Do you really think I have time to document?” He pointed to his head. “Don’t
worry, everything is stored in the grey cells up here. And we have a no-cloud strategy. Cloud
is just a marketing thing if you ask me. When we build by ourselves, it is safer, and we have
everything under control.
If I just had more people … Did you meet one of my guys, Cesar? He is also okay when he
does not talk, which unfortunately doesn’t happen often. I don’t like when people think they
are smarter than me. He doesn’t know Peter’s two rules yet. Rule Number 1: Do not get on
your boss’s nerves. Rule Number 2. Follow rule number 1.”
Peter laughed, flicked the second cigarette on the ground, and retrieved a bag from his other
pocket. It was full of caramels: Peter popped one into his mouth and continued, chewing
loudly. “Alice asked me if I could introduce you to Bill, my lead engineer, but I declined. This
guy has the brains of a fox but the communication skills of a donkey. He also gets nervous
when you look him straight in the eyes. I am always worried that he might wet his pants— Or
am I being too politically incorrect again? Our HR keeps telling me that I should be more
friendly. But in this looney bin, you learn to let our your stress by saying what you think. So,
please excuse my sarcasm. I am the last person standing between chaos and a running IT
landscape, the management keeps getting on my nerves with stupid requests, and last but
not least, the HR department is more concerned about how I communicate than about find-
ing the people who could help me keep our company running.”
It took a couple of attempts until Bob could finally break free from Peter’s complaining to
head to his next meeting. Even as he was leaving, Peter repeatedly called on Bob to visit his
food business sometime, where they could have a drink in private, and Peter could share his
Halford ‘war stories’ more openly.
1.2.6 Security
While waiting for the HR representative, Bob received a voice message from Suzie Wong,
the head of data security. When Bob played it, he heard traffic sounds in the background.
“Apologies for not showing up. School called me in as one of my kids got sick. I hope a voice
message is fine. I am Suzie Wong. I have been with Halford for years. They call me the hu-
man firewall against innovation. I take this as a compliment because, in some way, it means
I am doing my job well. Could any company be happy with a Head of Security who takes her
job easy? My predecessor was more laid back than I am. He was in his fifties and got a little
too comfortable, thinking he would retire in a secure job. And then one day … there was this
security breach. His kid’s still in private school, he’s suddenly without a job and, well, I’ll
spare you the details.
People often think I’m only around to sign off on their intentions to use data, but my real job
is protecting our client’s privacy. Data scientists must prove to me that our client’s data is
safe when they want to work with it. Unfortunately, too many take that too lightly.
If the requestor follows the process, a privacy impact assessment could be done within a
week. I will send you a link to our security portal later so you can review it. You’ll see for
yourself that we do not ask for anything impossible.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
1.2 The Halford Group 9
I am the last line of defense, ensuring that we do not pay hefty fines because someone
thought it was just data they were playing around with. Some people also jokingly call me
‘Mrs. No,’ because this is my common answer if you cannot express why I should grant you
security exceptions or provide access to data containing clients’ private information. Some
people complain that this way, it may take months to get security approval. But so long as
engineers and data scientists still don’t get how to address security matters correctly, I don’t
care if it takes years before I give my final OK.
Anyway, excuse me now, I’m at the school …”
Bob had some time before his next meeting and looked up his next meeting partner online.
He discovered a middle-aged man with a long history on social media, including some ques-
tionable photos of his younger self in a Che Guevara t-shirt. Bob chuckled. That young man
could be happy that their interview wasn’t taking place during the times of the Cold War.
Finally, Bob’s interviewee entered the room. He was muscular, and his bushy black beard
showed the first signs of greying.
“My name is Hank. Pleased to meet you,” he said with a deep voice.
“I heard you are new in your position,” Bob said.
“Yes. Alice fired my predecessor because he was a tyrant. I am now one of the first of what
she calls ‘the new generation.’ I accepted because I can change things here now. Let me get
to the point: What are you planning to do?”
Bob smiled and said, “the idea in factories is often to use machine learning for automation.
Think of processes where people check the quality of an item manually. Imagine that you
can automate all this. A camera screens every piece, and defective items — which we call
‘rejects’ — are filtered out automatically.”
Hank stiffened. “My job is to protect jobs, not support removing them. Some of our factories
are often in villages, where they are the only source of work.”
“Almost every country goes through demographic changes. Can you guarantee that you will
be able to maintain a strong enough workforce to keep the factories running? How about
doing the same with fewer people?”
“But if you remove a few people, they can end up out of work,” Hank said. “What if you don’t
need workers at all in a few years? I don’t want to open the door to a system that makes the
bourgeoisie richer and put the ordinary proletarian out of work.”
“That is very unlikely,” Bob said.
“I see you are solidary with your employees, Hank. Did you consider exploring use cases
to protect them? We can use computer vision to see if factory workers wear helmets, for
example.”
Hank looked deeply into Bob’s eyes. Bob couldn’t quite tell if it was a good or bad sign, be he
did realize something: this was not a man he’d like to meet on a dark, empty street.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
10 1 Introduction
“I understand that there might be benefits for my colleagues,” Hank said. “I just want to
open up a trojan horse: I get one IT system in to prevent accidents, and the next one makes
the workers obsolete. But I promised Alice I’d support her. She is a good person. I will talk
with my colleagues. I need to get them on board, but one thing is not negotiable: We will
never tolerate any system that completely replaces people who need the job they have.”
The next interviewee, an elderly woman with perfectly glossy, silver hair, entered the room.
She sat down and carefully ran her fingers over classic French bun, ensuring not a hair was
out of place.
“I am Annie from the complaints department,” she said with something of an aristocratic
tone. She seemed more interested in her neatly manicured nails than Bob as she went on.
“I honestly do not know why you want to talk to me.”
“Well, part of a data-driven enterprise is often also a customer-first strategy. We can mea-
sure customer churn and other metrics through data. Most of my clients want to use data to
maximize success. They even renamed their departments to ‘Customer Satisfaction Depart-
ment’ to underline this.”
“Aha,” Annie said. There was an uncomfortable silence as she polished the face of her an-
tique watch with her other sleeve.
Bob cleared his throat, anxious to get her attention. “Would you be interested to learn more
about your customers through data?”
“Why should I?”
“To serve them better?”
“We have sturdy products. Most complaints have no base. We believe the less money we
spend on confused customers, the more we have left to improve our products. This is what I
call the real customer value we provide.”
Ah-hah. Bob recognized the famous argument against investing in any domain that doesn’t
directly create revenue. She probably gets a bonus for keeping yearly costs low, he thought,
seeing an opportunity.
“And how do you keep costs small at the moment?”
“We have an offshore call center. They handle about 80 % of calls, although a lot of those
customers just give up, for some reason. The remaining 20 % are forwarded to a small team
of more advanced customer support employees. I know it sounds harsh, but you cannot
imagine how many confused people try to call us without having a problem at all. Some – it
seems – call us just to talk.”
“Right. And have you thought of the possibility to reduce costs by building chatbots backed
by generative AI? There are also many ways to use data science to filter customer com-
plaints. If properly trained, your clients get better support, and you reduce costs.”
“Would it be good enough to shut down the offshore center?
Gotcha. “If done right, yes.”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
1.2 The Halford Group 11
For what felt like the first time, Annie looked at Bob directly. “How much would it cost?”
“At the moment, it is still difficult to estimate.”
Annie thought a while, then stood up to leave. At the door, she paused. “Once you know, call
me immediately.”
1.2.9 HR
“I’m, I’m Pratima,” came a woman’s voice at the door. She approached Bob, looked up at him
with a welcoming smile and asked, “how can I help you, Bob?”
“Hi, Pratima. Let’s take a seat. As you know, I’m here to transform this company into a more
data-oriented one. I saw on LinkedIn that you have previously worked for very modern
companies with a strong data culture. How is it now to work for a company at the beginning
of its journey?”
“Alice asked me to be open to you. I took this job as a career step to advance to leadership.
However, the Wheel of Fortune led me to more challenges than expected.
In my previous job, we had the vibes to attract new talent. It was an environment primed for
excellence: fancy office spaces, a modern work culture with flat hierarchies, cool products
to work on, and many talented, diverse colleagues. Recruiting was easy because new candi-
dates felt it the spirit of our community.”
Pratima sighed.
“In this company, though, we cannot hide that we are at the beginning of our transition.
Applicants usually have many offers to choose from. Sometimes, we have to watch perfect
candidates walk away because we do not yet provide a warm and welcoming environment
for data professionals.
When managers discuss AI and data transition, some might oversee the human aspect.
What if you create the perfect data strategy but cannot attract enough talent? Many com
panies face this problem, and an elephant is always in the room. To become a data-driven
company, you have to create an environment that attracts people who think differently, and
this means changing your culture.
“Do you believe management is scared to promote too much change because it is afraid to
lose everything?”
“I understand that some seasoned employees might get disappointed and even resign if
their comfortable environment starts to modernize. But at the same time, if you do not
change at all, you are stuck in the mud, and your competition will make you obsolete. The
Dalai Lama says we should be the change we wish to be.”
“Right. And I believe it was Seneca who once said, ‘It’s not because things are difficult that
we dare not venture. It’s because we dare not venture that they are difficult.’”
“True! But I have to go now. I am looking forward to continuing our talks.”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
12 1 Introduction
1.2.10 CEO
Alice and Bob met at a fusion restaurant downtown in the evening. Alice introduced Bob to
Santiago, the long-time CFO turned new CEO. After an excellent meal, they ordered some
famous Armenian cognac, and got down to the real discussion.
“I’ll be honest with you, Bob,” Santiago began. “All your ideas to transform Halford sound
fantastic, but as an economist and a numbers person, my first question is, how much will
this all cost?”
Oh boy. Bob was prepared for the question, but he knew Santiago wouldn’t like the answer.
“It depends,” he said, and Santiago looked about as dissatisfied as Bob would have expected.
“I understand that everyone looks at the costs,” Bob continued, “but history is full of com
panies that failed to innovate and went bankrupt as their competition moved forward. If you
see the full spectrum of artificial intelligence, hardly any company will eventually operate
as before.”
“Some companies recommend that we start with data literacy workshops to enable leaders
to interpret data and numbers efficiently. Literacy sounds as if they want to teach us to read
and write again—and for a huge amount of money, of course. Don’t get me wrong, please.
I understand that we need to innovate, but if I approve everything consultants suggest to
me, we will soon be broke.”
“But if your leadership team cannot ‘think in data’,” Bob said, making air quotes as he
spoke, “how do they expect to attend our planned strategy workshop on exploring specific
data science options for our business goals?”
“What is the difference?”
“In the data literacy workshops, we aim to create an understanding of how to interpret data.
In the strategy workshop, we’ll create a list of use cases to improve processes in your com-
pany, and prioritize them, to integrate new data solutions gradually.”
“I understand that we have some tough nuts to crack. Some of our employees do not believe
in becoming data-driven, and we may need to invest hugely in Enablement. We once asked
external companies to help us modernize our IT. No consulting company gave me a quote with
a fixed price for a transition project. They always said we were facing a hole without a bottom.”
“Leadership is the only way to move forward. If the executive team is convinced and aligned,
this culture can spread.
Your operational IT will need to mature and modernize gradually. However, be aware that an
analytical layer can be built outside of corporate IT. One risk is to make data transition to an
IT problem; IT is part of it, but becoming a data-driven company is far more than giving
some engineers a job to build platforms.”
“For me, it’s clear,” Alice said. “Either we modernize, or we gradually fade out of existence.
Bob, what do you need to help us?”
Bob looked from one to the other, carefully considering his next words. “Becoming data-
driven does not mean hiring a bunch of data scientists who do a bit of magic, and suddenly
the company makes tons of money using AI. As I said, the first step is to align the stakehold-
ers. For me, this is the alpha and omega of AI: creating a data culture based on critical
thinking and evidence-based decisions. ”
“Great,” answered Alice. “Let’s get started with that.”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
1.3 In a Nutshell 13
■ 1.3 In a Nutshell
Expectation Management
Most companies see the need to become data-driven, as they understand that
those organizations that ignore technical evolution mostly fail.
Some employees might have unrealistic expectations about how fast a tran
sition can go. We highlight that changing to a data-driven company is not just a
change of practices and processes, it is often a cultural overhaul of how the
company does its business.
Many employees fear having to give up some of their autonomy, or even losing
their jobs to computers entirely, if AI is introduced at their company. An organi-
zation that transitions to become data-driven must address this.
Technology Focus and Missing Strategy
Some companies try to find a silver bullet that solves all problems. “We’ll just
use this technology, just apply AI in this or that way, and all our problems are
resolved,” they think. Being too technology-focused, however, is an anti-pattern
that can hinder a company’s evolution to becoming data-driven.
Data Science and AI are about more than just Understanding Frameworks
and Methods
While it is essential to have a team of skilled data scientists and AI engineers
to pick the right AI frameworks and build complex AI systems, for large organi-
zations, there are many other considerations to watch for. Not being able to
understand the needs of an organization and where AI can make a difference is
a risk. With the wrong target, every strategy will fail.
Collaboration between Analysts and IT
In some companies, IT provides the platforms that analysts have to use.
If these platforms are error-prone or old, it can get frustrating for analysts.
In modern environments, not all analytical platforms must be managed by
one central IT department. This can give data teams more freedom to operate
on their own.
IT
Many IT teams lack the resources to build the data pipelines needed for data
science platforms. Often there is a gap between business users and engineers,
making it hard for them to communicate with each other.
IT, especially operations, is often focused on preventing problems. As a result,
many strive to protect their systems from change. They want to make it difficult
to access data in order to keep platforms secure. Data scientists, however,
would like to access data easily, to make progress quickly. This can lead to
friction between both teams.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
14 1 Introduction
Costs
Introducing machine learning, data science and artificial intelligence can be ex-
pensive. It is rarely easy to say how much data science will impact the company’s
results: the relationship between inputs (such as time, effort, and resources) and
outputs is anything but deterministic. The alternative, however, is even more grim.
A company that is not ready to invest in innovation, will eventually lose its compet-
itiveness and risk bankruptcy.
Data and Privacy Protection
Data and Privacy Protect may slow down some projects and make them more
bureaucratic, but they are absolutely necessary. In addition, it’s most likely that
nobody wants to live in a system where privacy is not respected. In a day-to-day
job, privacy protection is process-driven: Making these processes transparent
and efficient.
Hiring
Introducing data science may require more significant changes in the corporate
structure or culture, which could reveal hidden conflicts and challenges nobody
wants to talk about.
Data professionals are a rare breed, and as there are few of them on the job
market, how can a company even think about change without the required skills?
Attracting engineers and scientists often requires an offer that goes beyond free
fruit in the office.
2
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
“We have two ears and one mouth, so we should listen more than we say.”
Zeno of Citium
How could the human ego of individuals affect the success or failure of AI
and data initiatives?
What is bias, and why does it matter?
How do you objectively find your data science and AI use cases and
goals?
How to conduct workshops to become more data-driven and how to
address common challenges?
Let’s start with a hypothesis: Every form of data science and, eventually also, artificial intel-
ligence, evolves from evidence-based decision-making based on critical thinking. Even if the
most extraordinary minds came together and were provided with endless resources, they
would fail if their work was based on incorrect conclusions, facts, and assumptions.
In Chapter 1, we introduce Tom, the salesman. As someone who does not trust technology,
he wants to make decisions based on gut feelings. Let’s imagine that Halford’s HR manager
hires a new salesperson, Cherry, who relies on facts and logic. Assuming she is equally
qualified and experienced as Tom, will she be more successful than him?
Science is about evidence. The scientific method shall eliminate subjective impressions or
intuition when exploring a topic. The data scientist’s job is to bring proof to the company-
proposed ventures and then, in a second stage, automate this evidence-making through
statistical inference using data pipelines and, potentially, machine learning models. If we
imagine an organization that is entirely data-driven and fact-based, we might think of an
entity that organizes its operation entirely to collect hypotheses about what is good or bad
for the organization and produces, in a highly automated way, proof and disproof. Relying
on gut feelings and intuition might be disregarded as outdated as medieval superstition.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
16 2 The Alpha and Omega of AI
In later chapters, we will show that generative AI does not always provide evidence-based
information and that artificial intelligence can be biased, just like humans. Does this mean
Cherry, who loves logic and facts, can be as biased as Tom if she relies on AI?
Let’s start with a story to explore this hypothesis from a different viewpoint.
2.1.1 Bias
“We are a manufacturing company,” says Rachid. “Looking at all the cases I have investi-
gated, I see an enormous potential to reduce the number of defective items in our production
processes. I know the production team is still hostile to our work, but they must see the
value of what we do here.”
“Maybe this hostility is an excellent reason to start our first use case elsewhere,” Mahsa
replies. “If we screw up our first project and earn a bad reputation, it will take a long time
to win back the trust of the factory staff. We have allies in the customer complaints depart-
ment; they have a new boss who wants to prove herself, and she told me that she wants to
be the first department in Halford that is fully data-driven. We should start our data journey
by creating an AI chatbot to make our clients feel great again.”
“Come on,” says Rachid, “complaint management is not our core business. Let’s focus on
something that generates revenue.”
“We make money with happy clients,” Mahsa counters.
“Yes, and my data science cases will ensure that our clients get better products.”
“You said it yourself in the first few days: when it comes to data quality, it’s garbage in,
garbage out. I would not want to give the factory data to my worst enemy. It will take us a
month to bring the factory data to an acceptable level.”
“But no dull chatbot solves a business problem,” Rachid insists. “I worked with factory data
before. I know what I’m talking about. I understand you were among the best students to
graduate from your university, but what did you do besides exploring Language Models in a
university course? Do you have any practical experience?”
Mahsa shakes her head and adds loudly, “I worked hard to create this strategy with the new
complaints department manager. If we start with something else, all my efforts will be
wasted.”
***
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
2.1 The Data Use Cases 17
Although Mahsa and Rachid are just beginning their careers, this type of conversation also
happens with more experienced professionals. Both promote their ideas and ignore the
other’s opinions. Although their arguments are based on reasoning, they fail to provide
conclusive evidence. Ultimately, their goal is to gain a personal benefit: For Mahsa, it is to
nurture her excellent relationship with the complaints department manager, and for Rashid,
it is to utilize his experience from past projects. The discussion becomes more and more
emotion-driven. They may be stuck in a cycle of confirmation bias.
They might sleep on this conflict, and maybe the next day, they will wake up with a fresh
mind and change their views. Mahsa might acknowledge that Rashid had a point by saying
that production data is crucial for the company’s success. Rashid might agree that Mahsa’s
reasoning for starting in a different department might be the more risk-averse approach. In
the worst case, they’ll both remain stuck in their views, and their differences might result
in a feud unless they have a manager with enough leadership skills to help them resolve
their conflict.
Confirmation Bias
1
Copyright: https://www.simplypsychology.org/confirmation-bias.html
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
18 2 The Alpha and Omega of AI
Bias is one of the first lectures students receive when learning about statistics.
It’s important because, in the end, we need to be first able to trust ourselves to
interpret data correctly before we can make assertions about our environment.
The data science life cycle usually consists of the following steps:
1. Create a hypothesis about what you can change using insights from data
2. Make a proof of concept (PoC)
3. Automate the solution
However, there can be debates until a solution is in production, having first achieved multi-
ple rounds of positive evaluations and feedback. Let’s assume Mahsa found a way to intro-
duce an AI chatbot to manage customer communication, and she demonstrated how clients
can communicate with that chatbot. Rashid might still object and claim to be able to bring
more value than Mahsa if he had gotten the budget for the first factory data use cases by
then.
Even if the first tests indicate that Mahsa’s innovation might reduce the demand for call
center employees, it might take months to verify whether overall customer satisfaction im-
proves as a result. Maybe Mahsa is right, and she will eventually become the hero of the
complaints department manager, and all the happy customers they’ll be able to serve. An-
other possible outcome is that Rashid’s hypothesis is correct: It does not matter if customers
chat with a chatbot or get answers from a human who does not immediately understand
their problems; What customers really want are better-quality products.
Exercises
One question in the dialog in Section 2.1.1 deals with the actual value for the company. In
this example, we talked about reducing the rejects of a factory run, but what does this mean
in numbers? A statement such as “Each day, we are reducing the number of rejects by 30%”
sounds excellent, but what does this mean without any reference?
In our example, economic improvement is the guiding principle. We need to get the data
concerning metrics connected with increasing long-term profits. One example of such a fi-
nancial metric is the return on investment, or amortization; Let’s try to understand it
using our sample factory example. With a new system, we can improve our production
pipeline and save around 1000 EUR daily, bringing us to a staggering 365,000 EUR annu-
ally. But what if the investment costs 2,000,000 EUR? Well, in approximately five and a half
years, this solution will have amortized. But what if the production pipeline is planned to be
shut down in five years as a new generation of factory pipelines will replace it? Another
question is opportunity costs. Imagine that for 2,000,000 EUR, I could reduce costs by more
than 365,000 EUR in another use case.
How confident are we that our figures are correct? Are we sure we haven’t overlooked any-
thing? Maybe even something that makes everything else obsolete? Which risks do we face
if something unexpected happens?
This is the moment where data science meets business expectations. Understanding com-
plex systems, how they interact, and the metrics that describe them is more than just a job
for a business analyst. A workflow for taking a data science case from beginning to end
could look as follows:
1. Identify a business goal: The management wants to increase factory automation to in-
crease quality and reduce manual efforts.
2. Create Hypotheses: Currently, employees manually check the output of our factory
pipeline for products with quality deficits. We hypothesize that “detecting rejects through
computer vision algorithms will reduce costs and improve quality.”
3. Provide a value proposition: We aggregate the hourly costs per line for manual quality
checks. We also estimate the costs of creating a system for automated quality checks and
its deployment on all lines. We determine that a return on investment is likely in three
years.
4. Create a proof of concept: We create a prototype, and using training data, the first re-
sults are promising. Based on our model and further learning during the PoC phase, we
conclude that our hypothesis and value proposition are highly accurate.
5. Deploy the first test version: We install video cameras in one of our smaller production
pipelines, and mechanical engineers add a mechanism that removes an item from the
pipeline if the computer vision model notifies them that it has deficits.
6. Continuously improving the product to maturity: Our first test version still needs
much improvement. We collect data to measure the efficiency of our algorithm and grad-
ually improve our solution. When our algorithm’s performance matches human workers,
we deploy the solution to various factories. The computerized system is still being im-
proved. Eventually, the new system will outperform the old approach and be deployed
everywhere.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
20 2 The Alpha and Omega of AI
7. Have a retrospective: After our project is successful, we meet to discuss how we can be
even more successful in the future.
Skeptics may see the challenges that a company might face. What if the value proposition is
based on incorrect figures? What if the costs of this migration are miscalculated and sig
nificant cost factors are overlooked? How can a company’s executives be sure that the teams
implementing a use case are transparent about problems that could diminish the project’s
success? What about failed use cases? If a company loses 100,000 USD per failed use case
on average, it will need to gain 500,000 USD after five failed use cases in a successful use
case to break even. What if all the efforts do not pay off? In this chapter, we want to argue
why pursuing a transition to being more data-driven is still important, even though this
transition may be costly.
Exercises
R+D Departments
Besides, employees with unique personalities in companies are shaped by their past. Imag-
ine a company that had been hacked and lost some of its reputation and money. This com-
pany would take safety issues particularly seriously. The average age could also have an
influence. A company with more young employees could act differently than a company
with older employees on average. International companies are exposed to many cultures,
which could mean more opportunities and misunderstandings between colleagues than
with companies operating only in one country.
Education is an excellent example how the fear of change can prevent the
adoption of AI-driven processes.
We see what generative AI can do now, and it is not difficult to imagine what
next iterations of frameworks such as GPT might be capable of. AI algorithms
can perfectly analyze students’ performance, and provide comprehensive
feedback. They will not oversee mistakes because they are tired or are affect-
ed by private problems, and no student will ever be able to complain that they
got a worse grade because the AI did not like them as a person.
Generative AI will be able to give more detailed feedback than any teacher,
eventually. AI might isolate the exact knowledge that students are missing for
positive grades. They might even generate individual tests for students that
still have to grasp specific details: some language learning students could be
specially tested on their grammar, and others, on their vocabulary; some
math students may be analyzed as capable of understanding formulas but
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
22 2 The Alpha and Omega of AI
lacking skills in solving complex problems, and they could be coached and
tested accordingly. With data it is even possible to determine when students
learn best, and which educational methods are most effective. Each student
can have an individually AI-optimized curriculum.
In a fully digitally transformed and AI driven school, teachers can see AI as
tool that helps them to succeed, and they may see themselves more like
mentors for students to support them to get most out of working with AI. Or
teachers may see AI as a threat, an inhumane system who degrades them
and takes their freedom. In the end, they might start a counter revolution with
the slogan: “Hey, AI, leave us teachers alone.” Regardless of whether their
rebellion is justified or not, they would call for a firewall against AI. And since
the educational system also includes students and parents, they also might
contribute a range of positive and negative views regarding AI-based learning.
When contemplating the possibility of fully AI-driven organizations, it is easy
to imagine what people might be afraid of. Employees might not exactly fear
having to fight the Terminator, yet they may still be anxious about losing their
authority and sovereignty. Jobs relate to status, and people take pride in
them. If surgeries are automated, the prestige of the ‘demigods in white’ will
diminish. If computers find ways to make everything superefficient, people
lose spontaneity and the freedom to be inefficient occasionally.
However, the harsh reality is that no society can ignore artificial intelligence,
unless they explicitly want to live without technology, such as the Amish do.
Many countries face demographic changes, and companies cannot find
enough qualified employees to replace those who retire. If not with automa-
tion through AI, how else shall we solve this problem?
AI, however, will disrupt the status quo. Some will welcome these disruptions,
while others will fear them. At the beginning of this section on culture shock,
we introduced a company with many individuals who each have their own
priorities and goals. Some might see an opportunity to get more influence
through a digital transformation, others might be afraid of it, and might even
try to sabotage it. The small political games that some employees play may
turn out to be the biggest challenges some companies will ever have to face.
In his book, “The Geek Way,” Andrew McAfee suggests that success in an innovative envi-
ronment requires a specific mindset.2 This mindset, he argues, must include the following
attributes:
1. Passion and depth of knowledge: Geeks often have deep knowledge in specific areas,
and this can be a massive asset in professional environments that value expertise and
detailed understanding.
2. Creativity and problem-solving: The ability to think outside the box, often seen in geek
culture through gaming, fantasy, and sci-fi, is valuable for innovative thinking in the
workplace.
2
https://www.gartner.com/en/articles/what-s-new-in-artificial-intelligence-from-the-2023-gartner-hype-cycle
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
2.2 Culture Shock 23
3. Technology savviness: Given the digital nature of the modern world, the book empha-
sizes the significant advantage of being tech-savvy, a common trait among geeks.
4. Community and collaboration: McAfee also discusses how the communal aspects of
geek culture, like sharing knowledge and collaborating on projects, translate well into
team-based work environments.
Exercises
■ 2.3 Ideation
As a rule, successful data science use cases lead to at least one of the following three out-
comes:
Reducing costs,
making faster decisions,
or penetrating new markets.
We’ve seen that personal opinions and feelings are insufficient to decide on the appropriate-
ness of a data science case. Still, to get started, we must ask the right questions about what
we want to do.
In this section, we’ll show you how to organize ideation workshops to find the right ideas,
and how to group and prioritize them by relevance. Lastly, we will show you how to test
them.
Everything starts with a strategic goal that gives a data science team direction. On the high-
est level, goals are similar in different organizations: Make more profits, be greener, make
your employees happier, and so on. After going into the details, the strategy becomes more
precise. Here are some examples:
Telecommunications: “We want to increase revenue by reducing customer churn.”
Manufacturing: “Due to demographic changes, we need to automate. We want to produce
more in our factories, with fewer people.”
Automotive: “We want to have level five autonomous driving capability. We want to solve
all obstacles to reach our goal within the next two years.”
Financials: “We aim to be the best investment broker and give our clients better insights
into securities and bonds than our competitors.”
Any company: “We want to reduce CO2 emissions by 50 %.”
Understanding the context of a goal—for example, the business setting—is essential to ap-
proaching a solution.
Knowing its goals, a company can organize workshops and invite domain experts, data sci-
entists, and other contributing parties. The aim is to brainstorm. It makes sense to have this
workshop outside the usual business settings. The teams should experience an environ-
ment which helps them be creative and find solutions to achieve the company’s goals.
With the goals in mind, the teams can start investigating current processes and workflows.
Then, they can start creating hypotheses, which can still be based mainly on intuition and
personal beliefs. Such assumptions may be:
“I believe costs are the primary concern of our clients. If we guarantee our customers that
we will not increase our service costs, they will stay loyal.”
“I see a lot of rejects in our factory pipeline. If we find a way to reduce the number of
rejects, we can save money.”
“Our head office is ancient. It must be a source of energy waste. We should be able to do a
lot to improve its energy efficiency, increasing our reputation as a responsible company.”
“We have a complicated sales process and will sell more if we simplify it.”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
2.4 Design Process Models 25
These hypotheses are still personal assumptions and highly subjective. The next step is to
collect the data and look for facts to prove or disprove our hypotheses.
Organizing a Workshop
Ideation workshops can help bring clarity into a company’s data strategy.
Organizing them often requires a lot of preparation to ensure success. It
makes sense to pick a location where people can relax, away from the
office. Events need to be adjusted to the audience. Team building games
can help people to open up, but some teams are not made for games.
It is also essential that there is enough material available for the workshops,
such as cardboard or crayons. Many hotels provide rooms for seminars, and
they are commonly well-equipped.
The Dalai Lama is quoted with saying, “when you talk, you are only repeating
what you already know. But if you listen, you may learn something new.”
Some people have the reputation that they only talk just to hear themselves
talking. But if the same people who have been talking a lot in the past, keep
talking, why should things change? In such constellations, it can be helpful
to work with an external moderator.
Exercises
How can you ensure within a workshop that introverts talk more, and
extraverts talk less?
Cross your heart! Are you typically the one who talks the most? How do
you stop yourself?
In your view, what would be the perfect environment for a workshop to
envision new strategies for a company?
Do you believe that inveterate pessimists should be excluded from vision-
ary workshops, as their negative views might kill every creative process?
Or are the pessimists the only force that stands between real solutions
and the castles in the air created by inveterate optimists left alone?
solution. You can pragmatically adjust processes to your organization’s needs. Like with
every creative process, success in crafting a data process model depends significantly on
putting the right people in the right setting and creating the right atmosphere to work on the
right problems.
There are various design process models, each with pros and cons for different situations.
We will now explore two of the most common models: Design Thinking, and the Double
Diamond framework. Then, we discuss how such workshops can conducted.
Design thinking is a non-linear, iterative process, which teams use to understand users,
challenge assumptions, redefine problems, and create innovative solutions to prototype and
test.3 It involves five phases:
1. Empathize: Understand the needs of the user.
2. Define: Clearly articulate the user’s needs.
3. Ideate: Brainstorm potential solutions.
4. Prototype: Create a model solution.
5. Test: Try out the solution and refine it based on feedback.
This leads us to another hypothesis. If we need such a method, one of our problems might
be that we believe in knowing what our users or customers want. A cynic might call us now
“wanna-be mind-readers.” If we look this behavior up in the context of psychology, we might
explore consensus bias.
It is very tempting to say, “I have seen our colleague’s work, and occasionally, I chatted with
them during a break. I know what they need to work better.” Such an attitude often results
in a failure to truly grasp a user’s needs. Using the design thinking methodology, however,
forces us to take our clients seriously and challenge ourselves to use a systemic approach to
understand the users’ requirements.
Design thinking is often seen as less structured than the double diamond model, which
we’ll learn about next. It gives a team more freedom and is more user-focused. But from an
entrepreneurial viewpoint, is being focused on the customer not the ultimate recipe to be
successful?
3
https://www.interaction-design.org/literature/topics/design-thinking
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
2.4 Design Process Models 27
The British Design Council announced this design process model in 20054, adapted from
the divergence-convergence model proposed in 1996 by Hungarian-American linguist Béla
H. Bánáthy.5
The two diamonds represent exploring an issue more widely or deeply (divergent thinking)
and then taking focused action (convergent thinking). Readers interested in positive psy-
chology might see parallels with Barbara Fredrickson’s Broaden and Build theory.6
The double diamond suggests that the design process should have four phases:
1. Discover: Understand the issue rather than merely assume it. This involves speaking to
and spending time with people affected by the problems you want to solve.
2. Define: The insight gathered from the discovery phase can help to define the challenge
differently.
3. Develop: Give different answers to the clearly defined problem, seeking inspiration from
elsewhere and co-designing with various people.
4. Deliver: This involves testing different solutions at a small scale, rejecting those that will
not work, and improving the ones that will.
Like design thinking, double diamond forces teams to explore what users really need as a
team. In many cases, it turns out that the know-it-alls, who want to design systems solely
based on their own authority, wits and experience, are not aware of many details that regu-
lar users are worried about.
Exercises
4
https://www.designcouncil.org.uk/fileadmin/uploads/dc/Documents/ElevenLessons_Design_Council%2520%
25282%2529.pdf
5
https://www.designcouncil.org.uk/news-opinion/what-framework-innovation-design-councils-evolved-double-
diamond
6
https://positivepsychology.com/broaden-build-theory/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
28 2 The Alpha and Omega of AI
7
https://www.rice.edu/jfk-speech
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
2.4 Design Process Models 29
tactical ones for specific business lines. Managers then create individual goals for their
team members to help them contribute to the broader strategic goals.
Automation initiatives may target departments where value is created for clients. By stan-
dardizing processes, overall quality also improves. Employees are freed up to focus more on
the innovation of forthcoming products. HR might set specific goals to be an inclusive em-
ployer and to increase diversity, as this supports all goals of being perceived as a modern
company. Sales goals might include penetrating new markets and manufacturing depart-
ments wanting to increase product quality.
The critical element in the AI era is that AI or data science can be decisive in reaching
almost every organization’s goals. This makes AI powerful for companies and creates
market risks if ignored. This also includes not-so-obvious scenarios, such as HR goals. Using
data, every HR department learns about the perception of a company on the market, and
they can better identify possible friction. Some countries might forbid the assessment of the
mood of employees, though. However, consider the value for companies that value inclusion
if algorithms can detect hate speech or discrimination in employees’ behavior.
The question is not whether AI and data can help but whether the company is ready
to utilize AI and data. As mentioned before, transitioning to becoming more data- and AI-
driven sometimes requires people to leave their comfort zones. Some organizations, there-
fore, connect goals with bonus payments. Money can be a solid incentive to encourage em-
ployees to explore unknown territory, which might initially seem unpleasant to some of
them. Still, more is needed than only bonus payments; strong leadership skills are also re-
quired to ensure everyone contributes to the common goals.
The more clearly a company’s strategic goals are defined, and the better the team is aligned
to accomplish them, the more successful AI and Data Science workshops will be. This again
highlights how crucial top management support is and that it may be better to develop an
inclusive strategy before starting with workshops for AI and data science use cases.
SWOT Analyses
Note also that some believe SWOT should be replaced by TOWS, which
represents a better order of execution: company’s ought to first examine
Threats and Opportunities in the market, in order to best contextualize their
own Strengths and Weaknesses. Failure to do so can result in “abstract,
navel-gazing discussions,” as Michael D. Watkins, a key TOWS advocate,
puts it.”8
2.4.3.2 . . . to Execution
Companies often have cross-functional teams to accomplish strategic goals. Ultimately,
while the data scientists may report to a data science manager, they are usually distributed
to projects where they work with members of other business units. Assuming one goal is to
automate the production process to reduce the demand for 20% of employees, a cross-func-
tional team of factory workers, data scientists, and other stakeholders will be created.
Teams usually meet in workshops to solve problems using design processes based on mod-
els such as those introduced in this chapter. One ingredient for success is that all stakehold-
ers are represented in workshops. If a company wants to optimize the claims process in an
insurance data science process, this workshop requires the involvement of customer repre-
sentatives who work on cases in their daily business. If a company wants to automate work
in a factory to reduce staff size, data scientists need a detailed understanding of factory
processes in general, as well as processes specific to that factory. After they gain this in-
depth knowledge, they can start creating their first hypothesis on how the factory could
produce the same or even greater outputs with fewer people. Here, again, data scientists
need to be paired with domain experts. The input of those who are working with processes
daily is invaluable; Without them, workshop conclusions are merely based on speculation.
This is why avoiding the common mistake of representing domain experts by a manager
with no detailed domain experience is crucial.
Companies will approach execution differently based on their experience and level of data
maturity. Early in their data journey, companies might start by making results measurable.
Let’s look at the goal of reducing CO2: To which level of detail does a company already know
how much CO2 they produce? Are the reports based on analytical models that estimate the
output based on the industry’s size? Are there more ways to measure the actual production?
Could more refined measurement processes be used to more clearly show which exact pro-
cesses in the value generation chain produce CO2?
Many companies hope for low-hanging fruits during these workshops, and sometimes this
is possible. Especially in day-to-day business, there may be easy-to-implement changes that
bring immediate benefits. This is especially true thanks to modern cloud infrastructures,
which allow companies to quickly spin up data platforms to explore data. But in most cases,
it takes time to deliver value using data science. For example, it might take some time to
collect all the required data to start with explorations. The duration needed for the first
visible success can be discouraging for companies with a low budget for data science.
8
https://hbr.org/2007/03/from-swot-to-tows-answering-a-readers-strategy-question
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
2.4 Design Process Models 31
Some solutions require more engineering, some less. There is no bulletproof approach to
solve all problems. Some approaches may also require additional software and engineering
efforts. To explore changes in a factory to reduce the necessary number of staff or the
amount of CO2 emitted, for example, digital twins may give data scientists a base to exper
iment. However, digitizing an entire factory to simulate changes might take a while if a
company is at the beginning of its data journey. Contrast this with an insurance company,
for instance, which wants to explore how generative AI can help reduce the demand for
humans processing insurance claims. In this case, data scientists can start experimenting
early.
To sum it up, the more we explore the details of data science use cases, the more different
the approaches become. Teams will run into specific problems. It might take a considerable
time to get access to the required data. Existing software applications in a company might
need to be adjusted or updated to provide the data in a format the data scientists need. In
other cases, data owners must be convinced to release the data. Occasionally, the informa-
tion security department requests a lot of details about the proposed data science use cases,
slowing down the speed of exploration. Getting everyone to collaborate sometimes requires
convincing and negotiation skills, especially if the “data culture” has not yet been estab-
lished.
Companies that are more advanced in their journey will integrate AI applications faster. It
will become more natural for them to use machine learning to detect and react to outliers.
They may use various algorithms to predict customer behavior and improve sales processes.
Still, all mature data companies trust facts derived from data because, ultimately, without
evidence-based approaches and critical thinking, exploring data would make no sense.
Exercises
What does “management talk,” including phrases like “being the leader in
AI in our industry” or “being the most popular employer in the region”
stand for? Imagine your job is to introduce data science in a company, and
you are confronted with vague or nonexistent goals: how would you deal
with that?
Look up “data maturity” or ask a GenAI bot about this topic. Try to learn
about different maturity models that define how mature an organization is.
Where do you see your organization, according such a model?
Some employees might not see how they could benefit from a well-defined
data strategy. How would you ensure that these employees will still collab-
orate in strategy design workshops?
How would you describe the culture in your company? Are decisions made
top-down from management to ground floor, or does your company have
flat hierarchies? Do ideas for data projects actually flow from the bottom
up, pushed by data and tech enthusiasts?
During a workshop, how would you handle it if some participants were not
as open-minded as the others?
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
32 2 The Alpha and Omega of AI
Identify domain experts who have worked for many years in a field that is
unfamiliar to you, and ask them detailed questions about their job.
Do you think it makes sense to include in a workshop team someone who
only moderates, and who is completely impartial about results? What could
be the pros and cons?
What would you recommend to companies with a low budget for data
science? Is it good to keep going with a low budget? Would they be better
to avoid investing in data entirely, given that working with a low budget will
likely not yield results at all? Or should they become more risk-friendly,
and try to raise investment money to speed up their data transition?
If you are working for an organization, are you aware of your company
goals? Do they fulfill the SMART criteria?
Do you agree with the highlighted statements in the text “the critical ele-
ment in the AI era is that AI or data science can be decisive in reaching
almost every organization’s goals” and “the question is not whether AI and
data can help but whether the company is ready to utilize AI and data”
In the beginning of the chapter, we mentioned that skeptics could perceive
data transitions as expensive with a risk that many use cases fail. Even
those who promote that a company shall transition to become data-driven
might not deny that. At the same time this chapter hints that companies
who ignore AI will likely cease to exist. What is your opinion on that? Is it
possible to answer “AI skeptics” that ignoring AI might be the bigger risk?
Workshops can be like emotional roller coaster rides with many ups and
downs
Sometimes workshops are dominated by extreme positions from ‘we change the world with
data’ to ‘no matter what we do, it will end up in a disaster’. Extreme positions are common,
and conflicts between extreme positions are frequent. The inveterate pessimism of one side
might frustrate those who are anxious to finally speed up the company’s innovation cycles
to become a leading data-driven company in their domain.
One way to resolve this is to understand the perspectives of each side. Put yourself in the
shoes of an operations manager whose job is to keep a company running and who receives
calls from angry users and managers if applications are down. In the worst cases, they
might face weekend or night shifts to get things running again. For them, too much change
can be perceived as a threat.
In parallel, you can imagine yourself as a businessperson hired to innovate the company.
You may be aware of some strategic assessments of your company and that your company is
at risk of falling behind your competitors due to the slow execution of innovation. For them,
everyone who slows down innovation risks the long-term success of the company.
The only way to break up stalemates between extreme positions is by ensuring that they try
to understand the other side’s position. Without those who push for innovation, there would
be no progress because, in the end, there are many examples in history in which visionary
thinkers changed the world against all criticism. However, sometimes visionary thinking is
just like building castles in the air, and pessimists can help us identify those ideas that can
last, and those which would be a waste of time and money to pursue.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
34 2 The Alpha and Omega of AI
Many approaches suggest taking different flexible viewpoints9, 10, 11. The idea is to see every
viewpoint as a flexible role. As a pessimist, an individual’s job is to find reasons why an idea
might fail. However, the same individual can be given a new role. After playing the devil’s
advocate, they can switch roles and attempt to come up with arguments about why the solu-
tion will succeed.
It may take some time to get used to the idea that a viewpoint is not fixed and can be
changed via role switching, but it can help break up extreme positions.
■ 2.5 In a Nutshell
Bias
Bias is a natural phenomenon, and it is human to be biased. However, it is
possible to overcome one’s own biases. If we are biased, we cannot be
objective. To be able to succeed with data projects, one needs a culture that
focuses on facts. To focus on facts, one needs critical thinking. Critical
thinking needs awareness about bias.
SWOT and Data Maturity Analyses
An analysis of a company’s data maturity and their Strengths, Weaknesses,
Opportunities and Threats may help them to define their strategy. This in
return also leads to data science projects that address the company’s most
important challenges.
Data Literacy
Data literacy is about understanding the numbers we are looking at. The
moment we understand which numbers are important, we can work with
them.
Design Process
There are many processes that can help to define data products. Double di-
amond and design thinking are just examples, which we examined in detail.
9
https://positivepsychology.com/broaden-build-theory/
10
https://en.wikipedia.org/wiki/Disney_method
11
https://en.wikipedia.org/wiki/Six_Thinking_Hats
3
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
Cloud Services
Stefan Papp
■ 3.1 Introduction
This book discusses how artificial intelligence, machine learning, and deep learning
will impact our lives. Artificial intelligence has recently become a popular topic in the me-
dia, and with good reason. The AI revolution is now a prominent talking point, and for
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
36 3 Cloud Services
a lmost everyone interested in this field, Sam Altmann has become a household name. His
company, OpenAI, is often presented as the critical enabler of generative AI. Companies like
Microsoft or Google are seen as those that will eventually bring AI to the population.
While the companies that work on the software-related aspects of AI have received world-
wide attention, the companies that produce the hardware on which AI runs, and who are
thus equally responsible for many of AI’s recent success, are the unsung heroes. That is,
until we examine the success of such companies, like Nvidia, on the stock market. But while
Nvidia have made their millions producing GPUs, the hardware requirements for AI and
Data Science go far beyond this. In data-driven approaches, data professionals explore enor-
mous amounts of data, sometimes up to petabytes. A considerable amount of hardware
resources are necessary to host and process this data.
In this chapter, we discuss how we can use cloud infrastructures for our data projects. We
will explore infrastructure as a service (IaaS) solutions that give us more freedom to in-
stall our own applications and tools we want to use. We will also look at how we could use
preselected packages through a platform as a service (PaaS) of a cloud provider.
In addition, some data science projects require expensive GPUs. Once they are
purchased, the next generation of better GPUs may be available. As a cloud gets
constantly upgraded with the latest hardware, a consumer of cloud services
never falls behind, whereas it is not guaranteed that on-premises installations
are upgraded regularly.
Operating systems and their drivers get updated on a regular basis. Linux oper-
ating systems, for example, with more modern kernels —the core functionality of
every operating systems—and drivers are often optimized for modern hardware
and provide faster throughputs with data transfers and better performance
using specific algorithms. Operating system vendors try to keep their software
backwards compatible as long as possible. This means software that ran on
older versions of an operating systems, can be executed as well on newer ver-
sions. The goal for software vendors is to serve as many customers as possible.
If the programmers optimized the software using the cutting edge features of
the latest kernels and drivers, many clients might not be able to run this soft-
ware due to policies that forbid to install the latest versions of an operating
systems. This is likely to happen as many corporate IT policies tend to be con-
servative and the companies update their servers late to newer versions as they
want to limit the amount of operating system version they maintain in parallel.
Cloud providers can use the latest drivers and OS kernel versions for their
native cloud services. They are not stuck to old operating system components
for compatibility reasons. With that they can use the latest features and provide
an overall better experience to the users.
Many experts see the cloud not just as a new technology but, rather, as an alternative busi-
ness model for procuring and managing hardware itself.1 It may not be a specific service,
such as a cloud-based database or cloud-based file storage, that attracts new clients. A cloud
provider likely wins a new client when an organization starts believing in the advantage of
renting IT resources over owning them.
Statistics show that most companies use or will use the cloud eventually.2 Although it might
still be preferable for some companies to shield operational core systems in on-premises
environments, most companies already prefer the cloud for analytical workloads. In addi-
tion, the cloud is also a perfect experimentation lab for data scientists, as they can dynami-
cally generate and decommission the resources they need.
Every cloud provider defines what cloud is slightly different.3,4,5 Some value propositions are
also directed to different target groups. For instance, accountants love to hear that they can
reduce fixed expenses as they do not want to spend much money upfront to buy a new plat-
form. A pay-as-you-go model makes things easier for them. Technical teams will see the
benefit of deploying new solutions fast and scaling them on demand.
1
https://medium.com/@storjproject/there-is-no-cloud-it-s-just-someone-else-s-computer-6ecc37cdcfe5
2
https://www.cloudzero.com/blog/cloud-computing-statistics/
3
https://docs.aws.amazon.com/whitepapers/latest/aws-overview/six-advantages-of-cloud-computing.html
4
https://microsoft.firstdistribution.com/wp-content/uploads/2021/08/Microsoft-Azure-Value-Proposition_2-1.pdf
5
https://cloud.google.com/why-google-cloud
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
38 3 Cloud Services
Before we go deeper into the functional services of a cloud provider, it is essential to high-
light the amount of data centers of cloud providers worldwide as one defining factor of a
cloud provider’s power. Customers can scale their services globally with the Big 3 — Ama-
zon, Google, and Microsoft.6,7,8 Especially for smaller cloud providers, it is hard to compete
with them as they do not have the resources to build data centers worldwide.
This value proposition is generic and can also be applied to other cloud providers, who may
have their own variants of it.
3.2.1 XaaS
The “everything-as-a-service” (XaaS) model outlines different scenarios and defines who—
that is, the customer or the cloud provider—is responsible for what. The four best-known
XaaS models are:
Infrastructure as a Service (IaaS)
Platform as a Service (PaaS)
Software as a Service (SaaS)
Function as a Service (FaaS)
6
https://aws.amazon.com/about-aws/global-infrastructure/regions_az/
7
https://azure.microsoft.com/en-us/explore/global-infrastructure/geographies/#geographies
8
https://cloud.google.com/about/locations
9
Copyright: Gartner, Inc. (https://www.zdnet.com/article/xaas-why-everything-is-now-a-service/)
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
3.2 Cloud Essentials 39
The October 2023 Gartner Quadrant reflects the state of the cloud vendor market, as seen
by this globally respected consulting firm. This analysis of the globally respected strategy
underlines the dominance of the Big 3, which is also reflected in this Nasdaq study.10
Challengers like Alibaba Cloud or IBM could catch up in the next few years. Geopolitical
tensions also create specific markets for cloud providers in particular countries. For exam-
ple, Chinese-based cloud providers might face more challenges in some countries and, at
the same time, fewer difficulties than their US-based peers in other countries.11
10
https://www.nasdaq.com/articles/the-big-3-in-cloud-computing
11
https://www.andrew.cmu.edu/user/hongs/files/Alibaba_ICS2022.pdf
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
40 3 Cloud Services
Figure 3.2 Magic quadrant for cloud infrastructure as a service (worldwide), 202312
Each cloud provider offers similar services, which may differ in detail but still solve a simi-
lar use case. Table 3.1, which represents a minimal subset of this rapidly growing ecosystem.
Table 3.1 Cloud providers and one of their reference solutions for various cloud needs.
Service AWS Azure Google Cloud
Storage Amazon S3 Azure Blob Storage Cloud Storage (GCS)
Compute Amazon EC2 Azure VM Compute Engine (GCE)
Development AWS Code Commit Azure DevOps Google App Engine (GAE)
Managed Kubernetes Amazon Elastic Kuber- Azure Kubernetes Google Kubernetes
netes Service (EKS) Service (AKS) Engine (GKE)
Database (NoSQL) Amazon DynamoDB Azure Cosmos DB Firebase
Data Warehouse Amazon Redshift Azure SQL BigQuery
Serverless Computing AWS Lambda Azure Functions Cloud Functions (GCF)
12
Copyright: Gartner, Inc (https://aws.amazon.com/blogs/aws/read-the-2023-gartner-magic-quadrant-for-strategic-
cloud-platform-services/)
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
3.2 Cloud Essentials 41
Cloud providers may differ in their specialized, new services. However, one rule of thumb is
that every long-established standard service provided to clients for years provides enough
features to satisfy the demands of ordinary cloud customers.
Storage Services
Every cloud provider offers a file hosting service to provide customers with unlimited data
storage and complementary services to access this storage through various channels. These
services are often called “object storage” to distinguish them from file systems. Object stor-
age services frequently add features, such as hosting a web page on top of this file storage.
Also, objects in an object are immutable; you cannot edit anything in object storage; you can
only overwrite it. File system services, in contrast, often have higher throughput require-
ments to read or write data than file hosting services. The core criterion for both is to keep
data availability high and costs low.
Cloud providers intermix storage solutions with other building blocks. For example, for da-
tabase solutions, the cloud provider may use its storage solution to store the data physically,
while the corresponding database solution also contains a database engine.
One quality indicator for cloud storage services is durability, which measures how healthy
and resilient your data is when stored on the cloud. The Big 3 promise 11 nines of durability
or higher. This corresponds to an average annual expected loss of 0.000000001% of objects.
For example, if you store 10,000,000 objects, you can, on average, expect to incur a loss of a
single object once every 10,000 years.
Compute
The most common use case is to instantiate virtual machines (VMs), where you may install
software as you want.
Imagine a user who wants to run some specialized software for data analysis from a vendor.
They are looking for a hardware configuration that matches their needs. They might find
specific templates for data science, such as GPU-optimized instances. The user can also se-
lect an operating system for the VM.
The most significant difference between your local data center, which a middle-sized com-
pany may run, and the big cloud providers is the configuration available. The big cloud
providers offer users the latest and most powerful GPUs in server farms, which many
smaller competitors cannot afford. When big cloud providers build new data centers in a
new country, they can use the experience of what has already worked in many countries for
years. They optimize every tiny detail, how each component talks, and how servers commu-
nicate with each other over the network.
Databases
Companies can host open-source and almost any proprietary database in the cloud. Each of
the three major cloud providers also offers its database services in the cloud. Microsoft’s
product is Azure SQL, a cloud-based version of MS SQL Server. CosmosDB is a complemen-
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
42 3 Cloud Services
tary NoSQL database. Amazon has several proprietary systems, DynamoDB and Redshift
being the best-known representatives. DynamoDB scores with NoSQL capabilities, while
Redshift is a data warehouse. Google’s solutions are BigTable and BigQuery. BigTable is the
technology that also gave rise to a famous representative of a database for Hadoop: HBase.
The cloud’s database systems support the polyglot persistence paradigm.13 Specific require
ments need specific solutions. As a result, individual storage solutions are designed to solve
one specific use case.
Metadata Storage
Data
Storage
13
https://martinfowler.com/bliki/PolyglotPersistence.html
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
3.2 Cloud Essentials 43
Development
Google provides many services around Kubernetes, the de facto standard for container
orchestration. As Kubernetes became very popular, using the creator’s cloud platform can
be a decision criterion for some companies.
Microsoft has a long history of creating software development tools and offers a good inte-
gration of cloud services into their products for developers, such as Visual Studio Code.
Directories
Microsoft uses Active Directory, which helps them attract customers who are already used
to Windows-based environments in their on-premises setup. This can be a decisive decision
criterion when a customer is already a Microsoft customer.
AWS Identity and Access Management (IAM) is a directory service that manages users
and resources for organizations that subscribe to Amazon’s cloud services. You can use IAM
to create and group users and control their access to virtually any AWS service, such as EC2
or S3.
Exercises
Get certified. In preparation for a provider cloud certification, you will learn
everything you need to work with the cloud professionally. There are also
specific certifications for machine learning and other analytics-related
skills.
Study policies, security groups, and firewall settings in-depth. You’ll thank
yourself when you’re pressed for time and need to solve an access problem
while you’re about to demo.
Experiment with a cost calculator to get a feel for expected costs of a
platform in the cloud.14 Would you recommend a company move its
on-premises architecture to the cloud via lift and shift?
Explore various cloud providers and find out where they have their data
centers. Which impact do you think it has when a cloud provider is the first
to build a data center in a specific country?
14
https://calculator.aws/#/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
44 3 Cloud Services
Two main architectural building blocks for cloud-native solutions are microservices and
serverless architectures.
Microservices
Microservices encapsulate functionality to perform a single task. Some developers already
know this idea from the Single Responsibility Principle. In a data project, such a function-
ality could be to extract data from the data warehouse and store the information as a file on
the cloud. To build microservice architectures, developers must package atomic functional-
ity in a container, which encapsulates functionality, and a mini operating system to execute
it.
Kubernetes is an orchestration tool for containers. It controls the automatic instantiation of
containers and the replacement of broken ones. Kubernetes runs are distributed on several
nodes and use two categories of services. The control plane manages processes and inter-
actions between processes. For example, new requests are received via a Rest API, and the
control plane coordinates appropriate actions. Kubernetes workers—the second category—
then execute the functionality. A pod is a unit that hosts one or more containers.
The great thing about Kubernetes is that once engineers deploy a new microservice version,
nobody needs to update software on host servers. Kubernetes also acts as a secure sandbox
for execution in case a container encapsulates malign code.
15
Copyright: Kubernetes, https://kubernetes.io/docs/concepts/overview/components/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
3.3 Infrastructure as a Service 45
Serverless Computing
With serverless computing, the cloud provider offers a service that allows users to run
functionality without worrying about managing servers. Customers only pay for computing
time; downtime is not a cost.
Serverless computing is designed for use cases that need elasticity — where the number of
events that trigger processing varies greatly. Whether massive or tiny, the requests shall be
served in time. Imagine a service that creates a monthly statistic at the end of each month.
Having a permanent running physical server that executes this job only once a month
would waste resources.
All three significant vendors have proprietary serverless systems: AWS Lambda, Azure
Functions, and Google Cloud Functions.
Exercises
16
https://www.serverless.com/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
46 3 Cloud Services
We must briefly explain hardware, Linux, and infrastructure as code to work with infra-
structure as a service.
3.3.1 Hardware
To run jobs in the cloud, data professionals must define the hardware configuration of the
services on which they execute their jobs. Even though cloud subscribers do not purchase
the hardware and can cancel cloud services anytime, configuration mistakes can still be
costly.
In many cases, while presenting potential data use cases to decision-makers, the team also
needs to estimate the OPEX costs of maintaining operational services in the final stage of
the use case. Getting the hardware requirements wrong may have an impact on that estima-
tion.
Costs can be optimized through reserved and spot instances. In short, with
a reserved instance, you commit to using compute power and prepurchase
capacity at a lower price. So, if you know that you will have a continuous
data processing load, you can cut costs.
Spot instances refer to special deals where you get compute power at a spot
price. However, the moment the price goes up and the spot price is not met
anymore, you lose access to the computation.
IaaS and PaaS depend on the hardware configuration. Let’s examine what we can configure
in the cloud.
Processors (CPU/GPU)
As devices perform computations on data, processors are often divided into their most com-
mon forms: central processing units (CPUs) and graphics processing units (GPUs).
Many engineers compare CPUs to a human brain that handles many different operations.
The key indicator of a processor’s strength is the Million Instructions per Second (MIPS),
which tells how many tasks a processor can handle in parallel per second. From 0.002 MIPS
for the first processors in 1951, peaks of 2,356,230 MIPS for modern CPUs are quoted.17 A
weak CPU configuration can, for instance, impact the data transfer between nodes if they
are CPU-bound.
The original goal of GPUs was to free the CPU from the heavy computational load required
by 3-D animations. GPUs are optimized to perform many matrix operations well, making
them perfect for neural networks. GPU processors from manufacturers like Nvidiaor Ha-
bana Labs, an Intel subsidiary, are installed on numerous cloud-based data science plat-
forms.
17
https://handwiki.org/wiki/Instructions_per_second
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
3.3 Infrastructure as a Service 47
Network
While users of cloud systems do not need to think about cabling, they still need to configure
virtual networks in the cloud to control access to network segments. One central aspect is
how to encrypt and use private channels through VPC.
One aspect that also hits data professionals is network configurations for firewalls. In many
cases, when deploying new platforms, ports might be closed, and data professionals need a
minimum understanding of how to address this.
Memory
Typical big data clusters have 64 GB to 512 GB of RAM per node, depending on their use
case. These sizes allow enormous amounts of data to be loaded into memory, and the mem-
ory size per node is also a decisive factor in how many users can work with the cluster in
parallel.
I/O
Hard disks were considered the “bottleneck” in computer systems for a long time.18 We
reached mechanical limits fast, and for years, the I/O innovation was slower than for other
hardware components. With the advent of SSDs, computer systems had new ways to in-
crease performance. For example, cloud providers offer19 provisioned IOPS configurations
with up to 256.000 IOPS. They are significantly faster than standard configuration (up to
16.000 IOPS) and, therefore, explicitly suited for databases with low response time require-
ments.
Exercises
Ask ChatGPT which GPUs are available for data science applications and
how they can affect model building. You can go into details for specific
products and ask how they differ.
Imagine you are preparing data for analytical modeling. You write Spark
jobs that use a cluster. The performance is slower than expected. Your
boss asks you for ways to speed it up. Where do you start?
Pick a network certification of your preferred cloud provider or read books
recommended to achieve the certification to learn all you can about pro-
tocol stacks.
Read the PolarFS paper20 and understand how PolarFS differs from dis-
tributed file systems like the Hadoop Distributed File System (HDFS).
18
https://en.wikipedia.org/wiki/I/O_bound
19
https://aws.amazon.com/ebs/volume-types/
20
http://www.vldb.org/pvldb/vol11/p1849-cao.pdf
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
48 3 Cloud Services
Two principles, originally from agile software development, are essential for building data
science platforms:
KISS (Keep it simple, stupid) recommends keeping systems as simple as possible. The
less complex a unit is, the easier it is to create a system that scales it.
YAGNI (You ain’t gonna need it) recommends keeping systems simple by explicitly not
building what is not needed to execute a unit’s core functionality.
In Chapter 1, we introduced Cesar, an ambitious data professional who proposes a lot of
technology to build something modern. If a technology does not fulfill a purpose, it violates
both principles.
Let’s assume we have built our services, which provide the necessary functionality most
simply. Suddenly, the number of requests increases, and the first setup can no longer han-
dle them. What can we do?
Scale-up or vertical scaling improves the system’s performance by replacing one hard-
ware configuration with a more powerful one. However, even the best hardware will not
solve a problem above a particular load.
Scale-out or horizontal scaling means distributing the load across more hardware. Ex-
perts also refer to distributed processing or parallelization.
Let’s look at how this can look in a practical example.
21
https://www.computerscience.gcse.guru/theory/von-neumann-architecture
22
https://en.wikipedia.org/wiki/File:Von_Neumann_Architecture.svg
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
3.3 Infrastructure as a Service 49
Imagine we have one node that runs one job. As the workload increases, users complain
about long wait times. Scaling up would mean replacing this node with a more powerful
node. But what if we decided to take two nodes instead of one node with more hardware
resources to scale out?
There are now two scenarios.
Load Balancing: In this scenario, there is a second component, a load balancer. The load
balancer distributes requests rather than processes them. We can add more nodes if
needed. In this scenario, all the worker nodes are entirely independent. If one node fails,
the other node will still process its jobs.
Distributed Computing: We expect the software to distribute the algorithms on multiple
nodes for processing.
The latter scenario is more complex. You need functionality that orchestrates data process-
ing on multiple machines. This includes synchronizing intermediate results and statuses
between these processing components. The software must also react to disturbances, such
as a node failure or delayed synchronization. Let’s look at how this can be done.
3.3.2.2 MapReduce
When data is processed locally on a single computer, a process loads data into memory.
Then, an algorithm is applied to the loaded data to modify it or calculate its results. These
results can be displayed on the screen or saved as a file. If the amount of data to be loaded
is larger than the available RAM, delays may occur as the data processing engine needs to
swap intermediate results to the local hard disk during processing.
Figure 3.6 indicates the additional overhead using a simplified representation of distributed
processing. We must imagine that we store all data on different nodes in a distributed sys-
tem. This mechanism means each node needs to process as much as possible until a frame-
work can collect reduced data on a few nodes. In a local environment, all this preprocessing
first on separate nodes and joining later is unnecessary.
Algorithms are executed first on the data stored on the so-called ‘data nodes.’ Then, analo-
gous to local processing, processes running on the data nodes load the data into RAM, apply
the algorithms, and cache the results. Finally, the intermediate outcomes on the individual
nodes are merged centrally in a second step. The details of this process, called MapReduce,
are described on the Hadoop page, a framework that uses MapReduce.23
In practical applications, layers abstract this type of processing and provide a unified API.
One such framework is Apache Spark, which is explained in more detail in another chapter.
Every engineer should be aware that local processing has a lower overhead. It is always
faster if the data is small enough to be processed locally.
23
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce
Tutorial.html
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
50 3 Cloud Services
Exercises
24
https://www.guru99.com/introduction-to-mapreduce.html
25
https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
3.3 Infrastructure as a Service 51
Although many services related to data analytics are consumed through cloud-based PaaS
and SaaS solutions, there are still many use cases where data professionals need to use Li-
nux-based VMs. They can be hosted in the cloud, or in some cases, data needs to be extracted
from on-premises solutions, and all a data engineer gets is a connection to the console.
Even if the DevOps or operations engineer assigned to set up systems data platforms is often
separate from a data team, speaking their language helps to complete the desired setup.
Also, data engineers occasionally go to the console to test connections using standard Linux
tools such as netstat.
Some argue that with time, fewer and fewer people will work with the command line, and
more processes will be abstracted through tools. Generative AI might, however, revive the
command line. ChatGPT gives exact answers to which Linux commands can be used to
achieve what was asked in the prompt. It is more challenging to show how to do it in a graph-
ical user interface.
Terminal Environment
Anyone who works intensively with operating systems must be familiar with the command
line. Therefore, it is helpful to choose an exemplary terminal environment. For example,
those who work with a Mac can use the application iTerm2, which offers more functionality
than the standard terminal emulator. Windows users will often use PuTTY.
Docker, in turn, is the standard for containers and can be understood as an encapsulated
environment within an operating system. After installation, the following command will
instantiate a container based on the devenv image:
Programmers unfamiliar with Docker can remember the relationship of images to contain-
ers like this: An image is like a class, and a container is like an object. That’s why we like to
use the term “instantiation”; what it means is immediately apparent to any programmer.
26
https://www.unum.cloud/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
52 3 Cloud Services
Once you have executed the command, you will have set up a virtualized environment on
this operating system, which encapsulates a mini-Linux. Windows users can also try some
Linux commands below through this docker container.
Shell
If you work with the console, you must remember many commands. A simple solution is to
search for previously used commands in the command line with Ctrl+R. For example, if you
have executed the docker run command, you can activate a search by using the keyboard
shortcut Ctrl+R. If you type run, it will take you to the last command with run. You can also
repeat the keyboard shortcut Ctrl+R to jump to earlier run usages. You can enhance this
search via hstr27, allowing you to search through your command history interactively.
It is worth looking at the available keyboard commands to avoid constantly moving the
cursor around and saving the navigation keys. Be aware that some shells, such as zsh28, a
bash replacement, might vary in detail.
Package Manager
You should install software with a package manager. Package managers retrieve reposito-
ries containing software packages usable for the respective operating system. You can up-
date these repositories via the Internet.
There are various standard package managers for each operating system and the different
Linux distributions. Here is a small selection:
27
https://github.com/dvorka/hstr
28
https://www.zsh.org/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
3.3 Infrastructure as a Service 53
The command du obtains information about disk space, which might run out. The -h param-
eter specifies that the output should be human-readable. Another command to try out is
df -hu. ChatGPT can provide more insights on how to monitor systems.
To work with several windows in one terminal, look at the tool tmux. It helps
to manage multiple terminals in parallel in the command line. It also offers
numerous additional valuable functionalities.29
29
https://github.com/rothgar/awesome-tmux
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
54 3 Cloud Services
Network
The command curl ifconfig.me can find a host’s public IP. This can be helpful for debug-
ging connection issues between servers. You should be able to connect to various servers
using an SSH key without entering a password. ChatGPT provides detailed instructions on
how to do that.
Another common practice is to test the performance of two servers with tools such as iperf3.
Again, ChatGPT can help here. Below are some network commands.
If you want to try these commands, you can type the following:
nmap localhost
nmap google.com
30
https://computingforgeeks.com/minimal-container-operating-systems-for-kubernetes/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
3.3 Infrastructure as a Service 55
As expected, the command revealed that google.com has two open ports: HTTP and HTTPS.
Trying to log in there via SSH on port 22 would be pointless. However, the situation is dif-
ferent with localhost, which has opened port 22 for SSH connections.
Use the following command to determine which services exist on the system that listens to
port 8080. Use grep to filter the output:
Figure 3.10 With grep as a parameter, all services on port 8080 are filtered
The following output shows which connections Skype uses, and you can see that some con-
nections are open:
Finally, we apply netstat to look at the address 192.168.1.0 from another side:
Process Management
Processes also need to be managed. Typically, you would terminate background processes
or change their priority.
You can use the ps command to find the currently running processes. With grep, you can
filter the output for specific terms.
This way, you also get the process ID. You can terminate processes with kill.
kill <pid>
Rights Management
To ensure that the data access permissions are correct, you should be familiar with com-
mands such as chmod and chown, which change access rights and file owners.
It is also helpful to know commands like useradd, passwd, and gpasswd to create users and
generate passwords in test environments.
Vim
A text editor like Vim is essential for any engineer working with open-source platforms. You
usually connect to servers via SSH and work on the console.
What makes this editor unique is the vast number of shortcuts. It takes a long learning
curve and constant editing, but someone who gets used to working with Vim and memo-
rizes the most critical shortcuts will achieve results faster than with a GUI-based editor.
Going into the details of Vim is out of scope. There are a lot of excellent tutorials on the in-
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
3.3 Infrastructure as a Service 57
ternet. You should get familiar with the editor nano if Vim is not installed and installing
software is blocked.
Below, we summarize some activities for those wanting to explore this domain further.
Exercises
A central principle in the data science environment is reproducibility. Thus, a result with-
out the possibility of third-party verification is not meaningful from a scientific point of
view.
Reproducibility in data science projects also means enabling an independent party to create
infrastructures to perform experiments at the push of a button. Configuration drift, the
manual modification of existing infrastructures, can influence query results.
31
https://linuxunplugged.com/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
58 3 Cloud Services
try:
CreateNetwork(CIDR)
for i in range(6):
CreateServerinstance(i)
print(“Created 6 instances!”)
except(PlatformCreationError e):
print(“error”)
You would have to extend these few lines of code significantly to cover numerous excep-
tional cases. This excerpt describes only a tiny part of the possible problems during the
creation of an infrastructure:
The cloud provider needs more resources for new infrastructures, or building the target
system would exceed quotas.
The service account, which you use to execute the automation tool and to create infra-
structures, lacks the rights to create individual infrastructures on the target platform.
Elements that an automation tool wants to create already exist.
Also, tools must query variables that describe individual configuration parameters. Below
is a small excerpt of such variables:
The resources (CPU, RAM, disks, etc.) of compute server instances
The CIDR ranges of networks
The configuration of the encryption of data and services
Firewall rules and access configurations
Load balancers and their configuration
To program a comprehensive solution, you must write many routines to read and evaluate
the parameters. The complexity of your code would rise exponentially with the elements in
your infrastructure. At some point, it would be challenging to read. For this reason, a ready-
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
3.3 Infrastructure as a Service 59
made solution such as Terraform is a good choice. To install Terraform, follow the instruc-
tions on the manufacturer’s page.32
In the first example, we download a Docker image and use the heredoc notation.
If you are using Terraform 0.14 or higher, you also need to add the following file.
The next step is to use the declaration defined as the basis for a rollout. Three commands
are essential here: init, plan, and apply. You must execute them in the main.tf file direc-
tory.
terraform init
After calling the init command, Terraform creates a local repository in a subdirectory in the
execution directory. Terraform looks in the local folder for a main.tf file. That file expects the
infrastructure configuration that defines what it needs to build. Terraform also looks for a
declaration of a target platform, such as AWS, Azure, or GCP, in *.tf files. In our specific
case, Terraform will determine the need for Docker. In our example, Terraform will load a
provider plug-in for Docker locally into the repository.
terraform plan
This command verifies that the declarations in the *.tf files are syntactically correct, and
Terraform can execute it. Terraform prepares an execution plan but has yet to execute com-
mands against a target system.
terraform apply
You can trigger a rollout by calling the terraform apply command. Terraform will try to
create all required infrastructure elements and return an error if this is not feasible. Fig-
ure 3.13 shows a successful rollout. Finally, you can check the status with terraform show.
32
www.terraform.io
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
60 3 Cloud Services
You can use the following command to verify the correctness of the deployment and that we
have deployed a ghost image:
docker image ls
If you look at the run directory, you will find the subdirectory .terraform, where all loaded
provider plug-ins are stored. A plug-in encapsulates the functionality to control the rollout.
Alternatively, you can list providers via the following command:
terraform providers
Having already declared a Docker image, we declare a cloud resource of the provider plug-in
for AWS, as shown below. The block in this example itself is minimalistic. Only one variable
determines that a CIDR block is declared. There are many other configuration options for
this element. 33
33
https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/vpc
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
3.4 Platform as a Service 61
terraform destroy
Besides Terraform, other tools such as Ansible, Chef, or Puppet are often used in conjunc-
tion with Terraform. They have a different purpose and are usually called configuration
management tools. While Terraform is used to create infrastructure, these tools install
software automatically. So, if you wanted to roll out a vanilla Kafka installation on the cloud,
you would first roll out virtual machines on a cloud provider using Terraform and install
Kafka on them using Ansible, Chef, or Puppet.
Exercises
Every cloud platform provider offers many services for their clients in many different cate-
gories. Artificial intelligence and data science services are just one category of many other
service types. It is almost impossible to describe all of one provider’s AI and data services
in a book. With the quick development of these services, the book would be already outdated
when published. The services can be specific, addressing concrete problems connected with
using AI, such as cloud providers offering services specifically for computer vision, fraud
detection, or predictive maintenance. Other services target developers to provide them with
a platform with a wide range of tools to build their AI service.
In this book, we focus on the three major cloud providers: AWS, Azure, and GCP; it is essen-
tial to highlight that many other vendors also offer suitable solutions.
Cloud providers offer commonly native and external (or marketplace) solutions. The first
category is the artifacts that the cloud providers create themselves, and many of these ser-
vices are known to a wide range of data professionals, such as AWS Redshift or Azure SQL
Server. Third-party companies provide their clients with solutions through the cloud pro-
vider in the second category. Marketplaces also contain IaaS solutions, such as specific
virtual images for particular purposes, but more common are platforms. For instance, DMBS
developers, such as MongoDB Inc., provide their product MongoDB as a marketplace service
on the cloud for their clients.
All providers offer cloud-native suites of PaaS solutions for machine learning, data analyt-
ics, and artificial intelligence workloads. In this section, we summarize the most common
solutions. There are, however, pages for each cloud provider to dig deeper.34,35,36
3.4.1.1 AWS
Amazon SageMaker
Amazon SageMaker enables data scientists to build, train, and deploy machine learning
models. It facilitates the development of high-quality models by removing the heavy lifting
from each step of the machine-learning process. It provides multiple built-in algorithms,
one-click training, and model-tuning capabilities and supports deploying models at scale
with zero downtime.
AWS Glue
AWS Glue is an ETL service that cost-effectively categorizes, cleans, enriches, and reliably
moves data between various data stores. Data professionals use it when preparing and
transforming data for analytics and machine learning. AWS Glue provides a serverless en-
vironment that scales automatically to meet the processing needs of their data jobs.
34
https://aws.amazon.com/solutions/ai-ml/
35
https://azure.microsoft.com/en-us/products
36
https://cloud.google.com/products
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
3.4 Platform as a Service 63
Amazon Redshift
Amazon Redshift is a fast, scalable data warehouse that can analyze data across data ware-
houses and lakes. Redshift performs exceptionally using machine learning, massively par-
allel query execution, and columnar storage on high-performance disks. It integrates with
data lakes, allowing data professionals to perform unified analytics across a complete data
landscape.
Amazon Athena
Amazon Athena is an interactive query service that analyzes data in Amazon S3 using SQL.
As a serverless component, data professionals do not need to manage infrastructure for
Athena, and they pay only for the queries they run. This makes it a simple yet powerful tool
for analyzing large-scale datasets without requiring complex ETL jobs to prepare data for
analysis.
Amazon Comprehend
Amazon Comprehend is an NLP service that uses machine learning to uncover insights and
relationships in text. It provides sentiment analysis, entity recognition, topic modeling, and
language detection capabilities to organize and categorize text data efficiently.
Amazon EMR
Amazon EMR is a platform that runs and scales big data workloads such as Apache Spark,
Hive, and Presto. Data professionals can execute jobs on various infrastructure services,
such as Amazon EC2.
Amazon Forecast
Amazon Forecast uses machine learning to deliver highly accurate forecasts. It automati-
cally discovers how product features, holidays, and weather impact your business metrics to
make informed decisions. Forecast requires no machine learning experience to get started
and can be used to forecast sales, inventory levels, and operational demand.
Amazon Rekognition
Amazon Rekognition helps to identify objects, people, text, scenes, and activities in images
and videos and detect inappropriate content. Rekognition also provides highly accurate fa-
cial analysis and facial search capabilities that clients can use to detect, analyze, and com-
pare faces for various user verifications, people counting, and public safety use cases.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
64 3 Cloud Services
3.4.1.2 Azure
Azure Databricks
Azure Databricks is an Apache Spark-native analytics platform optimized for Microsoft
Azure cloud services. Designed with Apache Spark’s founders, it integrates with Azure
services to provide a one-click setup, streamlined workflows, and an interactive workspace
that enables collaboration between data scientists, data engineers, and business analysts.
Azure Databricks supports various data analysis and machine learning tasks, making it a
powerful tool for big data processing and analytics.
bots that can naturally interact with users through websites, apps, Cortana, Microsoft
Teams, Skype, Slack, Facebook Messenger, and more. Azure Bot Services leverages Micro-
soft’s AI capabilities to create bots that can understand and interpret human needs, making
it easier to implement solutions that engage and assist users.
Azure HDInsight
Azure HDInsight is a fully managed cloud service for open-source analytics. It makes pro-
cessing massive amounts of data easy, fast, and cost-effective. You can use popular open-
source frameworks such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, and R. HDInsight also
integrates with Azure Data Lake Storage, Azure Synapse Analytics, and Azure Machine
Learning to build end-to-end analytics pipelines.
3.4.1.3 Google
Google AI Platform
Google AI Platform is a comprehensive suite for ML developers and data scientists. It offers
a managed service that facilitates the entire ML workflow, from data ingestion and prepa
ration to model training, evaluation, deployment, and prediction. It supports known ML
frameworks, including TensorFlow, PyTorch, and scikit-learn, and provides a user-friendly
interface and command-line tools.
BigQuery
BigQuery is a serverless, highly scalable, cost-effective multi-cloud data warehouse de-
signed for business agility and running analytics at scale. It offers real-time insights with
SQL-like queries executed over petabytes of data. BigQuery’s serverless architecture means
clients don’t need to manage any infrastructure or scaling, allowing them to focus on ana-
lyzing data to find meaningful insights. Integration with machine learning and BI tools
further enhances its capabilities for advanced analytics projects.
Cloud Dataflow
Cloud Dataflow is a fully managed, serverless service for stream and batch data processing.
Based on Apache Beam, Dataflow enables developers to create complex data processing
pipelines that can handle real-time and historical data at scale. It’s designed to remove the
complexity of developing separate batch and streaming data systems, providing a unified
programming model.
Cloud Dataproc
Cloud Dataproc is a managed Spark and Hadoop service that simplifies the running of
Apache Spark and Hadoop clusters in the cloud. It provides a cost-effective, scalable, easy-
to-use service for data pipelines, querying, streaming, and machine learning. Dataproc inte-
grates seamlessly with other Google Cloud services like BigQuery and Cloud Storage, en-
abling clients to process and analyze large datasets.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
66 3 Cloud Services
Cloud AutoML
Cloud AutoML allows users to train high-quality custom ML models with minimal effort and
machine learning expertise. It offers services tailored to specific tasks, such as vision, lan-
guage, translation, and tabular data predictions, making it easier for developers to incorpo-
rate AI into their applications without a deep understanding of machine learning models.
There are many external solutions for a cloud provider. Services outside of a cloud are often
integrated through so-called marketplace solutions. One example is Azure HDInsight, which
incorporates Hadoop. MongoDB also offers its solutions through every prominent cloud
provider.
This book summarizes Databricks and Snowflake, which exist on AWS, Azure, and GCP.
Both platforms are versatile and cater to all everyday use cases for data platforms—Data
Warehousing, Data Lakes, Data Engineering, Data Science, and Data Applications.
If clients work more with structured data and have more traditional analytical use cases,
Snowflake may be the better choice. Those who prefer to work on machine learning use
cases partly with unstructured data may prefer Databricks.
Vendor Lock-in
Even though Snowflake and Databricks are available on all common cloud
platforms, it is still important to highlight potential vendor limitations.
Migrating from one cloud provider to another might not be straightforward.
Databricks and Snowflake have unique features for each cloud, which might
make migration difficult. Also, when one platform is integrated well with
other cloud services, changing to another provider is complex.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
3.4 Platform as a Service 67
Vendor lock-in also affects the platform as well. If everything runs well on a
platform like Snowflake or Databricks, it requires effort to move to another
system. Also, existing solutions may have some regulations in place to make
it difficult to migrate to Snowflake or Databricks.
3.4.2.1 Snowflake
Snowflake is a multifaceted, cloud-native data platform that is reshaping how organizations
store, access, and analyze large volumes of data. This cloud-based data warehousing solu-
tion is noteworthy for its architecture and capabilities, which separate job execution from
data storage.
1. Storage Layer
The storage layer of Snowflake is built on top of cloud storage solutions like Amazon S3,
Google Cloud Storage, or Microsoft Azure Blob Storage. This allows for massive, almost lim-
itless data storage capabilities at a relatively low cost. Data is stored in an optimized, colum-
nar format, which is ideal for analytics.
2. Compute Layer
The compute layer consists of virtual warehouses, essentially clusters of computing re-
sources. These can be scaled up or down independently of the storage, allowing precise
control over performance and cost. A key feature is that multiple virtual warehouses can
operate on the same data without contention, making it ideal for concurrent workloads.
1. Scalability
The architecture of Snowflake enables seamless scalability. Organizations can scale up or
down based on their needs, paying only for their computing and storage.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
68 3 Cloud Services
2. Performance
Snowflake’s performance is a standout feature, with its ability to handle large volumes of
data and complex queries quickly and efficiently.
4. Security
Snowflake’s security is robust, with features like automatic encryption, role-based access
control, and compliance with industry standards.
5. Data Sharing
Snowflake facilitates secure and easy sharing of data sets between different users and orga-
nizations, a feature not typically found in traditional data warehouse solutions.
3.4.2.2 Databricks
Databricks, founded in 2013 by the creators of Apache Spark, is a leading platform in big
data analytics and artificial intelligence. As a cloud-based service designed to provide an
end-to-end solution for handling massive amounts of data and extracting valuable insights,
the platform is built upon a robust ecosystem that includes Apache Spark, Delta Lake,
MLflow, and more, making it a powerhouse for data processing, analytics, and machine
learning.
1. Apache Spark
Apache Spark is at the heart of Databricks. It’s an open-source, distributed computing sys-
tem offering a fast, general-purpose cluster-computing framework. Spark is designed to han-
dle large-scale data processing tasks across many nodes.
2. Delta Lake
Delta Lake provides ACID transactions, scalable metadata handling, and unifies data ware-
housing and lakes. It enhances the reliability of data lakes, which is crucial for large-scale
data analytics.
3. MLflow
Databricks’ MLflow is an open-source platform that manages the machine learning life
cycle, including experimentation, reproducibility, and deployment.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
3.5 Software as a Service 69
4. Collaborative Environment
Databricks provides a collaborative environment for data scientists, engineers, and busi-
ness analysts to work effectively together. The platform offers notebooks, dashboards, and a
workspace that supports multiple languages, including Python, R, Scala, and SQL.
2. Performance
Leveraging the power of Apache Spark, Databricks offers exceptional performance for big
data processing and analytics.
3. Scalability
The cloud-native architecture allows for easy scaling, from small datasets to petabyte-scale
data processing.
37
https://openai.com/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
70 3 Cloud Services
■ 3.6 In a Nutshell
Data Architecture
Zoltan C. Toth and Sean McIntyre
■ 4.1 Overview
As soon as companies identify a few data analytics use cases, several questions will arise:
Should we start with a simple tool like Excel, or build a fully-fledged AI solution immedi-
ately?
Do we want to manage our infrastructure or buy managed services from cloud providers
instead?
Should we use the data as is, such as CSV files or plain text files, or do we need to work to
store them in a data warehouse to enable more sophisticated analytics?
Will ad-hoc or regular analytical queries suffice for our use case, or do we need real-time
analytics?
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
72 4 Data Architecture
How much data do we have? How well do our analytical capabilities need to scale?
Do we have a good sense of what kind of answers we expect from data analytics?
To achieve data-driven excellence, analytical systems must be backed by a robust, well-func-
tioning, high-performing data architecture. The most important feature of a good data archi-
tecture is that it suits your specific needs, so that it enables you to answer business ques-
tions as efficiently as possible. In this chapter, we describe the fundamental methodological
and technological cornerstones of such architectures so that you can answer the questions
listed above.
Figure 4.1 illustrates Maslow’s hierarchy of needs, an idea in psychology proposed by Abra-
ham Maslow in 1943. This idea outlines the hierarchy of human motivations: The need at
each stage must be satisfied within the individual before the following stage should or will
be pursued.
This concept not only applies to human psychology, but also is directly applicable to a com-
pany’s journey toward data science readiness. The outline of such a data needs hierarchy is
demonstrated in Figure 4.2.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
4.1 Overview 73
You must get raw data collection right in order to build meaningful data cleansing and
integration flows.
Only when clean data is integrated into a data warehouse or a data lake, and relevant data
points are interconnected, can a company implement a meaningful Business Intelligence
(BI) and Reporting layer efficiently.
Business needs to be understood through standard BI and data analytics, in order to open
the space for Machine Learning and Advanced Analytical use cases.
Research suggests1 that AI applications like Large Language Models (LLMs) are enhanced
with well-organized data.
All in all, there is no easy way to achieve data analytics excellence. You will need to get the
basics right to build a reliable data architecture that provides maximum business value. The
following section summarizes the basic requirements for creating a data architecture for
analytical, reporting, and machine learning use cases.
Good architecture is conceptually simple, easy to operate and to modify, and fulfills the
three basic requirements: Reliability, Scalability, and Maintainability [1].
Reliability: The system is expected to keep operating correctly even in the event of partic-
ular types of faults, including hardware faults, software faults, and human error. While
even the best-designed system won’t be able to cope with every possible kind of fault,
reliable data architectures will continue operating correctly – at least to a degree – in
1
https://arxiv.org/pdf/2311.07509.pdf
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
74 4 Data Architecture
certain foreseeable failure scenarios. These can include network outages, hardware fail-
ures, system restarts, encountering unexpected data points, and faulty data ingestion or
transformation code.
Scalability: Even if a data architecture is working well today, it doesn’t necessarily mean
it will keep working reliably in the future. Businesses grow, and the amount of data col-
lected and processed grows too. In many uses cases a well-functioning data architecture
needs to be prepared for incremental growth. Fortunately, most cloud-based data services
are created with scalability in mind. On-premises (on-prem) use cases can also take ad-
vantage of scalable solutions; the most popular open-source solutions here are the Hadoop
Distributed File System (HDFS) or Apache Spark.
Maintainability: Over time, the complexity of your data architecture is likely to increase
in data-driven companies with ever-changing requirements. To keep it well-functioning,
you will need to put significant effort into keeping it maintainable. As a company’s data
readiness improves, this requirement will need to be taken care of in multiple areas: Dif-
ferent engineers need to understand the system, implement features and changes, fix
bugs, and operate the architecture continuously.
2
In Modern Data Stack terms, Data Ingestion is often referred to as Data Integration.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
4.1 Overview 75
This workflow is an abstract standard across industries and data domains; The actual imple-
mentations can contain a number of sub-steps. For example:
Retail: Bring together sales data from e-commerce and brick and mortar stores to provide
an overview of best performing regions and products.
Financial services: Combine various customer datasets and apply business logic to make
accurate risk assessments.
Logistics: Track goods movements across various waypoints in the journey towards the
destination, to measure logistics performance, detect bottlenecks, and improve network
efficiency.
Marketing: Measure marketing effectiveness across advertising platforms, demographics,
and campaigns to optimize audience reach.
Customer support: Identify quality issues of products or services because of the increased
volume of customer complaints, to make process improvements.
In order to support these standard workflows, a data architectural pattern called the Mod-
ern Data Stack has emerged in the data industry3. The Modern Data Stack is a set of data
processing and storage components, each with a specialized purpose, working together to
enable scalable data analytics for an organization. Teams build a cohesive data platform
using the principles of the Modern Data Stack in order to implement data workflows, deliv-
ering data products such as curated datasets, dashboards, and AI capabilities. Since the
pattern spans many use-cases and works at-scale, professionals across many industries can
share their experiences and learn from other industries.
3
https://www.getdbt.com/blog/future-of-the-modern-data-stack
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
76 4 Data Architecture
Throughout the rest of this chapter, we will dive into the primary stages and corresponding
components of the Modern Data Stack, listed below. These are the components that are core
for building workflows, and are recommended for any organization building out a data plat-
form.
Ingestion: Bring data from various data sources into a common data platform (Section 4.2).
Cloud Data Platform: A place to store datasets that can range in size from small to ex-
tremely large, with a scalable compute layer for further data processing and querying
(Section 4.3.2).
Transformation: Combine datasets to reliably produce curated and enriched organization-
specific datasets, for deeper data analysis (Section 4.4).
Workflow orchestration: Manage complex data pipelines by ensuring that every dataset in
the pipeline is materialized on time (Section 4.5).
Business intelligence/analytics: Last, but certainly not least, present and visualize data
with dashboards and simple or complex reports. These tools are present in virtually every
company interested in utilizing their data. Examples of dashboarding tools include Looker,
Tableau, PowerBI, Qlik, Thoughtspot, Mode, Preset, and examples of notebook tools in-
clude Hex, Databricks notebooks, and Snowsight.
In addition to these primary components, secondary components can be added to a data
architecture in order to augment and scale out an organization’s data capabilities. These
components are not explained in detail in this book, but are worth exploring. Hence, we
provide a taster of the most popular secondary components, here:
Data Catalog: Enable data asset exploration across the heterogeneous data landscape of
an organization. Data catalogs can be used to promote data governance and enable dis
covery of datasets across organizations regardless of where they are stored. Data catalogs
typically allow an organization to annotate who owns what dataset, indicating which indi-
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
4.1 Overview 77
vidual or team is responsible for stewarding the dataset. A data ownership culture is
crucial for organizations to govern high-quality datasets. Example data catalog tools in-
clude DataHub, Collibra, Alation, Atlan, SelectStar, OpenMetadata.
Data Observability: Identify and address data quality issues across the data platform, in
order to build and maintain organizational trust in data. Trust in data is easily lost and
hard to regain, and can result in inconsistent metrics and duplicative work. Data observ-
ability tools help data teams keep track of data quality across their data platform, the best
tools can even proactively alert teams of potential issues using sophisticated anomaly-
detection algorithms. Examples of tools include Monte Carlo, Anamalo, and BigEye.
Data Activation: Unlock the data produced in the data platform and activate it across the
organization. Data activation tools help organizations extract and load data within their
data platform into target systems that derive the most value from insights within the data
platform. For example, customer CRMs can be enriched with organization and prod-
uct-specific insights, which can be understood by combining various datasets within the
data platform. This is thought of as the reverse of data ingestion, and is therefore some-
times called “Reverse ETL”. (ETL – Extract, Transform, Load – will be explained in the
next section.) Examples of tools include Hightouch and Census.
Semantic Layer: Define metrics in one governed place and serve the metric data to data-
consuming applications, such as BI tools or AI applications, for organizational-wide data
consistency. The alternative is to define the metric across a variety of tools, leading to the
increased risk of inconsistent metrics and tedious duplicative work. A classic example is
to define revenue within a semantic layer, so that it can be analyzed across various orga-
nizational departments in their tools and applications of choice. Examples of tools include
dbt Cloud, atScale, and classic OLAP cubes.
Data Modeling: Plan out the abstract representation of data structures an organization
needs and validate their usefulness to the business before building them. These tools help
data architects create specifications of the physical data model before passing them off to
data engineers, and help organizations de-risk large data platform investments: Archi-
tects plan out sophisticated star schemas or data vaults and aim to future-proof their de-
signs, anticipating future business changes throughout the planning process. Examples
of tools include sqlDBM, Ellie.ai, ERWIN, and SAP PowerDesigner.
Which components of the Modern Data Stack do I need? Let’s face it,
spreadsheet analysis can get a team a long way. But, when a team or
department is ready to take the next step and invest in a data strategy, then
the primary Modern Data Stack components – cloud data platform, ingestion,
transformation, and BI – are enough to get most projects going. The remain-
ing components can then be chosen based any initial challenges which arise,
and the team’s capacity and business pressure to fix them. Is there a lack of
trust in data due to unreliable data? Research data observability tools! Is data
sprawling across many platforms, with lots of duplication and lack of owner-
ship? Maybe a data catalog is for you.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
78 4 Data Architecture
Organizations that adopt the Modern Data Stack will need to decide whether to build or to
buy products for each of the components. There are a number of options for virtually all
Modern Data Stack categories listed above, including open-source products that can be self-
hosted, products from the major cloud hyperscalers, and commercial SaaS offerings from
both startups and well-established companies. Organizations tend to answer the build ver-
sus buy question based on several factors:
how important the component is to their business,
the existing capacity to build and maintain the tool,
the cultural disposition of building or buying software,
data processing regulations, and
a return on investment (ROI) calculation, which considers engineering time versus ser-
vice cost.
One hidden challenge to be aware of in the Modern Data Stack is integration complexity: do
the selected components integrate well with each other, in order to maximize the benefit of
the entire platform? Since there are so many products to choose from in the Modern Data
Stack, each with their own product roadmap, this can be difficult to know from the begin-
ning, and could result in significant effort to maintain integrations down the line.
Although the Modern Data Stack has become a popular choice for many organizations, there
are alternatives: these include using tools developed in-house, or paying to access an all-in-
one platform that can implement the standard data processing workflow. The major cloud
proivders – Microsoft, Google, and Amazon – all have a wide range of products that can be
used together to complement or replace the Modern Data Stack. In addition, some data plat-
forms as a service (DPaaS) offerings, such as Keboola, bundle together various open-source
Modern Data Stack components.
Before we dive into the three major Modern Data Stack components, we’ll introduce the
concepts of ETL, ELT, and ETLT. This is an important prerequisite, since the choice of which
pattern will have a major impact on how data engineers implement their workflows and
select products for their desired architectural components.
Data warehouses are optimized to execute queries for analyzing data on large, well-struc-
tured datasets. An ETL workflow is a concept that supports separating the data preparation
steps into different stages, offering a clear separation of concerns for each step, like that
shown in Figure 4.5:
Extract: Raw data is extracted from different data sources and loaded into a staging area,
like a local file system, cloud storage, a distributed filesystem or temporary tables in a
database or data warehouse.
Transform: The ingested data is cleansed and aggregated, and multiple data sources are
connected into a relational format that will be ready for analytics. This operation still
happens in the staging area.
Load: The resulting cleansed and interconnected data tables are loaded into the final data
warehouse tables.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
4.1 Overview 79
With the advent of cloud-native data warehouses and data lakes, such as Amazon Redshift,
Snowflake, and the Delta Lake, data warehouses in the cloud can now store and process
arbitrarily large datasets for a reasonable price. As these systems have gained traction, tools
for transforming data on cloud-native data warehouses have emerged. A popular open-source
technology for defining these transformations, called “dbt” (data build tool), is described in
greater detail later in this chapter. Technologies like dbt enable companies to implement the
ELT pattern, where raw data is directly loaded into a cloud-native data warehouse. Once the
data has been loaded, both the data cleansing and the transformation steps take place using
SQL, the standard language for data querying and transformation in databases and data
warehouses.
Figure 4.6 shows how the ELT pattern implements the following steps:
1. Extract: Just like with the ETL pattern, raw data is extracted from source applications.
2. Load: After the raw data is extracted from source applications, it is loaded into a data
warehouse.
3. Transform: The data in the data warehouse is cleansed, standardized, and made available
for analytics using SQL.
4.1.6 ETLT
It is crucial to remember that these approaches are only high-level methods to think about
your data architecture. They offer best practices that have been proven to work for many
companies. However, each use case is different, and businesses need to tailor these concepts
to their specific needs. A modern data infrastructure will probably implement ETLT (Ex-
tract, Transform, Load, Transform). In this pattern, after executing fundamental transfor
mations, such as removing personally identifiable information, the ingested data gets loaded
into a data lake or a cloud-native data warehouse. Subsequent transformations are then ex-
ecuted in the data lake or warehouse, resulting in datasets ready for reporting and further
analysis.
The number and variety of sources from which businesses may capture data is virtually
unlimited. Here we will discuss some of the most widely used data source types.
The means of communication for data exchange with third-party services is called a REST
API. Technically speaking, a REST API is an API endpoint supporting HTTP commands,
which help the systems upload, retrieve, and manipulate datasets. Another good example of
a REST API is Twitter’s API: Computer programs can connect to Twitter through this pro-
grammatic API to retrieve and publish tweets and search for hashtags.
Some other popular technologies communicating through REST APIs include:
Google Analytics, a website analytics service: You can retrieve website visit-related re-
ports programmatically for your website.
Google Ads: Download ad performance data and manage ads.
Instagram: Download post data such as likes and comments and ad performance data.
Forex brokers: Download data such as market data and historical data for buy and bid
prices of different currency pairs.
Facebook: Get data about you and your friends, publish status updates, retrieve posts from
businesses, and so on.
Skyscanner: Search flights programmatically.
Yahoo finance: Retrieve information about stocks and quotes.
News APIs: Access headlines and articles from various news sources.
Beside these examples, a number of services provide streaming data sources, allowing data
to be consumed real-time:
Binance: Stream real-time financial market data, including cryptocurrency prices and
trades.
Twitter API: Provides access to real-time or historical tweet data.
Twilio Stream: Real-time streaming of events for communications data, including SMS
and voice calls.
While building a data architecture, we tend to think about high-level concepts such as third-
party data sources, data warehouses, and data stored in ERP systems. Some of our data is
probably stored in plain files; let them be on the company’s servers, in a Cloud Object Store
or the Data Lake (covered in Section 4.3.2). Picking the right file types for different kinds of
data can be essential in building data infrastructure, as the performance and ease of use of
working with different file formats vary significantly. Event data, such as clickstream data
in online services, are often stored in plain text-based “log files” or files containing JSON
objects. Tabular data might be stored in legacy file formats, such as CSV files, as well as in
modern, high-performance file formats, like Parquet. In the next section, we’ll give an over-
view of the main characteristics, advantages, and disadvantages of these formats.
As you can see, the webserver recorded an HTTP request on 6 October 2021 where a visitor
searched for the term datascience on the website. These events are written to a file, each
event in a new line.
Advantages of plain logs:
Very easy to capture events: no need to structure the data at the time of writing, which
enables fast writes.
Easy to read for the human eye.
Disadvantages:
It is plain text: There is no pre-defined structure in the file. If you want to integrate this
into a data warehouse, you will need to extract the relevant information from every line,
which leads to slow reads.
As no structure is enforced, this format is prone to errors.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
4.2 Data Ingestion and Integration 83
4.2.2.2 CSV
One of the most popular ways of storing tabular data is comma-separated value (CSV) files.
A CSV file consists of an optional header and a set of columns delimited (usually) by a
comma. Here is an example showing registrants of a website:
You can see straight away that there are four columns as indicated in the header line, and
the records follow in a structured way. CSVs are very popular, yet they also have several
shortcomings.
Advantages:
Probably the most popular format for storing data.
Virtually every data-processing technology supports CSVs.
Easy for humans to read.
Disadvantages:
Text-based format: This uses a lot of disk space.
Prone to errors: As an example, a single value that contains an unexpected comma or new
line character can break the entire structure.
No explicit schema: Even though the column names are usually provided in the first line,
the column data types, such as whether a column contains numeric or text values, isn’t
indicated anywhere.
No unified standard: CSV format might vary from software to software, such as using
different characters for field separators or line endings.
CSVs can’t natively store complex structures like lists or images.
4.2.2.3 XML
The Extensible Markup Language (XML) – a 25-year-old file format developed to store struc-
tured data – is still quite widespread, even though its presence is mostly fueled by old leg-
acy systems. The main goal of XML was to create a file format which can store arbitrarily
complex data records and yet is both human- and machine-readable. The XML format stores
every data element between a start tag and an end tag, and tags can be nested inside each
other. Here is a quick example for storing the registrants of a website in XML:
<registrants>
<registrant>
<registration_time>2021-10-09 15:21:02</registration_time>
<email>[email protected]</email>
<ip_address>14.3.221.170</ip_address>
<age>42</email>
<country>AT</country>
</registrant>
<registrant>
<registration_time>2021-10-09 15:24:10</registration_time>
<email>[email protected]</email>
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
84 4 Data Architecture
<ip_address>132.33.21.154</ip_address>
<age>42</email>
<country>DE</country>
</registrant>
...
</registrants>
Advantages:
Virtually every data-processing technology supports XMLs or has third-party support for
them.
Relatively easy-to-read for the human eye.
Can store complex data objects.
Disadvantages:
A very verbose text-based format: It uses a lot of disk space and is slow to process by ap-
plications.
Even though every XML element has its own name, no explicit schema is defined.
Going out of fashion: Data applications tend to replace XML-based data formats with more
modern ones like JSON or Parquet.
4.2.3.1 JSON
JavaScript Object Notation (JSON) is an open-standard file format for storing structured
data. Though it originates from JavaScript, a popular language for web applications, the
JSON format is language-agnostic. It is a popular format, easy to write and read, and
well-supported in virtually every data processing technology out there. . JSON offers a straight-
forward way to store key-value pairs and lists in a text-based format. A sequence of JSON
objects – called JSON lines – is usually used for storing multiple records in a single file. A
simple example of a JSON lines file would look like this:
Advantages:
Most data-processing technologies support JSON; it is one of the standard formats of stor-
ing data today.
Relatively human readable.
Able to store complex data objects.
Flexible: No need to pre-define a fixed data schema
Disadvantages:
A text-based format: It uses a lot of disk space, and it is slow to process.
Although every JSON element has its own name, the JSON format doesn’t define an ex-
plicit schema.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
4.2 Data Ingestion and Integration 85
4.2.3.2 Parquet
Parquet was developed to address many of the disadvantages and limitations of the file
formats mentioned above. It is a highly efficient, open-source file format optimized for high-
speed data analytics, and works well with large and complex data sets, too. Parquet is widely
adopted, and is often considered the go-to format for cloud-native data warehouses, data
lake technologies, and big data tools.
Parquet boasts several features which support high-performance analytical capabilities at
scale. For example:
Explicit schemas: Parquet stores a metadata record in the data file itself, which defines
the structure of the data it contains. There is no need to read through a whole Parquet file
to infer its schema.
Compressed format: Instead of storing the data as text, such as CSV and JSON, Parquet
uses data encoding algorithms to store the data in compressed format. This compression
not only increases processing speeds, but also decreases cloud computational costs, since
many cloud technologies charge based on the amount of data scanned per query.
Columnar storage format: The Parquet file stores data in columns instead of rows4. As a
result, querying and filtering data takes less time compared to traditional file formats, as
the Parquet reader doesn’t need to read through complete rows to extract values for indi-
vidual columns.
Stored internal statistics: Batches of the data are stored in blocks, and Parquet calculates
and stores statistics about the records contained in these blocks. These statistics enable
the Parquet reader to skip reading irrelevant blocks of the Parquet file when filtering
data.
One caveat when working with Parquet is that you won’t be able to manipulate records. The
same concepts that make Parquet highly performant for data analytics hinder its ability to
update or delete records quickly. However, as you’ll see in the next pages, some solutions,
like Lakehouses and the Delta Lake format, mitigate these limitations.
4
To learn more about the columnar storage format, visit https://parquet.apache.org
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
86 4 Data Architecture
Advantages:
Database-like features and guarantees.
Many advanced features like Time Travel and Schema Evolution.
Disadvantages:
Somewhat slower than Parquet when working with small datasets.
Delta Lake is only fully supported by two technologies: Apache Spark and Databricks.
Depending on the use case, you may have a plethora of data sources and file formats to work
with. Some of the basic file formats, like plain log files, Excel files, CSVs, or XMLs, are hard
to avoid due to their popularity and their frequent use in legacy systems. If you have the
freedom of choosing the file formats to use, picking a format that provides a built-in schema
and high-performance, like Parquet or Delta Lake, can go a long way. One of the lowest
hanging fruits for analytical purposes is to convert the data you ingest into a highly-perfor-
mant analytical format. Pick the Delta Lake if you work with Apache Spark or Databricks
and pick the Parquet format if you use other tools. Not only will these technologies make
your life easier with schema guarantees and other advanced features, but they can also in-
crease the performance of your data pipeline by a magnitude compared to using formats
like CSVs, XML, or JSON for data analytics.
Don’t forget that you don’t necessarily need to deal with these formats directly. Several
technologies, including Data Warehouses, Data Lakes, and Lakehouses, exist to help you
manage metadata, keep your data organized, or to use SQL to query the file formats dis-
cussed in this chapter. We’ll examine these next.
■ 4.3 D
ata Warehouses, Data Lakes, and
Lakehouses
Now that we have discussed the different file formats that fuel today’s data analytics tech-
nologies, let’s look at the modern data warehouse and data lake landscape.
You might be familiar with data warehouses (often referred to as DWs or DWHs), as they
have been around since the 1980s. Data warehousing is a technology used for reporting and
data analytics. The data warehouse is a database that is implemented and organized in a
way that enables high-performance analytics. DWHs tend to use column-based storage for-
mats (formats built on the same concepts as Parquet) to enable the highest possible perfor-
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
4.3 Data Warehouses, Data Lakes, and L akehouses 87
mance for analytical workloads. The storage technologies DWHs use are usually specific to
the data warehouse technology, in order to enable tight integration between the storage and
the compute/analytics component of the DWH. Data warehouses host tables with complex
data types, but they are generally unable to store unstructured data, such as images. The
standard language to communicate with a data warehouse is SQL.
A DWH is a specialized technology for accomplishing a single task very well: High-perfor-
mant data transformation and analytics through SQL. It is probably the best solution to use
for reporting and dashboarding purposes; a well-structured DWH can calculate the results
of simple analytical queries required for business reports in well under one second. This
performance is an excellent fit for business dashboards, where dozens of reports must be
displayed within a few seconds.
From an architectural point of view, we can differentiate between types of data warehouses
along two dimensions: on-prem versus cloud-based, and cloud warehouses with or without
external table support.
do the analytics. As the workload increases, you can simply request more compute nodes be
added to the DWH cluster. Changing the cluster size takes from a few minutes up to a few
hours.
Figure 4.8 High-level architecture of a cloud-based data warehouse with external table support
As you see, integrating external tables into data warehouses solves some of the problems
traditional DWH solutions encounter. However, they still fall short on processing unstruc-
tured or semi-structured data or reacting quickly to peak load for every table. Data lakes and
the data lakehouse concept solve some of these problems. We will discuss this in the next
section.
reads and writes, schema enforcement, and schema evolution. Lakehouses are solutions
developed to mitigate these problems.
Scalability: As Lakehouses are built on the top of cloud data lakes, they scale exception-
ally well and have no problem analyzing massive data sets.
Elasticity: The compute capacity of a data lake can be drastically increased or decreased
in a matter of minutes. This way, they can react very well to peak workloads.
They can store any kind of structured or unstructured data: Lakehouses support tabular
data, images, voice recordings, videos, and several file formats like Parquet, CSV, and
JSON.
All in all, DWHs are great for workloads which demand performance supporting interactive
use-cases , like reporting and dashboarding. Lakehouses are great for more general data
science and data transformation workloads where sub-second query execution times are not
a priority, but scalability is.
demands for analytics. In such situations, pre-computing analytical results and storing
them in the cloud data platform can speed up analysis, or move the organization from a
place of reactive reporting to algorithmic response. For example, web traffic produces noto-
riously large datasets that many customer-facing organizations have to deal with in order to
understand online customer behavior. A/B testing is a common practice to make small in-
cremental changes to a website, such as upgrades to a user interface, in order to boost web
sales. With enough daily web traffic, it’s not enough to run naive database queries nor
single-machine transformations to perform the requisite analytics. Instead, large-scale data
processing techniques are necessary in these situations to process the raw data and create
summarized datasets, such as web sessions per visitor, that can be analyzed as required by
the teams who operate the A/B testing program at the organization.
It’s possible that both consistency and computational efficiency are both reasons for per-
forming data transformations. Sometimes it’s a small dataset that needs to be consistent, or
a large dataset that doesn’t need to be consistent. All of these cases are indicators that data
needs to be transformed before it is analyzed. Otherwise, perhaps data transformations are
not necessary.
So, how should data transformations be implemented? In the Modern Data Stack, a popular
answer has emerged: use SQL! This is due to cloud data platforms having first-class and
highly performant support for SQL, and SQL being a highly accessible language to teams of
data engineers, data analysts, and beyond. However, in some cases, SQL is not enough: per-
haps SQL can’t compute the required transformations fast enough, or it can’t express the
complexity of transformations. In such scenarios, complex workloads need to be performed
using specialized parallel compute frameworks, the most popular being Spark.
No matter the implementation of data transformations, there are good practices borrowed
from software engineering that are agreed upon in the Modern Data Stack:
Use source control (git): This creates an audit log of changes to transformations, allowing
for quick rollbacks of code, collaborative teamwork, and retrospective analysis of data
pipelines in case of audit.
Perform code reviews: Create an approval process and a culture of checking each others’
work, to catch errors and to reduce the team’s lottery ticket factor (that is, the level of
disruption which would occur if a teammate won the lottery and immediately quit).
Multiple environments: Create isolated development (“dev”), Quality Assurance (QA), and
production (“prod”) environments to increase the quality of production datasets and
speed of delivering bug fixes and new features.
Use continuous integration: Automate testing in a QA environment to provide fast feed-
back to developers to catch data quality issues before they reach production.
Make modular code: Apply DRY (don’t repeat yourself) principles by separating transfor-
mations into logical steps in order both test and re-use intermediate states in other trans-
formations.
Document datasets: Document while coding in order to make the work accessible and
understandable to other technical or business users (self-service).
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
4.4 Data Transformation 95
4.4.1 SQL
5
https://blog.ansi.org/sql-standard-iso-iec-9075-2023-ansi-x3-135/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
96 4 Data Architecture
Meanwhile, data teams use SQL as a data transformation language to turn raw data into
curated datasets called data warehouses in order to power organization-wide analytics.
Releasing: How are new SQL code changes integrated into a data pipeline, while minimiz-
ing or eliminating unintended consequences? This can be a difficult problem for any data
team, using SQL or otherwise. A standard best practice borrowed from software engineering
is to create multiple environments, typically “dev”, “test”, and “prod”. Developers work in
the “dev” environment where they are allowed to make any change without fear of conse-
quence, because the data is never used by end-users. Then when the developer is satisfied,
they move their code changes into the “test” environment, where quality assurance (QA)
steps and user acceptance testing (UAT) is performed. Finally, once the required change
tests pass (either by human or machine, or both), the code changes are integrated to “prod”,
and the next production run will execute the latest SQL code.
Over the years, many products and open-source tools have entered the market with the
purpose of helping teams tackle the challenges associated with such practices. Notably,
graphical tools such as Informatica PowerCenter, Wherescape, and Pentaho have allowed
teams to manage their SQL transformations with point-and-click interfaces. But since 2020,
the open-source project, dbt, has emerged as the top choice for a data transformation tool,
due to how it helps teams address the aforementioned challenges of collaboration, sequenc-
ing, and releasing, using a code-first approach.
customer.phone_number,
{{ cents_to_dollars(‘account_balance’) }} as account_balance_dollars,
customer.market_segment
from {{ ref(‘stg_tpch_customers’) }}
inner join {{ ref(‘stg_tpch_nations’) }}
on customer.nation_key = nation.nation_key
inner join {{ ref(‘stg_tpch_regions’) }}
on nation.region_key = region.region_key
Listing 4.5 SQL code that is executed on the database with the dbt build command
create or replace transient table analytics.marts.dim_customers as (
select
customer.customer_key,
customer.name,
customer.address,
nation.nation_key as nation_key,
nation.name as nation,
region.region_key as region_key,
customer.phone_number,
(account_balance / 100)::numeric(16, 2) as account_balance_dollars,
customer.market_segment
from analytics.sources.stg_tpch_customers
inner join analytics.sources.stg_tpch_nations
on customer.nation_key = nation.nation_key
inner join analytics.sources.stg_tpch_regions
on nation.region_key = region.region_key
);
A ref statement has two purposes. The first purpose is for developers to indicate which up-
stream dbt models the dbt model depends on. In most other code sequencers and orchestra-
tors, such as Airflow, a developer typically explicitly writes out the sequence of operations,
which becomes cumbersome as the list of dependencies grows. Also, in other code sequenc-
ers, the failure modes in out-of-sequence transformations are usually not immediately clear,
and can be difficult to debug. With dbt ref statements, however, the sequence of operations
is implied and maintained by the developer through the process of writing dbt models, and
the dbt build command is able to interpret the proper order of commands as a data lineage,
also known as a directed acyclic graph(DAG). This makes it possible to know the lineage of
data in advance of executing the dbt code, thereby solving the sequencing problem for most
teams. This is one of the key features that makes dbt so popular.
Figure 4.9 Data lineage in dbt is simple due to ref and source statements, and works at object-
and column-level
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
4.4 Data Transformation 99
The second purpose of a ref statement is to allow developers to write code once, and let the
dbt execution environment determine exactly which physical database object to reference at
runtime. This means the same code can be run in any environment, including “dev”, “test”,
and “prod”, just by changing environment configurations like target database, target
schema, execution database user, git branch, and environment variables. It’s a common dbt
practice for each developer to have their own development environment because they are so
cheap and fast to create, thereby improving the speed of development of the team. This key
feature helps solve the releasing problem.
Source statements are similar to ref statements, but they refer to objects in the data platform
that have been loaded by processes outside of dbt, as discussed in Section 4.2. This allows
dbt developers to manage the end-to-end data transformation process entirely within dbt,
from sources to the data consumption layer.
Within a dbt project, developers define what database objects they want to be created as
part of their data warehouse, one database object per dbt model. The default approach is
to materialize each dbt model as a table or view, by wrapping the SELECT statements in
CREATE TABLE AS or a CREATE VIEW AS statements, respectively. While this works in
many cases with smaller datasets, a dbt model can be upgraded to perform incremental
loading with a few short steps: changing the dbt model materialization setting from “table”
or “view” to “incremental”, specifying a unique key, and adding an optional WHERE clause
to the SELECT statement to identify new records (typically a timestamp or increasing ID
sequence). When dbt build executes on an incremental model, behind the scenes dbt will
create a temporary table with the new records and run a MERGE statement to insert or up-
date the new records in the target table. This lets developers focus more on the business
logic of SQL transformations, and spend less time on repeatable data engineering tasks,
which is a strong contributing factor to solving the collaboration problem, particularly when
dbt models are modular.
There is a testing framework built into dbt as a first-class construct. Developers may add
tests to their transformations in order to validate the assumptions they have about the data
they are working with. The basic out-of-box column tests developers can choose from are:
uniqueness, not null, accepted values, and relationship (foreign key). In addition to the
testing framework, dbt allows developers to define data contracts on the datasets that dbt
produces, providing a guarantee on the shape of the data, and constraints to consumers of
the datasets, which gives them confidence to build on top of that dataset. Tests and contracts
are defined in YAML configuration files within projects alongside the dbt code. Each test is
executed during a dbt build or dbt test command by converting the test into a SQL state-
ment that is executed on the database to validate the test. Contracts are enforced at the be-
ginning of build-time, to prevent any out-of-compliance data from being persisted. Tests are
recommended to be run by developers during development, when opening up a Pull Re-
quest during a continuous integration (CI) process, and during production runs. Mature dbt
teams rely on dbt tests as part of their release processes to catch errors before end-users do.
Documentation is also a first-class construct in dbt: developers may document their work in
the same YAML configuration files, and with the command dbt docs generate can create
a website providing all details about the dbt project, which can be utilized by users across
the organization. This documentation may also be persisted directly in the database as com-
ments, and exported to other tools such as data catalogs, covered more in the next section.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
100 4 Data Architecture
Listing 4.6 Example YAML file for defining tests and documentation
models:
- name: dim_customers
description: >
A customer is someone who has purchased but not
canceled their order.
access: public
columns:
- name: customer_key
description: >
Unique identifier of our customers, across all
company ERP systems.
tests:
- unique
- not_null
- name: region
tests:
- accepted_values:
values: [
‘AFRICA’, ‘MIDDLE EAST’, ‘ASIA’, ‘EUROPE’, ‘AMERICA’
]
...
Optionally, a developer may also use “dbt macros”, also within double curly braces, in order
to generate SQL statements in their dbt model, typically to re-use code or simplify their dbt
model. Macros can bridge the gap some users feel when they come from stored procedures,
to re-use logic across multiple transformations. Macros also allow developers to use for-
loops and if-statements, are evaluated at dbt compile time, and are ultimately used to help
construct SQL SELECT statements that are materialized as tables, views, or incrementally.
You can see an example of the cents_to_dollars macro in Listing 4.4.
When to use Python, when to use SQL. In addition to SQL, dbt can also
execute python code, allowing teams to seamlessly integrate their SQL
transformations and python code in the same execution sequence without
requiring any additional tooling, provided their database has a python inter-
face. dbt users often prefer to use SQL over python due to the simplicity
of SQL, however there are a few cases where python is simply better: for
example, using python libraries that have been specifically designed to
handle dates and public holidays, perform regular expression search, and
assist in data science applications. Machine learning training and labeling
can be operationalized in a dbt DAG by adding python models alongside SQL
models, and ref statements and source statements are similarly available
in dbt python models.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
4.4 Data Transformation 101
dbt Core is designed for single-developer use. As it is open-source, it can be simply installed
into a workspace and developers can get started building their first dbt transformations
within minutes, for free, as described on the dbt website6. For new dbt projects, the dbt init
command can be executed to quickly scaffold a sample project and database profile.
dbt Cloud is designed for teams and enterprises to adopt dbt at scale, with additional fea-
tures aimed at improving an organization’s dbt developer experience, deployment, reliabil-
ity and security, and reducing multi-project complexity. The features developed by the dbt
Labs team for dbt Cloud are informed by the needs of a typical data team using dbt, and
include: a web-based IDE with a simplified git interface, an orchestrator with logging and
alerting, built-in continuous integration testing, interactive documentation which serves as
an organization-wide catalog of dbt assets, APIs for partner and developer integrations, and
more. Notably, dbt Cloud stores all metadata from dbt job runs, and surfaces this informa-
tion in order to help developers discover pre-existing transformations across the organiza-
tion, and to reduce cloud data platform spend and bottlenecks by identifying costly parts of
the data transformation pipeline.
In addition to added functionality, dbt Cloud provides solutions for common enterprise se-
curity concerns, like SSO, OAuth integrations to the cloud data platforms and git providers,
role-based access control, customer support, SLAs, and the option for dbt training and pro-
fessional services.
From a data developer’s perspective, using the open-source dbt Core is a great proposition
as it is free to download, it boosts productivity, and some even consider it fun to use. This
has made dbt popular across the data industry globally as a preferred tool for data transfor-
mation. Meanwhile, as dbt adoption has organically grown amongst data developers, data
leaders are now thinking about how to best adopt dbt as a key tool in their data platform in
order to meet demanding business objectives and optimize the return on data platform in-
vestment.
6
https://docs.getdbt.com/docs/core/installation-overview
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
102 4 Data Architecture
The new standard for data transformations. dbt is popular, and dbt Labs,
the commercial entity that stewards the dbt projects and owns the dbt trade-
marks, claims that “dbt is the new standard for data transformations”. While
rooted in a group of initial users working for digital natives, dbt is steadily
growing more and more popular within established enterprises who are
adopting cloud data platforms and looking for solutions to replace and mod-
ernize their existing legacy data transformation tools.
In addition to these approaches, many data platforms now offer native dbt support, includ-
ing: Airflow, Dagster, Matillion, Fivetran, Paradime, Talend, and more. These platforms often
focus on the deployment aspect of running dbt within a data engineering pipeline, but some
do have additional features for an improved dbt developer experience. As well, other tools
similar to dbt have sprung up, such as Dataform and Coalesce.io.
Regardless of which data platform dbt projects are run on, they are most commonly stored
in an organization’s Git platform, such as GitHub, GitLab, or Azure DevOps. The advantage
of using git with dbt is to share the dbt project with teammates, track changes to the dbt
code, and to enforce code review and automated checks against new pull requests. This kind
of social governance is key to ensuring that a dbt project is high quality; for example, it can
help team members agree on how to best split apart large dbt models into smaller more
modular transformations. Git also makes it possible to define code release and code rollback
processes, borrowing the best practices from software engineering developed over the last
decades.
7
https://seattledataguy.substack.com/p/the-state-of-data-engineering-part-b61
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
4.4 Data Transformation 103
wide variety of data tasks, such as loading data into databases, transforming data, executing
data science workloads, delivering reports, refreshing dashboards, and more. dbt is focused
solely on the data transformation aspect of the data lifecycle, and so there is harmony in
executing a dbt workload from a workflow orchestration tool: it allows for dbt to be part of
the end-to-end data lifecycle owned by the data platform team, but enable a workforce to
build and execute SQL-based transformations in order to satisfy business requirements.
This embodies separation of concerns and allows more people to be part of the data lifecycle
within an organization.
Data Catalog: Once dbt has been adopted by an organization, it is common to integrate all
the valuable information within dbt to an organization’s data catalog. Data catalogs have a
natural overlap in dbt, as dbt describes what datasets are being created within an organiza-
tion and has rich information including lineage and documentation written by the develop-
ers. For example, dbt developers can tag which tables and columns contain PII within dbt,
and this information can be propagated into a data catalog, which improves an organiza-
tion’s data governance practices overall. Examples of data catalogs include DataHub, Colli-
bra, Alation, Atlan, SelectStar, and OpenMetadata.
In the earlier sections we saw that several technologies support ingesting, integrating, and
storing data. Some of these technologies, like data warehouses, come with built-in data
processing components, while others, like data lakes, don’t. In most use cases, you won’t be
able to avoid using a dedicated data processing technology in at least one stage of your data
architecture. Modern Data Stack-based approaches, such as using managed services for
extracting and transforming the data, might be sufficient for certain use cases. For more
sophisticated use cases, several traditional open-source technologies are available for sin-
gle-machine transformations and analytics, such as Linux bash utilities, the R language, or
Pandas, a Python library for data processing. Running on a single machine makes a data
architecture relatively easy to maintain; However, it also comes with several challenges:
Having all the data processing logic on a single computer makes the workflow prone to
errors. The failure of your computer will make the whole infrastructure unavailable.
As the amount of data grows, you can end up in a situation where you can no longer scale
your computer just by adding more CPU or memory. The resources needed by the data
transformation jobs might outgrow the data processing capabilities, leading to excessive
data processing times.
Even if you can process all your data on a single computer today and you have a large
margin for upgrading the hardware, you might eventually outgrow a single computer. In
such a case, it can be cumbersome to reimplement all the data processing workflow in a
technology that scales well.
Fortunately, a few data processing technologies work very well on a single computer and
then effortlessly scale to multiple computers when needed. The most versatile technology
for large-scale data processing is Apache Spark, which has become the de-facto scalable
data processing technology in the past years. Apache Spark plays a pivotal role in the Lake-
house implementations of large cloud-based data platforms like Databricks.
In the next session we will deep-dive into the architecture of Apache Spark.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
104 4 Data Architecture
At the foundation of Spark is Spark Core. Spark Core8 is responsible for supervising the
compute nodes, taking care of data passing between the nodes, and managing the execution
of distributed data processing algorithms.
One of Spark’s “unified” aspects is that it provides you a unified view of different file sys-
tems, data lakes, and external databases. Also, it gives you a standard set of commands to
work across different formats. Spark supports many file formats out of the box, such as CSV,
JSON, Parquet, and plain text files. The supported input sources include local filesystems,
HDFS, S3, Azure Blob Storage, HBase, Cassandra, and external databases through JDBC, a
standard, SQL-based database connection protocol. Several other file formats and input
sources are supported through external libraries, including XML, AVRO, Apache Kafka,
Redshift, Azure Synapse, MongoDB, and Elasticsearch.
There are three APIs for communicating with Spark:
The DataFrame API: As the highest-level of the three, the central concept of this API is the
DataFrame, which is a table-like abstraction. The DataFrame API is the most optimized,
used and developed API, and should be your standard choice when you use Spark.
The RDD API: Spark’s low-level API, which you can think of as the “assembly language”
of the system. The other APIs internally use it, but it is hard to use directly and takes
8
https://github.com/apache/spark
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
4.4 Data Transformation 105
much effort to write performant code. If you are new to Spark, you most probably want to
stay away from this API.
The DataSet API: This is Spark’s data engineering-focused API, which gives you strong
type guarantees when you develop Spark applications. This API is only accessible in Scala
but not in Python. It is used internally by Spark regardless of your choice of language. Use
this API if you need better compile-time error reporting during Spark application devel-
opment, but only if you are also an advanced Spark user: You need to have an excellent
understanding of Spark internals to use this API efficiently.
Using Spark’s DataFrame API is the best choice in most use-cases.
Spark provides an SQL API too. With the SQL API, you can write pure SQL code to process
and transform datasets. Both the DataFrame API and the SQL API are highly optimized with
the help of Spark’s internal optimizer, Catalyst.
When it comes to programming language choices, Spark supports four languages out of the
box:
Scala: MostSpark code is written in Scala, a functional programming language which
runs on the Java Virtual Machine, just as Java does. As Scala is the native language of
Spark, every Spark feature is accessible through this API. One of the advantages of using
Scala is having access to the DataSet API if type guarantees are a priority in your use case.
Java: As Java uses the Java Virtual Machine for executing code, just as Scala does, there
are many similarities between these two APIs. You might want to use the Java API if your
primary expertise lies in Java and if you don’t plan to use Spark for exploratory analysis.
Notebook environments used for data exploration, such as Jupyter or Zeppelin, don’t sup-
port Java.
Python: The Spark Python API is quite advanced. On top of accessing all the features
Spark offers through the DataFrame API, the Spark Python API has close integration with
Pandas, Python’s de-facto single-computer data analytics tool. Through the Python API,
you can convert between Pandas DataFrames and Spark DataFrames and execute Pandas
functions at scale.
R: Spark natively supports R, one of the major languages used in statistics. The R API
seems to be a bit of an outlier in the Spark ecosystem, sinceit is lagging behind in com-
parison with the other language APIs when it comes to what Spark functionality it can
access (though it has been closing the gap in recent years).
All in all, if you have a data engineering heavy use-case, you might want to consider the
Scala API, and if you are unsure which language to choose or are new to Spark, use Python.
The driver acts as the manager of the whole cluster. It processes the commands executed by
the Spark user and coordinates the distributed data processing of the executors. There is
always a single driver node in a Spark application, and it doesn’t do any distributed data
processing.
The executors are the workhorses of a Spark application. They are responsible for reading
and processing the data and exchanging any necessary information to accomplish the dis-
tributed computation. The number of executors in a Spark application has no practical up-
per limit: Facebook reportedly uses a cluster exceeding two thousand computers. The exec-
utors consist of cores, which are the basic compute units of Spark. You can think of them as
if they were CPU cores on a computer. This architecture provides two parallelization layers:
there are multiple executors in a cluster, and each executor has multiple cores. These two
layers add up to the total processing capability of your cluster; for example, three executors,
each with two cores, results in a six-core application. A six-core application means that
Spark can execute six tasks in parallel.
Reading from different data sources is also accomplished in a distributed way. Spark will
split the input source into partitions and process these in parallel. The default partition size
is around 100 megabytes, a conveniently sized piece of data for a single task to process.
These partitions are then read one by one by the Spark application cores. If the number of
partitions to read exceeds the number of cores available, partitions are put in a queue and
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
4.4 Data Transformation 107
processed one after the other. As the processing proceeds, the executors exchange data and
statistical information between each other to complete a distributed computation.
In a standard setting, the number of executors is determined upfront. When you start a
Spark application, you tell Spark how many executors and cores you need. These resources
will then be made available for you, and only then will your application be ready to use.
However, if you use Spark in the cloud or in a container-orchestration system such as Kuber-
netes, you can use Spark’s dynamic resource allocation capabilities and increase or de-
crease the number of executors while the Spark application is running.
sets, and you want to print the results to the screen or write the results to the disk, only
then will Spark read the contents of the data sources and execute the computation. This
feature is essential in a big data processing setting: your datasets might be hundreds of
terabytes in size, but only a tiny fraction of the input data might need to be read for com-
puting the result of a specific transformation.
DataFrames are immutable: In Spark, you never modify a DataFrame. Applying an opera-
tion, such as a filter, always creates a new DataFrame. This concept is harmless because
of Spark’s lazy nature: DataFrames don’t store physical data. Immutability is required for
Spark’s optimizer: it can review all the transformations to be executed and optimize them
before the physical processing starts.
Full SQL support: DataFrames can be queried, joined, and manipulated using SQL com-
mands.
To show some of the capabilities of the DataFrame API, let’s take an example of some Python
Spark code, which reads a Parquet file, filters it, and writes some records back to disk. The
example dataset we are using here is the one we introduced in Section 4.2.2.2.
01 from pyspark.sql.functions import col
02
03 registrants_df = spark.read.parquet(‘/data/input/registrants.parquet’)
04 above_40_df = registrants.filter(col(‘age’) > 40)
05 names_40_df = above_40_df.select(“registration_time”, “email”)
06 names_40_df.write.parquet(“/data/output/above_40.parquet”)
Besides its usability characteristics, the main advantage of the DataFrame API is that it
comes with Catalyst, Spark’s logical optimizer. Before an action is executed, such as the
write command in the example above, the transformation steps are optimized. Even in this
simple case, as we are using the Parquet format, the Catalyst Optimizer can push the filter
condition down to the Parquet reader and instruct the Parquet reader to read the age,
registration_time, and email columns only.
Now let’s see the same in a real-time setting, where we read incoming JSON files from the /
data/registrants folder as they appear, and we continue writing the processed email ad-
dresses to /data/output/gmail-registrants.csv:
13 from pyspark.sql.functions import col
14
15 registrants_df = spark.readStream.schema(‘email STRING’)
16 .json(‘/data/registrants/’)
17 gmail_df = registrants.filter(col(‘email’).endswith(‘gmail.com’))
18 only_email_df = gmail_df.select(‘email’)
19 only_email_df.write.csv(‘/data/output/gmail-registrants.csv’)
When comparing these two code pieces, you can see that the only difference is in line 7. We
specified that we are reading from a stream, and we had to tell Spark which values should
be picked up from the JSON file.
Many companies who ingest streaming data implement two parallel data processing pipe-
lines: one for real-time data processing, such as monitoring the number of registrations re-
al-time, and one for static processing, such as calculating the one hundred percent accurate
value of daily registrants at the end of each day when late data points have arrived. This
approach of having real-time and static data processing side by side is called the Lambda
Architecture. As you can see, the similarity between static and streaming processing of data
in Apache Spark significantly simplifies the implementation of such an architecture, as
most of the code can be shared between the static and the real-time components.
9
For a complete list of the supported algorithms please refer to the Spark MLlib documentation at https://spark.
apache.org/mllib/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
110 4 Data Architecture
As the popularity of Apache Spark has risen in the past decade, several companies have
started to offer managed Spark-based solutions. Using these solutions, you can spare your-
self the time of setting up and operating a Spark cluster. In exchange you pay an hourly fee
for the compute resources you use while using managed Spark services.
Every major cloud provider offers managed Spark Services:
Amazon Web Services has two offerings: Apache Spark on Amazon Athena, a managed
service which provides a user-friendly interactive data analytics service, and EMR, a man-
aged Spark solution where you have more control over the Spark infrastructure, and
which is more suitable for Spark-based data transformation jobs.
Google offers a service called Dataproc, with which you can use managed versions of a
number of open-source large-scale data processing tools, like Spark.
Microsoft Azure’s HDInsight offering, similarly to Google’s, offers a number of managed
data analytics and transformation tools, Apache Spark included.
It is worth highlighting a software as a service solution, Databricks. Created by the original
authors of Apache Spark, Databricks features an optimized version of Apache Spark, inte-
grated into a unified platform for data engineering and analytics. Many of the advanced and
10
https://horovod.ai/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
4.4 Data Transformation 111
high-performance Spark features are offered in Databricks before they are released as open-
source Spark code.
MLflow, a machine learning model lifecycle management and model registry solution, is
tightly integrated into the platform.11
Workflow Orchestration: Notebooks and files can be executed on a schedule, and sophisti-
cated error management and debugging tools are provided.
Orchestration tools usually come with a built-in scheduler, which executes your DAG regu-
larly. A common choice is a daily execution, which starts after midnight and executes the
whole data pipeline. The execution frequency is highly dependent on the business case:
While for many use cases, a daily schedule is sufficient, in other cases, you’ll want to use
more frequent executions to cater to regular reporting updates or data quality checks.
The essential features of an orchestration tool are:
DAG support: As basic as it sounds, supporting a DAG and managing dependencies be-
tween jobs is essential for workflow orchestration. You can get away with a purely se
quential execution of your data management jobs in the early life of a data architecture.
11
MLflow is also available as an open-source project at https://mlflow.org/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
4.5 Workflow Orchestration 113
However, as complexity grows, you will face complicated dependencies, which a good
orchestrator will help you manage.
Parallel execution: To scale your data workflow, parallel execution of non-dependent tasks
is essential.
Scheduling: The scheduler is the core of an orchestration framework. It is vital to be able
to set up different schedules for various workflows and to manage historical re-loads of
the data.
Error reporting: A functionality with often-overlooked importance is which tools are avail-
able when things go wrong. You want to see which job failed, why it failed (that is, you
want to be able to examine the output and any generated error messages and temporary
files), and which other jobs were unable to start because of the failure.
Alerting: When errors happen, you want to be alerted as soon as possible. Most tools come
with different alerting capabilities, ranging from sending emails to firing a pager or call-
ing the maintainer over the phone using an automated call provider.
Error recovery: When jobs fail, you will need to re-execute them and execute every other
job that depends on the failed job. In many cases, you will need to do this not only for the
last execution period but for a longer timeframe. Re-executing all or a subset of your jobs
for an extended period is common when errors are discovered days after they occur.
Version Control: As you develop your workflow, and especially if you do it together with
other contributors, a good version control integration, such as Git, can drastically simplify
the process of tracking changes and reverting to earlier versions of the workflow when
needed. Most popular workflow orchestration tools come with built-in version control
support.
Fortunately, there are several solutions available which check all these requirements. Some
of the commonly used are:
Apache Airflow12: Probably the most widely adopted open-source orchestration tool. It has
a rich feature set and many built-in connectors for interacting with technologies such as
Snowflake and Databricks. You implement your workflow as Python code, so some coding
experience is needed, and you can use a version control system of your preference to
track the workflow’s Python files. Airflow is open-source and self-hosted, although major
cloud providers offer managed versions of Airflow, too. You can see a screenshot of an
example airflow DAG in Figure 4.14.
Prefect13: A young orchestration framework. Open-source, and just like Airflow, Prefect
allows you to implement your workflow in Python.
Dagster14: Another fresh orchestration solution. Open-source and Python-based, just like
Prefect and Airflow.
Azure Data Factory15 (ADF): A fully managed workflow automation solution in Azure. ADF
provides a no-code workflow orchestration solution, and it has tight integration with the
12
http://airflow.apache.org/
13
https://www.prefect.io/
14
https://dagster.io/
15
https://azure.microsoft.com/en-us/services/data-factory/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
114 4 Data Architecture
services available on Azure, such as Azure Synapse and Azure Databricks. An example
DAG created in ADF is shown in Figure 4.13.
Kestra: A new-generation orchestration tool that focuses on declarative definition and
ease of use of building data pipelines.
We’d like to highlight Dagster as a data orchestrator solution that aligns well with the
principles of the Modern Data Stack. It comes as a Python-based open-source solution, but
dagster labs, the company behind Dagster, offers a cloud-based alternative, too.
The main differentiating features of Dagster:
An asset-based approach: Most orchestration solutions, such as Airflow and Azure Data
Factory, offer a task-based approach. The workflow is built by interconnecting tasks that
depend on each other, producing the data assets. In Dagster, the central object of orches-
tration is the data asset itself: Instead of defining tasks to execute, it focuses on a data-cen-
tric approach to orchestration, where workflows are designed around the movement and
transformation of data itself.
Data testing and validation: Dagster makes data testing a first-class citizen of your work-
flow, enabling testing assets as they move through the pipeline.
An excellent user interface: Dagster makes overseeing, executing, and debugging the data
pipeline easy by offering a sophisticated yet easy-to-use web interface.
Integration into the Modern Data Stack: Dagster works seamlessly with Modern Data
Stack components. It provides a plug-and-play integration of data ingestion and transfor-
mation platforms, such as Airbyte and dbt. It can also integrate, visualize, and orchestrate
the data pipelines defined within these tools.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
4.6 A Data Architecture Use Case 115
General Architecture
The general design of such an architecture is shown in Figure 4.16. The arrows represent
the data flow in the system.
$'"
$'!# %#$ "
"$$"#
The facility data is available in Excel format and is automatically uploaded to the Azure Blob
Storage. A Databricks notebook is created: it reads the facility data from the Blob Storage
and writes it into the Azure data lake in Delta Lake format. This dataset will store the facil-
ities’ data and the metadata about the sensors and metrics, but it won’t store the actual
measurements.
The measurements come from the smart meters installed in the facilities. Every smart me-
ter sends its metrics to the Sensor API component, which is managed by our operations
team. The Sensor API exposes a REST endpoint through which the Sensor Integrator compo-
nent can retrieve the measurements in JSON format.
We implement a Sensor Integrator component that connects to the Sensor API, retrieves the
sensor data for a specified time range, and uploads it into the Azure Data Lake Storage in
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
4.6 A Data Architecture Use Case 117
raw JSON format. Once finished, a Databricks job is executed to cleanse this data and write
it back to the data lake in Delta Format.
We use Delta Lake on top of the Azure Data Lake Storage for sensor and facility data. The
Data Lake Storage guarantees cross-region replication and high-throughput data read, so
our data is kept safe, and our analysis is ready to scale if needed. The Delta Lake provides
us with all the guarantees described in Section 4.3.2.2, and automatically keeps the history
of the data. This means that if we run into data problems, earlier versions of both the sensor
and the facility datasets can be recovered.
Once both the facility and the sensor data are in a Delta Lake format and cleansed, we are
ready to join them, enriching the sensor data with facility and meter-related information.
We store this final, cleansed and enriched, record-level dataset again in a Delta Table.
On top of our enriched dataset, we apply several aggregations to prepare reporting and high-
level analytics. We use Databricks and Delta Lake to ensure that it will scale well even as
our data grows.
As a next step, we integrate these aggregates into the Azure Synapse Data warehouse. Syn-
apse is connected to Microsoft’s Business Intelligence tool, PowerBI, in which we can create
all the necessary reports and dashboards. At this stage, we use a data warehouse instead of
Delta Lake because our aggregated data is small enough to use traditional technologies and
because DWHs execute queries on small datasets much faster than data lakes do. A fast
response time is necessary to make sure our dashboards load quickly.
Finally, we use the Serverless Python Cloud Function to generate an email report based on
the aggregated tables in Delta Lake. We send this to the BI analysts once a day.
This architecture would also work well for multiple customers and facilities simultaneously.
In such a scenario, we need to take care of permissions:
Specific records in Delta Lake should only be accessed by their respective owners. You
can either set up separate tables for different customers or use record-level permission
management in Delta Lake.
Multiple customers will access the same data warehouse. We can create different tables
for each customer or implement record-level permission management.
Email reports need to be sent out to multiple customers: this should be easy to implement
by simply calculating the specific statistics for each customer in the Email Report Generator.
Orchestration
Azure Data Factory (ADF) seems a good choice for orchestrating this workflow, as we are
using Azure-related technologies for data storage and processing. We create an ADF pipe-
line that defines the dependencies between our processing steps and schedules the data
load, as shown in Figure 4.17.
The first step in the pipeline is a Copy Data activity, where the facility data is copied to the
Azure Blob Storage. In parallel, we can already start integrating the sensor data with the
Ingest Sensor Data step.
Once the facility data is copied, we cleanse it and integrate it into the Delta Lake. We use a
Databricks Notebook, as Databricks has a simple-to-use Excel reader. Also, notebook execu-
tions are very easy to debug in Databricks in case things go wrong.
Before cleansing and integrating the sensor data, we must wait for all facility data to be in-
tegrated. Sensor metrics might reference meters that were recently added to the facility
Excel. In this step, we use a Databricks notebook to read the raw JSON sensor data, cleanse
it and store it in a Delta Lake format. At this point, it is essential to use a scalable technology
such as Databricks or Apache Spark to make sure we can scale our processing capabilities
as the number of measurements increase.
At this point, we have both data sources ingested, cleansed, and integrated into Delta Lake.
Next, we create the aggregated tables and write them back to Delta Lake. This step ensures
that the BI analysts will be able to read both the cleansed record level data and the aggre-
gated data directly from Delta Lake.
In the last steps, we upload the aggregated data to the data warehouse and call an Azure
Serverless Function, which checks whether the daily report has already been sent today. If
not, it will read the aggregated data required for the report and send it to the BI analysts’
per email.
Having set up our pipeline, we put it under Git version control in ADF’s management con-
sole. We also create a trigger to execute the pipeline every 30 minutes. This schedule should
be sufficient given that our service level agreement requires data to be uploaded every hour.
Alerting rules can be set with Azure Communication Services16 to ensure that our ops team
receives both an SMS and an email if a pipeline execution fails.
16
https://docs.microsoft.com/en-us/azure/communication-services/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
4.7 In a Nutshell 119
choice to perform additional transformations to prepare the data for the Customer 360 dash-
board. We trigger dbt to run after the Airbyte data loading operation.
With all the data now available in production, the BI analyst can now create the Customer
360 dashboard in PowerBI.
Requirements Analysis
In Section 4.1.2, we defined the requirements of a data architecture. Let’s see whether our
solution checks those boxes!
Reliability: All our data is stored in the Azure Data Lake right after ingestion. The Azure
Data Lake replicates our data across various geographical regions to ensure that it is kept
safe. Furthermore, Azure Data Factory, Azure Functions, and Databricks are managed
solutions with reliability guarantees. We can define restore points for Azure Synapse. We
have Git integration set up for ADF to quickly revert to our last working version if we
break our ADF pipeline.
Scalability: The number of facilities and smart meters will probably stay relatively small,
below the one million record mark, which means we don’t need to scale there. However,
the number of metrics and data points can quickly grow, so we must be sure to use a
scalable solution for processing the sensor data. The Azure Data Lake and Databricks
both scale well, so they are a good choice for this use case.
The only bottleneck in this architecture is the Sensor Integrator. Will it ingest all the
measurements in a timely fashion? The performance of this component is a risk. However,
the problem is not only on the data architecture’s side, as the Sensor API is a single ma-
chine REST API. We need to have our engineering team scale the API first. Only then can
we implement a distributed read of the sensor data with Databricks or another scalable
solution.
Maintainability: We work with managed cloud solutions that are famously easy to maintain.
ADF and Databricks notebooks have great debuggability and maintainability characteris-
tics. Maintaining Azure Synapse might take some work when we store enormous datasets.
However, in our case we only use it for aggregates, so we shouldn’t have any problems.
■ 4.7 In a Nutshell
Building a data architecture requires several decisions. Do you want to use the
cloud or stay on-prem? Are there special governance requirements? How much
data do you expect to process in the long term? What data formats do you
want to use? What’s your fault tolerance level? Let us now finally summarize
some best practices for designing and building data architectures:
Use the cloud: Cloud solutions are designed to require zero maintenance
and high reliability. Even though your service bills will be higher than the
hardware costs when running your on-prem infrastructure, your total cost of
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
120 4 Data Architecture
ownership can drastically decrease when you use the proper cloud technol-
ogies. If you don’t have a particular reason, such as governance, to stay
on-prem, don’t stay on-prem.
Use Parquet or Delta: These modern file formats outperform traditional for-
mats by a large margin. Not only will your data processes finish earlier, but
your data will take much less space to store when you use these formats.
Use Python: If you use Spark, try to start with the Python API. Python is the
lingua franca of data today, so it is your safest choice.
Data lakes and lakehouses scale very well, but they aren’t data warehouses:
Although you can use a data lake or a lakehouse as a data warehouse, well
configured traditional DWHs can outperform data lakes by magnitudes when
working with moderately sized datasets.
Consider the pros and cons of hosted big data services: Solutions like Data-
bricks add many extra performance and convenience features compared
to their open-source alternatives, but they come with a relatively high price
tag. Know your requirements to see which solutions work best for you.
Know your data size: Are you expecting only a few hundred thousand
records in total in the long term? Good for you — you won’t need a scalable
architecture. You can get away with traditional tools and a single node data
warehouse, which will be cheaper and easier to maintain.
Keep the raw data safe: Bugs in the data pipeline can lead to incorrect data
and data loss. Sometimes the only option to fix data quality issues will be to
re-execute the whole pipeline for an extended period from scratch, starting
with the raw data. Storage has become very cheap, so make sure that you
save every incoming data point in its raw format.
Keep it simple: Only use scalable tools and advanced data processing tech-
nologies if you need to. They are more challenging to maintain than simple,
single-node solutions.
Consider single-node solutions for machine learning: Single node machine
learning solutions have much better performance characteristics than their
distributed peers. Only use a distributed machine learning technology if you
must.
References
[1] Kleppmann, Martin: Designing Data-Intensive Applications. O’Reilly Media, Inc., 1005 Graven-
stein Highway North, Sebastopol, CA 95472, 2017
5
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
Data Engineering
Stefan Papp
“The more complicated the plumbing, the easier it is to stop up the works.”
Montgomery Scott
In the last two chapters, we highlighted that we need an infrastructure and a data architec-
ture to host data. Setting up an infrastructure, building data architectures, integrating new
data, and managing workflows usually overlap. Good data teams work in iterations, and they
often use agile methodologies. Data scientists might find new data sources they require for
new use cases, leading to infrastructure or data architecture adjustments.
For the sake of simplicity, let us ignore that and assume that we have agreed on the infra-
structure and data platforms to host the data. The fictional company from Chapter 1 most
likely has no strategy at all. Everything is done by improvisation. We also highlighted that
the data engineers might ignore the users’ requirements. What else could happen if we had
no data engineering strategy, like the fictive company introduced at the beginning of the
book? Here are some possible scenarios.
Too many individuals access operational databases to collect data in parallel, overloading
the source system.
Engineers do not consider the demands of analytical use cases or end users, and they
design integration pipelines that do not meet expectations (for instance, they could be too
slow).
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
122 5 Data Engineering
Engineers could also overengineer solutions. Instead of simple batch jobs that would have
met the users’ requirements, they might design a superfast streaming solution using
many new complex technologies that must be maintained afterward.
Engineers ignore data protection or other management guidelines, leading to severe con-
sequences for the company.
Multiple engineers work on the same problem simultaneously, and with time, an organi-
zation has numerous jobs doing the same things, and the engineers debate which job is
the best. Precious time is lost because people get lost in engineering details instead of
solving user needs.
In this chapter, we focus on data engineering. While it may sound simple (“What should go
wrong to move data from A to B?”), as the above scenarios indicate, there is more to it once
we understand the technical details.
■ 5.1 D
ifferentiating from Software
Engineering
Looking at the titles of the headings of this chapter from a computer scientist’s point of view,
it is evident that this part of the book contains a lot of content a reader would also expect to
find in a book on software engineering.
Suppose we simplify the definition of data engineering to its core: extracting data from
source systems and transforming it to be ready for queries to provide data insights. In that
case, the overlap in creating software becomes apparent. The logic of extracting and trans-
forming data can be abstracted through source code. Source code must be stored in reposi-
tories and often deployed to production systems, where orchestration tools trigger jobs to
run. When we agree to use source code for data transformation, we must discuss program-
ming languages, configuration management, and source code deployment.
As software engineering is a vast field, it is essential to ask about the differentiation. Many
diagrams on the internet outline how data science and other disciplines like AI or ML over-
lap. Some blogs tried to create a similar differentiation between software engineering, data
engineering, and data science in Venn diagrams. It gets even more complicated when blog-
gers add DevOps engineering, ML engineering, and frontend programming as categories.
There is a risk of ending up in a detailed technical discussion of whether specific tools
should be seen as representative of one or two domains. Therefore, we take a different ap-
proach.
Figure 5.1 outlines the data engineering lifecycle from the book Fundamentals of Data En-
gineering.1
1
https://www.amazon.com/Fundamentals-Data-Engineering-Robust-Systems/dp/1098108302
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
5.2 Programming Languages 123
Visualization can be separated from data engineering topics, like how software engineers
differentiate between backend and frontend engineers. For data visualization, we have an-
other chapter. Also, agile development practices, standards in data, and traditional software
engineering are outside the scope of this chapter. The entire enterprise integration, includ-
ing the integration into other services and user management, is another topic that may be
relevant for a data team but will also be omitted here. We focus solely on techniques to
extract, transform, and load data from A to B.
One topic that can sometimes separate engineers into two opposing parties is whether we
need programming for ETL/ELT jobs. One party might recommend using tools that provide
a drag-and-drop experience with user interfaces to design data pipelines. While this party
will most likely acknowledge that a company may need to purchase licenses for sophisti-
cated tools from a vendor, they might argue that acquiring licenses is still less costly than
hiring data engineers skilled in programming. In the end, engineering aims to automate to
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
124 5 Data Engineering
reduce the demand for a human workforce. Why should this not include the demand for
data engineers working on a project? Ultimately, if efforts can be reduced through automation
for data integration, it can be seen as an evolution and frees up data engineers to do more
in less time.
The other side will argue that the commercial drag-and-drop tools are expensive and may
lead to a vendor lock-in. They might acknowledge that 80% of the workflows can be modeled
easily with commercial tools. Still, the remaining 20% that cannot be solved with tools might
lead to more work than if engineers programmed everything independently. While you can
adjust the code to your needs, commercial tools might need complex and inefficient work-
arounds for specific use cases requiring a non-standard approach.
For programming enthusiasts, source code is the best way to express workflows. Code can
be maintained using source control systems, enabling engineers to have a version history of
every minor change. Generative AI might bring additional argumentation to this discussion.
ChatGPT can output efficient code to solve typical problems, which means engineers with
limited experience in programming may be able to solve more complex programming tasks
using generative AI.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
5.2 Programming Languages 125
While this book respects everyone who advocates automating data integration, it believes
using these tools in every case is impossible. Many cases still require the skills of engineers
who know how to program.
There are many discussions about the ideal programming languages for data projects. For
example, some programmers argue that Java is the lingua franca for data applications. Their
main argument is Java’s lengthy leadership of the TIOBE index in the past years2, a popular-
ity ranking of programming languages, and the fact that most data frameworks are written
in Java.
The counterargument against Java is that it was created as an object-oriented language in
the 90s. Although it provides a rich set of features, it is not natively designed to solve data
problems, especially as it requires a lot of boilerplate code to get simple things done.
Backward Compatibility
Programmers often love new features and to have more comfort while pro
gramming. It is hard to explain to non-programmers, but the ability to solve
problems in fewer lines of code than others or more elegantly with specific syn
tactic capabilities has a substantial positive impact on a programmer’s mood.
Suppose you have learned a new programming language such as Rust or Kotlin.
You master all the new paradigms these new languages bring, and you finally
feel that you can “express yourself more eloquently in code”. It feels like a
“punch in the face” for some programmers when decision-makers choose an
old but established programming language for a new project.
One sad thing about the data industry is that adoptions to new programming
languages or even just newer versions take time. Established programming
languages cannot break their interfaces to enable backward compatibility.
The Python community experienced what may happen if changes become too
radical. It took many years to convince some programmers finally to upgrade to
Python 3. Java 8, which was released in 2014, is still the default language for
many projects although Java 21 has been released in the meantime.
In the book Geek Way, the author advocates that being a geek is part of a
culture.3 Fast paced environments that give opportunities for experimentations
are part of it. The programming language could be seen as part of a culture.
There are many projects with different programming languages that solve data problems on
the market. For instance, R is an essential ecosystem for data scientists. Other projects use
2
https://www.tiobe.com/tiobe-index/
3
https://www.goodreads.com/book/show/59456366-the-geek-way
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
126 5 Data Engineering
purely SQL for data transformation. Courses teach students how to use Kotlin4 or Rust5 for
data science and data engineering projects. While .NET is not the first technology to come
to mind when considering data engineering, especially when using Azure, many data engi-
neers might have to deal with code that provides or consumes data from services and
applications written in C#.
This book focuses on two languages, Scala and Python, because they are often found in to-
day’s ecosystem.
Query Languages
5.2.3 Python
4
https://kotlinlang.org/docs/data-science-overview.html
5
https://www.coursera.org/specializations/rust-programming
6
https://www.builtin.com/data-science/pythonic
7
https://www.amazon.com/Fluent-Python-Concise-Effective-Programming/dp/1492056359/
8
https://www.python.org/psf/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
5.2 Programming Languages 127
Other engineers, however, claim that Python code is more challenging to debug. Especially
without compiling, programmers might find problematic code in Python later than in other
languages. In addition, an advocate of different languages may consider it more challenging
to maintain large software projects with Python as programming languages such as Java
enforce a clear structure. In other words: “Python allows you to create a mess if you want,
while the Java compiler forces you to correct every syntax error.” Another argument is per-
formance. The default interpreter, CPython, is often slower than other programming lan-
guages, and in addition, Python has language-specific bottlenecks, such as the global inter-
preter lock (GIL)9 for multithreading. A mutex (or a lock) allows only one thread to control
the Python interpreter and slows down multi-threaded applications.
Interested readers can ask ChatGPT to write code that loads CSV in a different program-
ming language and compare the results. Traditional OOP-based languages were designed as
a Swiss army knife to solve business problems. While all the principles of OOP for building
enterprise applications still have their purpose, in an agile world where you want to try
things fast, programming languages like Python have significant advantages that we will
explain in this section.
9
https://realpython.com/python-gil
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
128 5 Data Engineering
Figure 5.3 Python code to load CSV to Kafka (as generated by ChatGPT)
Readers shall decide for themselves, but simple functions such as creating a simple routine
to load a CSV file to Kafka are small and readable.
5.2.3.3 Libraries
Another reason Python is so prevalent in data is the number of libraries on the market that
solve data problems. The book Python for Data Analysis lists the following libraries:10
NumPy
pandas
matplotlib
IPython
SciPy
Other articles also add TensorFlow, SeaBorn, Theano, Keras, and PyTourch.11
It is easy to populate data structures for analytics. For example, in Figure 5.4 we need two
lines to load a custom data structure in a pandas DataFrame.
All these libraries fulfill their specific purpose, are optimized to load data into optimized
data structures and provide methods to get results. This is a considerable asset that now
gets even stronger with ChatGPT. It is possible to ask ChatGPT to show sample code for data
science. Everyone can test that by asking ChatGPT to show an example of a Fourier trans-
formation using NumPy or any similar question.
5.2.4 Scala
10
https://www.amazon.com/Python-Data-Analysis-Wrangling-Jupyter/dp/109810403X/
11
https://www.coursera.org/articles/python-machine-learning-library
12
Python for Data Analysis, Wes McKinney
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
130 5 Data Engineering
The quote above is from a book that advocates Python. Some engineers see Scala as the
primary language for data engineering. Scala is a native functional programming language,
enabling us to scale better in a distributed environment.13 Another argument for Scala is
that compilers translate code into a JVM-compatible intermediate language. On the other
hand, critics call Scala too complex compared to Java or Python. As a result, it might take a
lot of work to onboard developers who still need to learn the details about Scala.
Asking ChatGPT why a data engineer might want to use Scala will highlight more features,
such as concurrency support, type inference, strong static typing, immutability, persistence,
or expressive syntax. One scenario using Databricks can provide a concrete example.
5.2.4.2 Databricks
Section 5.6.3 explains web notebooks in detail for those unfamiliar with them. An engineer
can choose his programming language for each notebook section to run data transformation
jobs. Programmers commonly choose between Spark, SQL, and PySpark.
Many engineers use PySpark, the Spark interpreter based on Python, in Databricks Note-
books. People who prefer Scala might reason that Spark is written in Java and Scala. There-
fore, Python code in a notebook needs to be translated into JVM-compatible code under the
hood and back when code is executed as a Spark job.
The code below is a good example. Under the hood, Spark uses DataFrame objects and keeps
them in memory, distributed over many nodes in a cluster. As Spark is based on Java and
Scala, data representation in memory is in a JVM-compatible format.
Let’s focus on the collect() method. This command reads the content of the data frame
and converts it into an object in local memory. As everyone can see, such an operation
would only make sense if the object has been prefiltered so that the result also fits into a
local variable. The collectedRows is, in this case, a local variable based on a Scala object.
Of course, this code could also be written in PySpark. In a PySpark version, we also could
collect data from a DataFrame object in a Python object. However, as the source comes from
data distributed over many server nodes as JVM-based objects, this data must be serialized
to become a Python object, which adds one additional step.
Additionally, there are improvement opportunities to tune Spark using Scala, which does
not exist with Python.14 You could tell Spark to use a special serializer to keep the data on
the nodes in a more optimized format. One example is the Kryoserializer, which reduces the
memory imprint on the servers. This feature is unavailable for PySpark as the notebook
would need typed classes that are not part of PySpark.
13
https://www.youtube.com/watch?v=3jg1AheF4n0
14
https://spark.apache.org/docs/latest/tuning.html
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
5.2 Programming Languages 131
5.2.4.3 Readability
Comparing Scala with Python, however, shows the main obstacle to using this language.
Scala is far more complex and more challenging to read than Python. This is easy to repro-
duce. If you ask ChatGPT to write code to load CSV data to Kafka, it will automatically output
code in Python.
Below is the code that ChatGPT returns for Scala. Users shall decide themselves what they
consider more readable.
Figure 5.5 Loading CSV into Kafka using Scala (generated by ChatGPT)
The source code encapsulating the logic to transform data in workflows that will be exe-
cuted regularly on production systems must be stored in a repository. The moment each
change is stored in configuration management systems, it can be tracked.
5.3.1.1 Tooling
The choice of platform for hosting source code will make little difference. Whether it is
GitHub, GitLab, Azure Repos, or any other platform, what counts is that one of them is used.
Data Science Notebooks15 have become tools data engineers and data scientists use. Some-
times, these notebooks contain transformation routines triggered as regular jobs; sometimes,
notebooks are used to do ad-hoc data explorations. Data Science Notebooks are described in
more detail later in Section 5.6.3. Most of the notebooks allow the integration of a source
code repository. Not using version management for any work that requires a structured and
reproducible approach can be considered careless.
Otherwise, there are many ways to connect source code repositories to target systems
through CI/CD pipelines. Target systems are, for example, orchestrators such as Airflow or
jobs deployed for Kafka.
15
https://datasciencenotebook.org/
16
https://medium.com/code-factory-berlin/github-repository-structure-best-practices-248e6effc405
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
5.3 Software Engineering Processes for Data 133
One best practice is keeping a README.md file documenting the repository in the root
folder.
5.3.2 CI/CD
To manage data pipelines, we must maintain software versions and focus on deployment
strategies. CI/CD (Continuous Integration/Continuous Deployment) has become the stan-
dard term for continued governance over the deployment process.
In this book, we briefly introduce Jenkins and Azure DevOps. Many engineers have had
good experiences with tools like Gitlab, Travis, AWS CodePipeline, CircleCI, and any other
tool designed to deploy source code. This list shows that developers have many choices for
tools, and a book has to limit its exposure to a limited number of tools.
5.3.2.1 Jenkins
Jenkins is a CI/CD pipeline application that can be used to deploy data pipelines. It is a
build automation server written in Java. Most developers are familiar with the Jenkins web
UI and apply configuration settings or trigger builds.
Jenkins in a Nutshell
Jenkins itself communicates with numerous subsystems. For example, once connected to a
supported version management system like Git, Jenkins can track code changes.
Another subsystem is essential in the context of version management. When developers
load a new version from the repositories, we compile it with a tool like Maven or Gradle and
run unit tests to test that the code works correctly.
In larger software systems, where programmers provide multiple daily commits, Jenkins
can create nightly builds that incorporate all changes.
In addition to automating builds, we can implement other processes through Jenkins. For
example, static code analysis provides quality metrics, such as whether complexity, inden-
tation, and naming conventions are adequately executed. In larger projects, ignoring these
metrics can result in unreadable code. PyLint and SonarCube are reference examples of
static code analysis tools for Python code.
Critics, backed by generative AI bots, may call Jenkins challenging to maintain, outdated,
and too complex.
17
Copyright: R. Tyler Croy, https://www.jenkins.io/zh/blog/2016/08/10/rails-cd-with-pipeline/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
5.4 Data Pipelines 135
Three services are essential for data engineering: Azure Repos, Azure Pipelines, and Azure
Artifacts.
To outline this, the code is stored on Azure Repos, and through Azure Pipelines, this code is
deployed on target platforms. If a pipeline creates artifacts or libraries that are used at a
later stage, they can be stored on Azure Artifacts.
For data engineers working with Azure’s ecosystem, Azure DevOps integrates seamlessly
with other Azure services like Azure Data Factory, Azure Databricks, and Azure SQL Data-
base. This integration is vital for creating a cohesive and efficient workflow from data ex-
traction to insights delivery.
Performance
Users have performance expectations, and source data might be updated at different inter-
vals. Good data pipelines try to satisfy user requirements. This also should address over-
engineering. Updating data on a minute level is optional if data reports are generated only
once a day.
Robustness
When data is being extracted from source systems, many processes can fail. Robustness
means a data pipeline keeps operating even if some subsystems are down.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
136 5 Data Engineering
Idempotency
Jobs that run under specific conditions shall produce the same results if rerun under the
same conditions.
Chapter 4 introduced the Modern Data Stack. We keep using the Unified Data Infrastructure
for this example.
5.4.2.1 Sources
Every data pipeline starts with its data sources. File systems and databases are the most
common sources from which we collect data for analysis. Files often contain text-based log
files, JSON, or CSV data from shared file storage, and usually, the source of databases are
operational database systems. In some scenarios, data is delivered constantly as streams.
Streaming scenarios have different requirements than batch scenarios and require partly
different handling and software components. Other data sources might also need a middle-
ware incorporating specific APIs as large software products restrict full access to their data.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
5.4 Data Pipelines 137
Let’s examine all six aspects of the Unified Data Infrastructure depicted in Figure 5.7 sepa-
rately. The aspect of data sources already requires a lot of prework, even without establish-
ing applications to retrieve them for analytical purposes.
In many data projects, data sources for use cases must be identified. In professional envi-
ronments, nobody grabs data from sources without telling anyone. As a best practice, every
data source has an owner for the data itself and then for the platform on which the data is
hosted. The data owner must decide whether to provide data to analytical use cases. In some
scenarios, they might decline, especially when there are concerns about data privacy. The
owner of the platform on which the data is hosted might be more concerned about the oper-
ational uptime of his platform. They might be suspicious about jobs accessing the platform.
Of course, there are many processes with forms and rules a company can establish to man-
age how data teams can request access to data sources and data source platforms; however,
in many companies, the most important thing is also to have a strategy for negotiating and
arguing to convince owners to give their approval.
18
https://www.baeldung.com/kafka-connectors-guide
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
138 5 Data Engineering
cloud. If you limit the ways to access a system from the outside, the more protected it is from
hacker attacks. Therefore, one recipe is to deploy software to extract data from data sources
within the same secure zone as the data source system is hosted and push data from a se-
cure zone to the cloud. This way, nobody connects from outside into a secure zone.
5.4.2.3 Storage
Data needs to be stored in an appropriate form. In Section 5.5 all options for data storage are
reviewed from a historical perspective.
As a rule of thumb, distributed file systems are commonly used when data sources are un-
structured and/or data needs to be written fast. If you are extracting data from source sys-
tems in which the data is already in a tabular format, it makes sense to push the data to a
database as a target.
Using a fixed schema sometimes requires some transformation before writing to the target
database. This process is also called “Extract-Transform-Load” (ETL). Transformation can be
done after writing when writing to a file system. This scenario is known as “Extract-Load-
Transform” (ELT).
5.4.2.5 Transformation
Data stored on a storage platform can be transformed further to achieve specific analytic
purposes. Data modeling may be the most common type of transformation. Data modeling
aims to bring data into a particular structure. It has multiple possible targets.
Business Intelligence (BI), described in detail in a Chapter 11, transforms the raw data
layers into specific data structures mainly accessed by self-service BI tools, such as PowerBI
or Tableau. The main target is to share insights about what has happened and to be able to
drill down in detail.
Machine Learning (ML) provides a different approach. The idea is to create models from
so-called labeled data sets that predict outcomes when the model is run using input vari-
ables.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
5.5 Storage Options 139
Models for BI differ from models for ML. We refer to the dedicated chapters for the exact
application of BI and ML. From a data engineering point of view, it is necessary to under-
stand that the data needs to be transformed so these models can be built.
From a process point of view, modeling starts with writing algorithms that filter and trans-
form data from its source format and write it into a target format. Data scientists will then
test these models using various tools. If the model needs to improve, it will be refined until
the output is satisfactory.
In the second stage, jobs get automated, and engineers convert the logic created in this
exploration phase to automated, regularly executed jobs that process incoming data to be
written in a target format.
For this data transformation, we must have frameworks that allow us to process a massive
amount of data and have functions such as filtering, grouping, and mapping data. Apache
Spark is one of the frameworks that is used extensively. Apache Spark is introduced in
Chapter 4. In ML Ops, we dive deeper into how to manage models and how to retrain them.
Files can refer to any file type, but office documents such as Word documents or Excel
spreadsheets may be the most known types for storing business information. We must re-
member that these files are a product of an evolution during initial digitization efforts. For
example, hand-written or typed letters became documents, and shopping lists became
spreadsheets.
The common practice is that applications on a single computer load files from a local hard
drive, as the resources of one workstation are sufficient. Generally, a file owner does not
create that file as part of a vast data processing system or systematic querying. Thus, al-
though spreadsheets allow basic filtering and calculation, they are not designed for complex
queries that span multiple data sources.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
140 5 Data Engineering
On local file systems, business information is isolated and, in most cases, provisioned just
for the computer’s owner. In addition, files on local systems can get lost, and it is harder to
apply governance processes to them. Therefore, companies often want to prevent employees
from storing files on their local computers and use a central document management system
instead. As an additional benefit, document management systems allow the indexing of
documents to make information more transparent.
From a data processing view, business documents are unstructured or semi-structured
sources. On a central, accessible file system, data professionals can parse and explore them.
However, they differ from the common target of large-scale data science projects.
Excel Dilemma
Relational databases were the next logical evolution to enable systems to process data
systematically with complex queries. We mainly differentiate between two types of data-
bases: Transactional and analytical databases. There is also a third category called HTAP
that tries to unify both approaches, but it is outside the scope of this book.
Transactional databases were the first generation of databases to abstract the manual pro-
cessing accounts of bookkeepers. Pioneers such as Edgar F. Codd introduced normalization
processes and methods to map reality in data. On a technical level, transactional databases
focus on topics such as
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
5.5 Storage Options 141
19
https://www.snowflake.com/guides/htap-hybrid-transactional-and-analytical-processing/
20
https://towardsdatascience.com/cap-theorem-and-distributed-database-management-systems-5c2be977950e
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
142 5 Data Engineering
While a database works with structured data and enables us to apply standardized queries
to that data, a data lake is a distributed file storage-based platform. Unlike in the file era,
when referring to files of a data lake, we refer to constantly produced files such as log files.
Even if single log files can be small, they can accumulate to a massive load if there are many.
A data lake offers fast data ingest as it is often too time-consuming to convert masses of
incoming data into database-compatible structures. Chapter 4 covers topics related to data
engineering and provides more information on data lakes.
Queries on raw data in a data lake are commonly slow. Applications must determine or
validate the schema when reading the data, a principle called schema-on-read.
A data engineer writes pipelines to transform this data into faster-to-read data structures.
Data pipelines can also help reduce noise and extract only valuable data from raw data to be
processed later (feature extraction). Data processing is often now performed in iterations.
“Serverless” refers not to data storage but to how applications deal with data. Non-server-
less applications are running applications that wait for user input to process data. They
might also trigger various jobs to clean up and monitor processes. Non-serverless applica-
tions are often associated with monolithic applications. In this context, they can be consid-
ered black-box software packages that manage a whole system entirely.
With the rise of microservices and splitting up responsibilities among various small ser-
vices, we also experienced serverless databases. The idea is to avoid constantly running a
database engine waiting for user input. Furthermore, computing power (required for tasks
such as processing data) is only needed when a user triggers a request.
Serverless databases are usually the domain of cloud providers, as their goal
is to minimize idle time. However, the term “serverless” is misleading. You still
need infrastructure to accept client requests, who then, in return, trigger the
next steps. Serverless in that context means that you do not need a continu
ously running server somewhere hosted just for one service. Resources are
spin up on demand.
Suppose we work on a project that leverages multiple data sources and provides data to
other parties through various channels. We could have to deal with
a streaming data source that continuously sends data,
batch processes that are triggered by a cron job every day,
a data provider who sends data in JSON format,
another data provider sends data in custom binary format,
a considerable amount of log files is generated on the fly.
Each data channel could also have different non-functional requirements. Some data must
be processed immediately, whereas others will only be processed on demand.
There is a data storage system that best fits each scenario. For example, data provided as a
JSON document is best stored in a document database such as MongoDB. Files in a custom
binary format might be loaded on a file system and tabular data in a relational database.
We call the concept of having many different storage platforms available for various require-
ments polyglot storage.21
Please study Chapter 3 to learn more about various options for storing data and using cloud
technologies or container orchestration engines such as Kubernetes.
21
https://martinfowler.com/bliki/PolyglotPersistence.html
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
5.5 Storage Options 145
A data mesh is a decentralized network for processing data in microservices divided into
data domains. A data domain can be a department or a thematic data collection, such as
customer data belonging to a non-technical department. A data mesh could be interpreted
as an evolution of polyglot storage.
A microservice is a decoupled data service that provides a specific functionality, such as
user authentication or returning status information of a system. The services communicate
via standardized interfaces and are exchangeable without a tremendous effort as long as
their interfaces remain the same.
A service must have an HTTP address (be addressable) that can be assigned to the type of
service (be discoverable) via path information in the URL. In addition, the service must
provide appropriate security by applying state-of-the-art security techniques. Finally, the
data schema must be self-explanatory and trustworthy, which can be ensured by adhering
to company or open standards. This measure helps to solve the following three challenges:
1. Clarifying the data ownership: Who owns this data?
2. Ensuring data quality: Who is responsible for the quality of the data?
3. Avoidance of personnel bottlenecks: How can we ensure that no team is dependent on the
knowledge of one person who is not replaceable?
The data storage of the microservices is still done either in a local database in the micro
service, in the data lake, in the DWH, or a combination of the two (DataLakeHouse).
Conway’s law
22
https://www.melconway.com/Home/pdf/committees.pdf
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
146 5 Data Engineering
■ 5.6 Tooling
There are many different tools needed to implement data platforms. In this chapter, three
out of the many tools are introduced as reference technologies. They have been picked be-
cause they are frequently used in projects. There are many alternatives for each category,
and many different engineers will favor one tool over another.
Task Operators
Operators are categories of tasks in Airflow. Different operators can be used for various
tasks, such as PythonOperator for executing Python code, BashOperator for executing Bash
scripts, or SQLOperator for executing SQL commands. Custom operators can also be created
for specific needs.
User Interface
It has a web-based UI where users can manage and monitor their workflows. This interface
shows DAGs, their schedules, execution time, and logs, allowing manual task triggering and
troubleshooting.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
5.6 Tooling 147
Criticism on Airflow
Critics of Airflow highlight that DAGs that grow out of control become
unmanageable, and it is difficult to create separate environments (testing,
development and production).
Another criticism that sometimes comes up is that Microsoft technologies
libraries lack the sophistication of libraries for other cloud providers. One
aspect that might change with time as Microsoft Azure provides Managed
Airflow services.
This reminds every engineer, if they must look for a solution to also look for
challenges as every technology has them.
In short, data projects often use Airflow as a workflow orchestrator for batch ETL processes,
where complex, dependent tasks must be executed in a reliable and scalable manner.
In Chapter 4, we introduced ETL from a process view. However, it is also helpful to under-
stand how ingestion tools work. Apache Kafka has become the standard framework for load-
ing message-sized data to a data platform.
Size Matters
When we talk about message-sized data, we refer to all kinds of data that
we could load as a data set into a database row. However, plain old file copy
processes do a better job of ingesting blob data, such as files.
the load that Kafka can process. Kafka distributes data on so-called “topics”. We can imag-
ine a topic as a mailbox with an address into which we can post messages at any time. To
ensure that no message is lost, we can ensure that we replicate messages over multiple
brokers. Many messages often have one attribute that functions as a key. Unlike database
keys, which aim to be unique, we pick keys in messages to provide an optimal data distri
bution. Imagine you have datasets of customers with registered home addresses all over the
country. If you put them all in one queue, we must process all data in one queue. If the load
increases, we might run into performance problems.
The knowledgeable data engineer defines a partitioning strategy. He might use, for instance,
the zip code as a distribution key. With that, we can distribute the load. Assuming we have
ten partitions to use all zip codes starting from zero to nine, we can process incoming data
on ten different brokers on different machines.
Other factors for choosing the best possible decision factor include expected querying strat-
egies. These are out of this book’s scope, but interested readers can look up the following
links.23,24 Choosing a good key for partitioning will increase the performance in processing
distributing data.
In Chapter 3, we showed how to use Docker. Use the script below to deploy a mini-Kafka
image on a local computer using docker-compose up.
version: ‘2’
services:
zookeeper:
image: confluentinc/cp-zookeeper:latest
environment:
ZOOKEEPER_CLIENT_PORT: 2181
ZOOKEEPER_TICK_TIME: 2000
ports:
- 22181:2181
kafka:
image: confluentinc/cp-kafka:latest
depends_on:
- zookeeper
ports:
- 29092:29092
environment:
KAFKA_BROKER_ID: 1
KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092,PLAINTEXT_HOST://
localhost:29092
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_
HOST:PLAINTEXT
KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
After the Kafka service is online, we can verify that using the following command.
nc -z localhost 22181
23
https://medium.com/event-driven-utopia/understanding-kafka-topic-partitions-ae40f80552e8
24
https://newrelic.com/blog/best-practices/effective-strategies-kafka-topic-partitioning
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
5.6 Tooling 149
We can then use the Kafka binaries from the Kafka client25, which can be downloaded from
the vendor’s site, to use some commands to create topics and load data into them.
We should focus on these commands. The first tells Kafka to create a topic. The second
opens a console window to collect user input. This user input is then sent as a message to
the topic test. Suppose we imagine a physical mailbox and hundreds of letters coming in. In
such a case, it might make sense to distribute the letters into multiple mailboxes to avoid
overflowing one mailbox. Everything is fine as long as we remember which letter is in which
mailbox. If we again think of a physical mailbox, one partition strategy could be to have one
mailbox for each house inhabitant so that, for instance, Mum and Dad have separate mail-
boxes. Figure 5.9 outlines this.
25
https://kafka.apache.org/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
150 5 Data Engineering
Partitioning can be easily confused with replication. Partitioning will increase performance,
as Mum always goes to “Mailbox.Mum” to receive her letters, while Dad goes to “Mailbox.
Dad”. But what if some vandal burns down “Mailbox.Mum”? All her letters would be lost.
With replication, we ensure that every letter delivered to “Mailbox.Mum” is duplicated and
stored in the other mailboxes. To sum up, we use partitioning to increase performance by
splitting messages by keys; replication ensures that no message is lost.
We can then read from a topic again.
We introduced Kafka through one of its core functionalities as a message broker. Messages
are sent to various topics and received from consumers. This strategy allows us to qualify
the first actors in a publish-subscribe system. A producer is an entity that pushes messages
to a topic; a consumer is an entity that loads messages. A producer can be a command-line
tool or code embedded into an application, such as
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
5.6 Tooling 151
client = KafkaClient(hosts=”localhost:9092”)
topic = client.topics[‘test’.encode()]
with topic.get_sync_producer() as producer:
producer.produce(‘test message’.encode())
A deployment of Kafka in production usually contains many more services. For example,
Kafka Connect is a service that imports or exports data from an external data source. You
could create a MongoDB-Connector, for instance. Once a producer pushes a new message
into a Kafka topic, a connector automatically syncs it as a JSON object in MongoDB. Or con-
sider connecting to a file system as a source. A user might copy a file there; the connector
picks it up and pushes it into Kafka.
But what if a user copies an incorrect file to the source by mistake? Components such as the
Kafka Schema Registry validate the schema of each message. If the message does not repre-
sent an intended form, the Kafka Schema Registry rejects it.
We have Kafka Streams to manipulate and process data in transit. This framework allows us
to manipulate or query Kafka messages while they are in transit.
Many more services are available for a possible Kafka deployment. For example, some UIs
manage Kafka or monitor its workload.
The core feature of Kafka is that it allows everything to scale. The LinkedIn engineering
team created it to manage the workload when they were in danger of not being able to han-
dle that amount of load anymore. While it was impossible to increase the load with existing
messaging systems, using Kafka made it just a matter of acquiring new hardware to meet
new load requirements.
Data engineers often need to harden a deployment with encryption and advanced authenti-
cation methods. This topic is beyond the scope of the book.
Many data scientists may now argue that the internals of technologies such
as Kafka might not be relevant to them and wonder why we present technical
details here. Some people compare a data team with a soccer team with
different roles. Defenders might be systems engineers, midfielders are the
data engineers, and often, all the fame goes to the strikers who are supposed
to score. But we can assume it is clear with whom we compare strikers in this
analogy.
Some people want to be strikers because of their fame. Who would deny that
shooting an essential goal in a match lets a player stand out, even if defenders
contributed a similar amount to the success by preventing goals of the oppos
ing team?
There is also an archetype of a striker who only thinks about scoring goals.
The media often calls them lazy, as they do not want to help out much in the
defense when needed. On the other hand, outstanding strikers do not just wait
for midfielders to pass a ball and head towards the goal. Instead, they make
themselves essential for the overall gameplay.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
152 5 Data Engineering
As the archetype of the “lazy striker” who just waits for the ball from other
players to score, there are also archetypes of data scientists who wait to
receive data to score. They do not care about the rest; all they want is to
apply algorithms and impress the audience with the insights they generated.
Needless to say that in many companies, this archetype does not have
a good reputation. To make it short: To excel as a data scientist, it is also
essential to understand the whole game you are playing
Databricks Notebooks are integral to the Databricks platform, a cloud-based service for
collaborative data analytics and engineering. The platform is widely used for big data pro-
cessing and analytics, particularly in conjunction with Apache Spark. Databricks Notebooks
provide a collaborative environment where data scientists, engineers, and analysts can
write, execute, and share their code and data analyses. Here are some key features and as-
pects of Databricks Notebooks.
Multi-language Support
Databricks Notebooks support multiple programming languages. Users can write Scala,
Python, SQL, and R commands within the same notebook. This flexibility allows a diverse
team of data professionals to collaborate effectively, each using their preferred language.
Collaborative Environment
Notebooks support real-time collaboration. Multiple users can co-edit notebooks, making it
easy for teams to collaborate on data analysis, machine learning models, and more. Changes
made by one user are instantly visible to others.
Interactive Visualizations
Databricks Notebooks support built-in visualizations for data exploration and analysis. Users
can create graphs and charts directly in the notebook, which helps them more intuitively
understand data patterns and insights.
Workspace Organization
Notebooks can be organized into workspaces, allowing teams to manage access and keep
their work structured and accessible. This is particularly useful in large projects with mul-
tiple contributors.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
5.7 Common challenges 153
Version Control
Notebooks in Databricks can be connected to Git for version control, allowing users to track
changes over time, revert to previous versions, and collaborate more effectively.
In most data platforms, data is sourced from many different sources. Different sources
might mean different standards and different problems when acquiring data.
Let’s show some examples of what can go wrong:
Some data might be scanned from analog sources; although OCR technologies are quite
advanced, they might not be perfect, and there might be some mistakes.
Different data sources might have different standards of how data is represented. Some
use a dot to separate currencies, and others use a comma. Some represent data as ‘yyyy-
MM-dd’ date format, others a structure such as ‘dd-MM-yyyy’.
Some may use the metric system; others may use the American system. Also, the granu-
larity might be different. Some data sources may store “kilo units”, others in mega or even
tera units.
Some data had been entered manually, and it may contain typos or errors. Some import-
ant attributes might also be missing.
Some data can be physically encoded in a different format.
Data sources might structure data differently. In its simplest form, a database might have
one column for the full name, and others might have a column for first and last name.
Data cleaning means data preparing data for analysis by removing or modifying data that is
incorrect, incomplete, irrelevant, duplicated, or improperly formatted.26
For example, imagine clients’ data is stored in many different systems. For an analytical use
case, the data shall be combined. Some call this the golden data set. Cleaning data may also
mean bringing data into a uniform encoding format. Furthermore, cleaning processes can
help standardize data units. Therefore, data provided in length units in the US system, such
as feet and inches, might be converted to a metric system or vice versa.
Data cleaning also deals with missing values, a common challenge when building a machine
learning model. In the article “How to Deal with Missing Data”27, the author recommends
options to deal with them. We can either remove them or impute them. As imputation
method, the author suggests.
Mean, Median and Mode
Time-Series Specific Methods
Last Observation Carried Forward (LOCF) & Next Observation Carried Backward (NOCB)
Linear Interpolation
Seasonal Adjustment with Linear Interpolation
On a broader field, managing data quality goes beyond routines that harmonize data. For
example, suppose a company receives bad-quality data from external data providers, de-
pending on the value of the use case. In that case, the company can negotiate with the data
provider to improve the quality. If sensors collect data, a new generation of sensors might
improve data quality.
26
https://www.sisense.com/glossary/data-cleaning/
27
https://www.mastersindatascience.org/learning/how-to-deal-with-missing-data/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
5.7 Common challenges 155
We might also eliminate inaccuracies using machine learning if we have labeled datasets
with correct data. For telcos, it is sometimes difficult to get accurate geographical positions
of a person using telecommunication data. Many aspects affect the determination of an
exact location, such as signal strength or the position of a person in a building. However,
this data can be improved and made more accurate through triangulation.
Skewed data refers to a scenario in statistical analysis where the data distribution is not
symmetric but leans or stretches more toward one side of the scale. In other words, the data
is asymmetrical, with a longer tail on one side of the distribution’s peak. This characteristic
can significantly impact statistical analyses and the conclusions drawn from the data.
There are two primary types of skewness:
Implications of Skewness
Statistical analysis: Skewness can affect various aspects of statistical analysis. Many
statistical tests and models assume a normal data distribution (no skewness). When data
is skewed, these tests can be less accurate or even invalid.
Data transformation: To address skewness and make the data more symmetric, logarith-
mic, square root, or Box-Cox transformations are often used. These transformations can
stabilize variance and make the data more suitable for analysis.
Descriptive statistics: Skewed data affects the interpretation of descriptive statistics. For
example, in a positively skewed distribution, the mean is typically higher than the me-
dian, as it is influenced more by the long tail of high values.
Data insights: Understanding the nature of skewness in data is crucial for proper inter-
pretation. It can indicate underlying issues or characteristics of the data, such as outliers,
data entry errors, or natural phenomena.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
156 5 Data Engineering
Operational systems often have yet to be designed to be queried frequently. If data jobs
constantly poll a source system, it might have an impact.
Change Data Capture (CDC) is a method used to identify and capture changes made to data
in a database and then apply these changes in real-time or near real-time to another system
or location. It is an essential technique in data integration, as it helps ensure that data in
different systems remain consistent and up to date. CDC is particularly useful in data ware-
housing, replication, and real-time data integration scenarios.
Critical aspects of Change Data Capture include:
Capturing Changes
CDC systems are designed to track and capture insertions, updates, and deletions in a
source database. This is typically done by monitoring the database’s transaction logs (where
all changes are recorded) or using database triggers.
Data Integration
CDC is essential to data integration strategies, particularly in ETL (Extract, Transform,
Load) processes. It allows for more efficient data extraction, as only changed data needs to
be transferred.
Operational systems often do a few jobs, and once they are rolled out and employees are
trained on them, they could exist almost forever. Therefore, the operating system can be
based on hugely outdated stuff. Why should someone upgrade something that works and
everyone is used to? Think about the application that runs on every cashier of a supermar-
ket chain and imagine how complex an upgrade must be.
If someone wants to extract data for analytics decades later, this occasionally means that the
software was not designed to share data in a modern way.
Data engineers might spend more time than expected at the beginning of their career, show-
ing that whatever they do is safe. In the first demo, they show a data pipeline that extracts
data from various sources, combines it, and visualizes the outcome. Many data engineers
will spend significantly more time integrating this demo into a corporate environment. This
often means discussing firewall rules with systems engineers, ensuring the right toolset is
deployed on integration machines and properly maintaining or managing SSH keys and
secrets.
■ 5.8 In a Nutshell
Data engineers build data pipelines to load data onto data platforms.
Data pipelines add and deliver value for companies by (semi-)automatically
applying models to data. The greater the degree of automation, the higher the
added value achieved. There are different channels for ingesting data into a
platform.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
158 5 Data Engineering
Data Governance
Victoria Rugli, Mario Meir-Huber
Picture, for a moment, a ship adrift at sea. Its sails are tattered, its crew weary and desper-
ate. This ship represents the countless data science and AI projects that embark on a peril-
ous journey without the necessary support of robust data governance. The consequences of
this neglect are dire, as they put not only the projects but also the privacy, security, and
trust of individuals at risk. This neglect of data governance in enterprises around the globe
leads to incorrect decisions or failure to deliver a data-driven project.1
Data governance serves as the bulwark against the misuse of data. It not only leads to valu-
able decision making in large organizations, but also safeguards the sensitive information
of individuals from being misused. Without adequate funding for data governance, our soci-
ety is exposed to the horrors of data breaches, identity theft, and the erosion of personal
privacy.
Moreover, data governance is the beacon of transparency and accountability in the world of
data science and AI. It ensures that algorithms are fair and unbiased, that decisions are
explainable, and that the potential for discrimination is minimized. Neglecting funding for
data governance paves the way for the unchecked proliferation of biased algorithms, leading
to unjust outcomes, and perpetuating inequalities.
Data governance is the most neglected part of every data project, even though data quality
has one of the biggest, if not the biggest, influence on the final quality delivered. Many
companies find it hard to justify the necessary investment. However, if a company doesn’t
start adhering to a governance structure right from the start, the cost of implementing it in
a later stage will rise dramatically. On the next pages, we will give an overview of different
contributing factors and costs for failed data governance implementations.
SteelMaster, a steel manufacturing business operating across multiple countries, was strug-
gling with the step towards a data-driven organization. Different software leading to data
silos across divisions such as production, quality control, logistics, and finance made col
laboration difficult. One example was the interpretation of the “heat index”, a term that held
different meanings based on each individual department’s own definition. In some divi-
sions, the “heat index” referred to the intensity of the steel production process, while in
others, it signaled the climatic conditions of steel storage. This led to confusion, inefficiency,
and conflicting reports. This is a common problem within large organizations, as separate
departments define KPIs differently according to their needs. This is mostly due to siloed
information not being communicated centrally.
Data lineage: The introduction of data lineage helped shed light on the journey of SteelMas-
ter’s data, from its inception to its final utilization. This newfound transparency showed all
the operations, from data collection and transformation up to utilization, enabling the iden-
tification and resolution of discrepancies, inaccuracies, and inconsistencies related to the
1
Brous, P., Janssen, M., Krans, R.: Data Governance as Success Factor for Data Science. Responsible Design,
Implementation and Use of Information and Communication Technology. 2020 Mar 6;12066:431–42.
doi: 10.1007/978-3-030-44999-5_36. PMCID: PMC7134294
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
6.1 Why Do We Need Data Governance? 161
“heat index”. Data lineage played an important role in aligning terminologies and interpre-
tations.
Data catalog: SteelMaster also introduced a data catalog, which served as the central repos-
itory for all data-related information. This catalog offered a user-friendly, searchable inven-
tory of data assets, complete with uniform definitions, usage guidelines, and ownership
details related to the “heat index”, and other important business terms. The result was a
consistent interpretation of the “heat index” across all divisions and locations, mitigating
confusion and fostering a shared understanding. Now, if any department required specific
data related to the “heat index”, they could effortlessly consult the data catalog, ensuring
alignment with other departmental metrics and refer to the shared knowledge base.
The implementation of these data governance tools enhanced communication, more accu-
rate decision-making based on comparable values, and heightened operational efficiency.
ConnectY, an international telecom company, started out with issues in data quality. Their
data strategy was built on fast business implementations, which meant that over the years,
many quick fixes had been applied to their data models and structures. The result? Several
data models, each conflicting with the other, such that only top experts could build a single
source of truth. This led to several issues, and the speed of implementation decreased with
every iteration cycle. At some point, management decided they needed to do something
about it, as the issues at hand were so pressing.
One of the key issues they identified was that KPIs were defined differently across depart-
ments. The same KPIs were calculated inconsistently, leading to different numbers in met-
rics such as:
Customer churn rate:
Definition A: Calculated as the percentage of customers who terminated their subscrip-
tions in each period.
Definition B: Calculated as the percentage of inactive SIM cards for which there had been
no usage activities in the last 30 days.
Average revenue per user (ARPU):
Definition A: Total revenue divided by the total number of active subscribers.
Definition B: Total revenue divided by the total number of SIM cards issued.
By creating a data governance board, the harmonization of key KPIs was accomplished. This
led to better steering of the business and a joint understanding of the company’s perfor-
mance. In a later step, the data models were re-designed and a single, common data model
was created. This led to faster development times and data as a service. With data as a ser-
vice, employees of the company could retrieve data in a self-service manner.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
162 6 Data Governance
People
This section can be read as a job posting for your data governance board: what kind of
employees are needed for the job of data steward, and who among the existing talent within
your teams can take on the challenge of data ownership. It also explains their responsibil
ities, and how they will need to interact within the system. Further, this section describes
the importance of communicating a change within the processes, to ensure the continuity
of the new data governance effort. In essence, this is where the organizational structure and
the culture shift needed for a successful data governance initiative are explained.
2
Madsen, Laura B. (2019): Disrupting Data Governance: A Call to Action
3
Dama International. 2017. DAMA-DMBOK: Data Management Body of Knowledge (2nd Edition)
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
6.2 The Building Blocks of Data Governance 163
Processes
Processes are the core of effective data governance, ensuring that governance activities
have a defined and working process behind them, which are recognized within the organi-
zation. In this section, we’ll explain the five pillars of processes, putting a focus on ensuring
data integrity, security, and usability. By addressing all these pillars, organizations can
ensure efficient data governance processes that effectively meet their data governance
objectives.
Technology
Lastly, we analyze a short list of popular data governance tools using the DAMA framework
as a background, showing their strengths with respect to supporting the aforementioned
people and processes. Both on-prem and cloud native data governance tools are considered,
and we analyze both open-source and proprietary tools by leading cloud providers. However,
the focus is set on centralized data governance efforts. As the landscape of tools is evolving,
only the current capabilities of the tools can be compared.
Data governance originated from data quality efforts, as a certain level of data quality has
always proven itself to be important. Bad quality data leads to bad overall decisions. Data
governance efforts tend to increase drastically when not done at the very beginning of your
data strategy, as the effort to add a governance layer on top of an existing structure adds
difficulty to every data governance layer beneath it.
Before each data governance initiative, the goals, objectives, and expected outcomes should
be defined, in order to avoid spending time on secondary activities that are not crucial for
the project’s success. Data governance is an infinite activity within a company, as data con-
stantly changes, evolves, and is used. Thus, the success of initial implementations of data
governance is difficult to measure.4 Yet this is not the only challenge in data governance;
there are many more:
Maintaining a high level of data security often slows down business implementations.
Initial investments are necessary, yet don’t show value for months or even years.
Benefits are only visible through other metrics such as generating faster business results
or better insights. Thus, it is not easy to quantify the direct impact.
Data governance is often considered tedious and time consuming and does not have the
same appeal as other practices centered around data. Everybody talks about data science,
but nobody about data governance. Yet the reality is, data scientists can do almost nothing
without data governance.
Data governance tools are rare, and those that are available are very expensive. The open-
source community doesn’t focus too much on this topic, as there is less “buzz” around it
than AI. However, this also creates opportunities for companies.
4
Madsen, Laura B. (2019): Disrupting Data Governance: A Call to Action
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
164 6 Data Governance
Companies can follow three different strategies. Each requires and is suited to a different
level of maturity:
Reactive governance: Efforts are mainly designed to respond to current pains. This hap-
pens when the organization has suffered a regulatory breach or a data disaster.
Pre-emptive governance: The organization is facing a major change or threat. This strat-
egy is designed to ward off significant issues that could affect the company’s success.
Often, it is driven by impending regulatory and compliance needs.
Proactive governance: All efforts are designed to improve capabilities to resolve risk and
data issues. This strategy builds on reactive governance to create an ever-increasing body
of validated rules, standards, and tested processes. It is also part of a wider information
management strategy.
Ideally, data governance is executed proactively. This provides many benefits, while reduc-
ing various risks. If not done in a proactive strategy, data management policies and controls
are often overlooked as they are not put in place by design. This leaves an organization
vulnerable to data breaches, data quality issues, and non-compliance with regulations,
which can lead to costly repercussions. Additionally, many industries have stringent data
protection and privacy regulations, such as GDPR or HIPAA. A reactive approach may result
in non-compliance, leading to fines5, legal issues, and damage to an organization’s repu
tation.
Without a proactive governance strategy, data can become inconsistent and unreliable,
affecting decision-making processes. Inaccurate data can lead to poor strategic decisions
and operational inefficiencies. Subsequently, this leads to higher costs, as resolving data
issues after they have emerged can be more expensive than preventing them in the first
place. Data breaches can result in substantial financial losses.
Proactive data governance allows organizations to leverage data as an asset, enabling better
insights and innovations. This is a must-do if a company wants to become data-driven. A
reactive approach may hinder these opportunities, due to the lack of data quality and read-
iness.
A big disadvantage of reactive and pre-emptive data governance strategies in comparison to
proactive data governance is the potential for data breaches, which lead to public distrust.
Public trust is critical for any organization. Reactive data governance can lead to privacy
violations, damaging an organization’s reputation and eroding customer trust.
Finally, a reactive approach can also divert resources toward fixing data issues, when those
resources could be better used for more strategic initiatives. It can result in a constant cycle
of firefighting, instead of driving value from data.
5
WhatsApp fined 225 million for GDPR violations; https://www.euronews.com/next/2021/11/22/whatsapp-
rewrites-its-europe-privacy-policy-after-a-record-225-million-gdpr-fine
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
6.3 People 165
■ 6.3 People
At the forefront of any initiative are the people leading the way. This is also true for data
governance, as a successful governance project is dependent on the collaboration of all
stakeholders. One of the base concepts of data governance is the implementation of an own-
ership and stewardship framework, as well as a data governance board, leading the initia-
tive and advocating for change within the organization.
If the core goal is for users to feel enabled to make use of the data at their disposal, then
there have to be role models that act as facilitators and mediators between departments and
end users and the IT department, which has historically been in charge of leading the data
governance initiative. But before the board, two specific roles are crucial in ensuring the
success of every data governance initiative. These are the data owners, who are usually the
representatives of the different departments, and the data steward, who acts as a project
manager for all things data governance.6
Figure 6.2 The relationship between the roles in a data governance initiative
Data ownership is one of the foundational concepts of data governance, and refers to the
business ownership of different sets of data within the company. It establishes clear lines of
responsibility and accountability for data accuracy, quality, and reliability across different
business branches, departments, or units. The way data ownership is defined, either verti-
6
Plotkin, D. (2020): Data stewardship: An actionable guide to effective data management and data governance.
Academic press.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
166 6 Data Governance
cally or horizontally within an organization, depends on the overall data government struc-
ture the company adheres to.
7
Micheli, M., Ponti, M., Craglia, M., & Berti Suman, A. (2020): Emerging models of Data Governance in the age of
datafication
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
6.3 People 167
In Practice
Let us now consider a real-life example of a retail company practicing horizontal data own-
ership. The sales department may own the data of the customer and sales systems, while
the human resource department may its own employee data. The sales department might
work with a CRM tool and a sales tool, such as Salesforce, while the HR department may use
BambooHR, Workday, or other, similar tools. Either party may not know how to use the
other’s tools, nor what data is collected within that tool, but they have a clear understanding
of the data collected within their tool and department, as well as who has access to it and
who can change or add to it. This is the reason why data owners are set within the depart-
ments they own the data for, as they know their data best. Since they use the tools, the data
is used daily, and they know the business needs and constraints the data must adhere to.
Conclusion
Horizontal data ownership emerges as a strategic approach to data governance, leveraging
sector-specific expertise for enhanced accuracy, quality, and reliability. By aligning data
ownership with departmental functions, organizations can navigate the complex seas of
data governance with agility and precision, ensuring that each sector plays a crucial role in
steering the ship toward successful and secure data management.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
168 6 Data Governance
In the field of data governance, the role of data stewards stands in stark contrast to that of a
data owner. While data owners have a focused responsibility within specific departments,
data stewards assume a more global role, overseeing and managing data governance ini
tiatives across data in a company. They oversee and manage data governance initiatives
throughout the organization and act as the bridge between business departments, IT,
advocating for data management best practices while ensuring that data governance poli-
cies are adhered to.
The link between IT and Departments
Data stewards link technical data details and the broader business context, translating com-
plex data insights into actionable information. Through their daily involvement, they ensure
that data governance remains an evolving aspect of an organization’s operation, adapting to
the changing needs and challenges of the data. They act as the managers of stakeholder
interests, weighing the importance of each decision. They are considering the importance of
the use of the companies’ data while building the governance rules that protect the data. An
often-overlooked aspect of data governance is user training and communication. Educating
users about the new processes, search functionalities, and the importance of responsible
data usage is critical for the success of any governance initiative. This role of communi
cation is centralized within the data steward.
The Global Role of Data Stewards
A data steward is usually part of the IT department, and is the champion of all data used
within the organization. They should be able to tell what source system a dataset lies in and
who controls or owns each dataset. After a certain size of an organization, a data steward
will need tools supporting them, as well as a large system of data owners working with
them.
Within the ship analogy, the data stewards are the helmsmen, steering the ship in a joint
effort. They know the ship best, and ultimately have the responsibility of keeping it on
course, all while helping the crew work together. The data stewards are the sailors of the
data governance ship, advocating for changes in the course and making sure all the parts
are working.
Responsibilities of Data Stewards
1. Data policy development
Active contribution: Integral to the inception of every data governance initiative is the
requirement to actively contribute to the development of data governance policies
aligned with the organization’s goals.
Daily enforcement: Once policies are in place, enforce these policies daily, ensuring a
consistent and compliant approach to data governance.
Continuous alignment: Continuously align policies with evolving organizational needs
and industry standards, advocating for a dynamic and adaptive governance framework.
2. Data documentation
Metadata oversight: Oversee the administration of metadata, ensuring comprehensive
documentation of data elements and their characteristics.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
6.3 People 169
In Practice
To use a different industry as an example, the data steward of a hospital will be responsible
for all types of data, including the accurate lineage of patient care information. They work
within the IT department and must be able to tell where patient care data is updated and
what systems have access to that data. If a chart shows that a patient moved from intensive
care to stationary care, the data steward will be able to tell on what system the input of that
change has been made and where these changes in the patient’s status will appear.
Data is mostly created in decentralized business units, and a key aspect of data is that it is
produced in the corresponding business department. Data about a customer is often pro-
duced in customer-focused business functions such as customer care or marketing. This
leads to several challenges when working with data: who owns a specific dataset? Often, it
this not so easy to determine. Continuing with the customer data example: the marketing
team might store data about customer interactions in shops, websites, or call-centers.
They’re interested in what this data can reveal about how to upsell other products to custom-
ers. The customer experience team is also interested in this data, but from a different per-
spective: they want to improve the service so that the customer doesn’t have to call the
customer service line for basic information like information on past orders. Different data
points about the customer are relevant for that, and different business units may each con-
tribute different data points about the customer.
Let’s use another example, this time with a finance department and marketing department.
Finance is interested in the contracts a customer has and what money this brings per
month. This needs to be very accurate, since bill runs are normally executed towards the
end of the month, and any inaccuracies lead to potential losses in that specific month. In
this case, finance may only look at the data at the end of the month. Marketing, on the other
hand, might rather be interested in how many customers they won in the month, and how
many canceled their contract only a couple of days after signup. This leads to diverging
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
170 6 Data Governance
metrics for number of customers, since finance only looks at the number at the end of the
month, whereas marketing might count the maximum number of customers during the
month.
To avoid such challenges, it is common—and very beneficial—to raise them in a “data gover-
nance board”. This board is ideally led by someone independent, such as the data gover-
nance officer or chief data officer. In this board, all data stewards are present. The data
stewards from different units discuss different data topics, such as the two samples from the
marketing and finance department initially mentioned. The board then agrees on a common
way to treat different data-related topics and creates guidelines. The data governance board
acts as a guiding force, ensuring that data is treated as a strategic asset, managed effec-
tively, and used in ways that align with the organization’s goals and adhere to ethical and
legal standards.
The four key functions of the data governance board are:
1. Policy development:
They develop, revise, and approve data governance policies, ensuring they are compre-
hensive, clear, and aligned with organizational goals.
2. Issue resolution:
The data governance board addresses conflicts or issues related to data management
and acts as a referee weighing different stakeholder interests.
They make sure the issues have a timely resolution so that the initiative can move for-
ward.
3. Decision making:
They are responsible for strategic decisions on data-related matters, guiding data u
sage,
access controls, quality standards, and compliance.
4. Reporting and communication:
The board is charged with reports to executive leadership or stakeholders about the
progress, challenges, and achievements in data governance efforts.
They also communicate the importance of these efforts and advocate the need for data
governance, sometimes even acting as the spokesperson for budgeting questions.
Typically, the data governance board comprises representatives from various departments
or functional areas within the organization. This includes data stewards, data architects, IT
personnel, compliance officers, business analysts, and sometimes executive-level manage-
ment.
Becoming a data-driven organization is not only a technical task. A topic that is often ne-
glected in digitalization programs and data transformation projects is the impact it has on
people, and how people deal with that. Many initiatives fail because of a lack of integration
of employees into such programs. This is where change management comes into play: it
deals with the human factor.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
6.3 People 171
it is helpful for external trainers and people with a background in psychology to support
change managers to adapt to new processes and analyze uncertainties and potential resis-
tance to change.
■ 6.4 Process
The next building block after people is processes. With processes, we now define all key
elements in data governance that are handled by people and executed by technology. Pro-
cesses are the backbone of effective data governance, playing a pivotal role in ensuring data
integrity, security, compliance, and usability.
There are five key pillars when it comes to processes in data governance:
Metadata management: understanding the content of the data.
Data quality management: ensuring good quality of the data.
Security and privacy: making data secure, and complying with regulations.
Master data management: having a single source of truth for your most important data.
Data access and search: enabling access and discovery of data within your company.
Each of these pillars will now be described in more detail.
Imagine you have a new use case to work on, for which you need company data. You then start
searching for the data in your database and get confused. Have you ever experienced that?
Most people do. This happens for several reasons, and missing metadata is one of them.
Metadata can be thought of as “data about data.” It describes how to best work with the data,
what kind of attributes you can find within it, and what values you can expect. To illustrate
why this is important, let’s assume we have the following table schema:
ID
cust_fname
address
You would know exactly what to expect from the data without a description, right? Perhaps
it’s not so easy. What does “cust_fname” mean? Does the “f” stand for “full name” or “first
name”? How is the address built? Does it include a zip code? How are the elements within
the address separated, and where do you find the street name?
In short, in this example, it is ambiguous what you will find in the data. Therefore, it is
necessary to clearly describe it. We could extend example as follows:
ID: a uniquely defined UID
cust_fname: the full name of the customer, starting with the first name. The last name is
separated by a comma “,”.
address: the address of the customer, starting with the street, followed by the zip code and
city. The three items are separated by a comma “,”.
To do a comprehensive metadata management, several steps should be done:
Step 1 – Collecting metadata: The first step is about gathering the metadata from the dif-
ferent systems. Here it is important to have domain know-how about a specific system. IT
and business experts are necessary to ensure that all relevant information is collected. This
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
174 6 Data Governance
is the first step where data owners are responsible for the correct collection of data. Differ-
ent systems store data differently, even when it represents the same entity, such as a cus-
tomer. Metadata can also be produced from mobile or web sources.
Step 2 – Organizing metadata: When the data gets collected, it is necessary to organize the
metadata on the different entities, like customers. Schemas or standards are typically used
for that, often coming from systems which have their own schemas and standards that can
be reused. It is also necessary to add information about different fields. An example for this
is the different layout of zip codes, such as the UK layout (in letters and numbers) or the US
layout (numbers only).
Step 3 – Storing metadata: Once Metadata has been collected and organized, it must be
made available for consumers to use. Many different metadata management solutions exist
for this. Additionally, modern data catalogs provide metadata capabilities, and users have a
central repository to view the metadata.
Step 4 – Maintaining metadata: The work of metadata management doesn’t just end after
the first three steps are done. Data is always in motion; it changes, and so do schemas and
metadata. In today’s dynamic world, with agility in the development cycles, it is necessary
to keep up with the changes. This means metadata needs to be constantly checked to ensure
it is still accurate and up to date. This can be done by cooperating with the business units
owning the technical systems, such as an app or the website. Data stewards are normally
the go-to persons in an organization for keeping track of these changes.
Step 5 – Using metadata: Eventually, users need to use the metadata. Technical implemen-
tations are worth nothing if not used. Therefore, extensive change management is neces-
sary, and skills and training programs need to be started within an organization. Change
management can help users understand how they can use metadata to their advantage,
empowering employees to use new tools and helping them understand how they can serve
their needs.
Technical metadata: Technical metadata provides detailed information about the technical
aspects of data, files, or digital resources. It focuses on the characteristics, properties, and
technical specifications of the data, which can be essential for the effective management,
processing, and use of the data. Technical metadata helps users and systems understand
how data is structured, formatted, and stored, as well as the conditions under which it was
created or acquired.
An example of technical metadata is:
File “customers.json”:
Having proper metadata management is the first step to increasing the quality of your data.
But what does good quality data actually look like? We’ll see this in the next section, where
we’ll take a look at the different attributes that make up data quality.
A key aspect many people think of when talking about data governance is data quality.
Though it is only one aspect of data governance, it is a key pillar of it. The goal of data gov-
ernance is often to increase data quality. Therefore, it is important to make data quality
measurable.
Figure 6.5 shows the six key factors upon which data quality can be measured:
Availability
Availability means data should be accessible. This pertains not to the general availability of
the database but to the availability of each individual dataset. When querying all existing
users interested in luxury cars, for instance, we are not interested in a subset of the data but
rather, its entirety. Thus, the querying algorithm should be as effective as possible to re-
trieve all available data. There should be user-friendly tools and languages for accessing the
data. Typically, databases provide this by offering developers a query language such as SQL,
or object-relational mappers (O/R-Mapper). O/R-Mappers are a programming technique
used to convert data between a relational database and an object-oriented programming
language, creating a virtual object database accessible within the programming language.
Availability also implies that the data used for a specific use case should be accessible to
data analysts within business units. Data that is relevant to a marketing campaign might
exist but not be available for the campaign. For instance, the company may have certain
customer data available in the data warehouse, but business units may be unaware that the
data exists. This is a problem.
Consider a luxury car dealership aiming to analyze the preferences of all customers inter-
ested in high-end vehicles. The availability of data becomes crucial in this scenario. The
database may contain information on these customers, such as their purchase history, pref-
erences, and contact details. The organization wants to utilize this data for a targeted mar-
keting campaign to promote new luxury car models. In this case, the issue of data availabil-
ity also extends beyond the technical realm. Despite having the necessary data in the data
warehouse, there might be a disconnect between the data analysts and the business units.
For instance, certain customer data may be present, but the marketing team might be un-
aware of its existence. This gap underscores the importance of seamless communication
and awareness within the organization, ensuring that the data is not only available, but
effectively utilized for strategic decision-making, such as launching a targeted marketing
campaign for luxury cars.
Accuracy
Accuracy implies that data must be precise. When querying all existing users on a web
portal dedicated to luxury cars, the data should accurately reflect the relevant information.
Precision entails that the data genuinely represents individuals interested in luxury cars,
and any erroneous entries should be eliminated. A dataset is also considered inaccurate if a
user has changed their address without the company’s data reflecting the change accord-
ingly. The display of multiple addresses without a note showing the most recent one are
confusing. Therefore, it is imperative to track when each record was last updated, to ensure
data accuracy.
Consider a high-end automotive website that caters to users interested in luxury cars. En-
suring data accuracy in this context is essential for targeted marketing efforts. Suppose the
company decides to query all existing users to refine their customer profiles for a new pro-
motional campaign. Data accuracy comes into play during this process. Each user’s informa-
tion, including preferences, contact details, and purchase history, must be precise and up to
date. For instance, if a user has recently relocated, the database needs to reflect this address
change. Additionally, it is crucial to verify that the data genuinely represents individuals
interested in luxury cars, and any fake or outdated entries should be promptly removed.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
178 6 Data Governance
Imagine a scenario where a user has changed their last name, but the company remains
unaware of this change. In such cases, the dataset would be deemed inaccurate. To mitigate
this, the company needs a system in place to track when each user’s record was last up-
dated. This ensures that the marketing team works with the most recent and reliable data,
enhancing the accuracy of their customer targeting efforts.
Completeness
The data should exhibit completeness. Addressing all users interested in luxury cars be-
comes meaningful only when we can engage with them through a viable channel, such as
email. If the email field is vacant or any other field pertinent to our user communication
remains incomplete, the data lacks completeness for our intended application.
Let’s return to our high-end automotive platform, and imagine that they want to communi-
cate exclusive offers on luxury car models to their user base. In this instance, having com-
plete data involves ensuring that every user has a valid and accessible email address re-
corded. If the email field is left blank or contains outdated or inaccurate information, the
company’s ability to reach and engage with its target audience is compromised. Therefore,
completeness in data is not merely a technical requirement but a strategic imperative for
effective communication and engagement with users interested in luxury cars.
Timeliness
The timeliness of data is imperative. Users may alter their email addresses over time, and
our database should promptly and comprehensively reflect such changes. When aligning
our users with luxury cars, it is far from optimal if only 50% of the user email addresses are
accurate. The presence of “Big Data” loses its luster when accuracy falters due to a lack of
timely updates.
Suppose that our upscale automotive platform now wants to implement personalized user
experiences. A user, initially interested in sports cars, may have evolved their preferences
over time, and is now expressing a keen interest in luxury SUVs. If the database fails to
promptly capture and update this shift, targeted communications, such as promotional
offers and product recommendations, become misaligned. In this scenario, data timeliness
transcends mere technicality; it becomes the linchpin for delivering a sophisticated and
responsive user experience.
Consistency
This should not be confused with the consistency requirement of the CAP theorem. Data
duplication may occur as users may register multiple times to avail themselves of different
benefits. A user might select “luxury cars” with one account and opt for “budget cars” with
another. Duplicate accounts give rise to data inconsistencies and are a prevalent challenge
in major web portals, exemplified by platforms like Facebook.
Users on the platform, driven by diverse preferences, may register multiple times to explore
distinct facets of the automotive world. For instance, a user could initially express an inter-
est in “luxury cars” with one account, seeking tailored information and promotions in that
category. Subsequently, the same user might create a separate account with a preference for
“budget cars,” aiming to explore alternative options and incentives.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
6.4 Process 179
Herein lies the challenge: the existence of duplicate accounts. While each account inde-
pendently reflects the user’s authentic preferences, the presence of multiple registrations
can introduce data inconsistencies. For instance, a marketing campaign targeting users
interested in luxury cars might overlook the fact that the same individual also expressed
interest in budget cars through a different account.
Understandability
Understandability in data representation is crucial. When querying our database for indi-
viduals interested in luxury cars, it should be straightforward to comprehend the substance
and contents of the data. Upon retrieval, our preferred data exploration tools should seam-
lessly facilitate interaction with the data. The data itself should be self-descriptive, provid-
ing clear indications of how to interpret and manipulate it. For instance, if the data yields a
column labeled “zip” (postal code), we should readily discern that this represents the postal
code of individual users’ residences. This transparency in data structure enhances usabil-
ity, ensuring that users can readily navigate and harness the information effectively.
Imagine our high-end automotive company employs advanced analytics to target potential
buyers for their luxury car models. In their database, a query is initiated to identify individ-
uals interested in specific features like high-performance engines or cutting-edge safety
technology. Upon receiving the data, clarity in its representation is paramount. The dataset,
well-structured and labeled, uses intuitive descriptors. For instance, a column titled “Engine
Type” clearly specifies whether a user prefers “V8” or “Electric.” Another column, labeled
“Safety Rating,” unambiguously indicates users’ safety preferences.
The profound impact of understandability in data becomes evident when a data analyst,
tasked with refining the targeted marketing strategy, can swiftly interpret and leverage the
information. The analyst doesn’t need to decipher cryptic codes or navigate through con
voluted spreadsheets. Instead, the transparent and self-descriptive nature of the data allows
for immediate comprehension.
In this scenario, the data not only represents user preferences accurately but also empowers
decision-makers to effortlessly derive insights. The understandability of the data facilitates
a seamless transition from raw information to actionable intelligence, ultimately enhancing
the efficiency and effectiveness of the marketing strategy.
Data security and data privacy are essential aspects of data governance. They are often
achieved together with other functions within the company, such as the information secu-
rity department, and thus need a lot of interdisciplinary know-how. This also means that
data security and privacy are often not utilized within the data governance organization,
but that their orchestration emerges from a data governance organization.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
180 6 Data Governance
As shown in Figure 6.6, there are four main layers to data security. Let’s deep dive into the
four layers of the pyramid now:
Large organizations handle vast troves of data, and this data must remain accessible to au-
thorized personnel who require it to perform their roles effectively. For example, financial
analysts need access to financial records, but they may not require access to customer
health records. By adhering to the need to know principle, organizations can minimize the
risk of data breaches and accidental exposure, ensuring that sensitive information is shared
only with those who have a legitimate reason to access it.
Master data management comprises various tasks and processes to ensure that master data
is correct, up-to-date, consistent, and complete. It begins with the capture of master data and
ends with integrating that data into target systems. Between capture and integration, sev-
eral continuous steps are executed, which include maintenance, ensuring consistency, qual-
ity assurance, and access control. The respective steps are:
Capture: Master data is recorded and stored in a database or system. This can be done
manually or automatically.
Maintenance: Regular updates and master data maintenance are necessary to ensure its
correctness and currency. Changes such as addresses, contact details, or product infor-
mation must be appropriately documented.
Consistency: Master data must be consistent across different systems and departments.
Data should be used uniformly and without contradiction throughout the organization.
Quality assurance: Monitoring and ensuring the quality of master data involves identify-
ing and correcting errors, duplicates, or incomplete records.
Access control: Access to master data should be controlled and restricted to prevent un
authorized changes or deletions. Clear guidelines and processes for data security and
privacy should be established.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
6.4 Process 183
Integration: Master data might need integration with other systems or data sources to
ensure smooth information flow. For instance, this could occur during the synchroniza-
tion of customer data between a CRM system and an ERP system.
A golden record aims to solve this issue by creating a single, merged, and consistent version
of a company’s customer data. It combines relevant information from different sources and
cleanses any discrepancies or errors. The result is a complete and accurate customer data-
set, acting as a trusted source for analysis, reports, marketing activities, and other business
processes.
Creating a golden record involves data integration techniques, data cleansing, matching pro-
cedures, and, potentially, manual checks. The objective is to ensure that the golden record
exhibits the best quality and completeness, serving as a single source of truth for that data.
A golden record can reduce data inconsistencies, enhance business process efficiency,
create better decision-making grounds, and optimize the customer experience.
In column-based security, data access is controlled at the column level. It specifies which
users or user groups have access to specific columns in a table. This allows differentiated
control of access to individual data fields or information. An example of column-based secu-
rity is an employee database, where columns like salary or social security number are marked
as sensitive. Only authorized users or user groups have access to these sensitive columns,
while other users only have access to non-sensitive columns like name or position.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
6.4 Process 185
In this subchapter, our focus shifts toward a more practical aspect of the data governance
process: the access to data, and the optimization of the ability to search within data. This
chapter outlines the significance of creating clear guidelines for data access, considering
metrics such as data sensitivity, regulatory compliance, and organizational roles.
We introduce the strategic implementation of role-based access control (RBAC), which is
about aligning specific permissions with designated roles to enforce the principle of least
privilege and facilitate data access. Additionally, we explore the role of effective using struc-
tured metadata and a centralized data catalog, highlighting its contribution to organiza-
tional efficiency and informed decision-making.
Write access
Allows users to create, modify, or delete data. This type of access is required for functions
involved in entering, updating, and maintaining data.
Execute access
Allows users to perform specific or predefined actions on the data. This applies especially
to activities that require data-driven transactions.
Delete access
Enables users to delete data. It is important to carefully consider and set limits to prevent
accidental or malicious data deletion.
Data Search
Effective data search is one of the core benefits of a properly managed data estate with
structured metadata management, as it enables the implementation of a data catalog. Meta-
data includes information about data such as its origin, format, and usage, and facilitates
the discovery and understanding of available data sets. Establishing metadata standards
and consistently applying them across the organization is key to enhancing search capabil-
ities.
Data Catalog
A data catalog serves as a centralized repository for metadata and plays a pivotal role in
enabling efficient data discovery. It provides users with a searchable inventory of available
data assets, including information on data lineage, quality, and usage. Integrating the data
catalog with other data governance components, such as data dictionaries and business
glossaries, that adhere to industry standards, further enhances its value. This is because
every company has its unique structure; as a company changes and evolves, the catalog can
evolve with it, but it will also save the history of the changes. In the age of quick transitions
and fast growth, this becomes crucial to understanding data that is more than a decade old.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
6.5 Technology (Data Governance Tools) 187
Semantic Search
Recent developments have shown a strong push beyond traditional keyword-based search
and towards semantic search. It leverages natural language processing and machine learn-
ing to understand the meaning behind search queries. By enabling users to search for data
in a more intuitive and context-aware manner, semantic search enhances the efficiency and
effectiveness of data discovery.
AI-Powered Search
After the quick rise of generative AI, search functions have experienced a revolution, as
traditional search engines have been replaced by natural language models that answer
users’ questions in full sentences. With the goal of these AI systems being to search large
amounts of data, data governance at the source of these queries has gained importance. If
the source is biased, inaccurate, or of low quality, then the model will suffer from a lack of
quality in the training data.
Within the framework of Natural Language Processing (NLP)-powered search functions,
many organizations have opted to enrich the classical models currently on the market with
the company’s own data. They add organizational data into the existing model, making it
possible to query both structured information like databases, as well as unstructured data
such as emails and pictures.
Many large enterprises offer their own data governance tools. Where some started with an
“Excel-based” data catalog—often due to budget issues—others like LinkedIn or Uber devel-
oped a very comprehensive data governance solution. In this sub-chapter, we will explain
the two most common open-source solutions for data governance.
6.5.1.1 OpenMetadata
OpenMetadata is an open-source project that helps companies with data governance. It of-
fers an infrastructure to deal with the data lifecycle from data creation up to consumption
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
188 6 Data Governance
and allows organizations to oversee their policies, trace data lineage, and track quality. It
is a platform for discovery, observability and governance, powered by a central metadata
repository. It includes functionalities such as in-depth lineage and is among the fastest-
growing open-source projects in the world.8
People
Roles
Roles and responsibilities for data governance are assigned to company employees or any
other stakeholder responsible for tasks and carry out a broad function. This includes access
management and options to allow review alterations and accept or reject transformations
made on metadata. It also uses ownership metadata to disclose the contact people for an
asset, so that they can help with potential questions.
Empowering Users
Access control, business glossaries, and classification tags, as well as versioning of data
entries, are provided by OpenMetadata to ensure data integrity and compliance. Enriched
metadata context and search capabilities make locating or understanding data assets easy,
while business glossaries and classification tags help users understand the data.9
8
https://docs.open-metadata.org/v1.2.x/features
9
https://docs.open-metadata.org/v1.2.x/features/discovery-collaboration
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
6.5 Technology (Data Governance Tools) 189
Process
Data Lineage
OpenMetadata helps streamline data governance processes, as it provides transparency
into the data flow through lineage and enables easy ways of implementing effective prac-
tices when dealing with that process. For instance, it provides impact analysis tools that
users can use to evaluate the impacts of changes at the source level.
Data Quality
Additionally, data quality initiatives are included, such as establishing data quality tests
and using data profiling to ensure the integrity of data products. OpenMetadata’s native
tests cover a range of data quality dimensions, helping to measure accuracy, completeness,
reliability, relevance, and timeliness of the data in question.
Technology
AI-powered labelling
OpenMetadata leverages AI and machine learning for data labeling, classification, and sen-
sitivity labeling. This allows companies to classify and label data according to its sensitivity,
aligning with compliance requirements, and ensuring data security.
Connecting Data Sources
OpenMetadata supports connectors for a variety of data services powered by OpenMetadata
Standards and API metadata administration across all data sources, allowing companies to
add various data sources to their metadata repository.
6.5.1.2 DataHub
DataHub is a project that was originally launched on LinkedIn and AirBnb. The key devel-
opers of these companies (and others) joined forces and created the open-source platform
“DataHub,” which continues to power LinkedIn’s data management to this day. DataHub
has a very vivid open-source community around the project. It can be easily installed as a
Docker container, and has native support for AWS and GCP.
DataHub can natively query and analyze many different data sources. Among them, the
most notable are Snowflake, Kafka, Looker, Redshift, Hive, BigQuery, dbt, PostgreSQL, and
many more. The project can be found at https://datahubproject.io/. All illustrations in the
following descriptions are taken from the Demo available at https://demo.datahubproject.
io/.
People
Roles
DataHub provides different roles for users. The key focus is combining different views on
metadata, all grouped under “Metadata 360”. A fine-grained access control system allows
the organization to set different aspects of user rights. This is focused on data owners and
data stewards. Individual data resources can have different access levels.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
190 6 Data Governance
Business Glossary
A key item feature of Datahub, especially for business users, is the business glossary. This
glossary provides an overview of all the business terms that are used within the organiza-
tion. A business term is normally owned by business users (such as data stewards and data
owners) within business units. Each term is defined and described in the glossary for global
lookup, and terms can be linked to each other if case they are related. A business term is not
a technical description but rather, a business description of data. It is normally based on
KPIs such as “customer churn rate” or “ARPU.”
Domains
Everything within DataHub is organized into “Domains”. These are typically high-level
groups of different data, such as “Marketing” or “Finance”. Domains themselves are grouped
into three items:
Entities: An entity is the smallest aspect. It is either a dataset, view, dashboard, or alike.
Entities allow for detailed analyses with Metadata 360.
Sub-domains: Very large domains can be split into sub-domains to allow a better over-
view of entities and data products. If we use a large B2C organization as an example,
subdomains could include “Customer Service”, “Marketing”, or “Shops”.
Data products: Data products are built around the Data Mesh architecture. It is a further
grouping of logical items, like tables, reports, and alike, that belong together within a
domain. Data products are more specific than sub-domains, as the latter is intended for
further grouping in large data landscapes.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
6.5 Technology (Data Governance Tools) 191
Process
Metadata 360
Metadata 360 is the central view for all the metadata about a data set. If we use the “cus-
tomer” dataset, it shows everything relevant to the customer. The figure below gives an
example of such a view.
Metadata 360 provides a central overview of what the dataset is about, who owns it, what
domains or data products it belongs to, and what tags are set for it. It also provides a clear
overview of:
Schemas: This includes the different fields that exist in the selected entity. If it is a
(materialized) view, it also provides an overview of all the different tables that are used
for this entity.
Definitions: Definitions describe how the entity is built. If we talk about views, defini-
tions describes the statements (such as SQL queries) that were used to create the view.
The different tables are visible as SQL statements (or however they are built).
Documentation: This is a written description of what the entity is. Any information that
might be relevant for a user is listed here.
Lineage: Data is highly distributed and has a lot of dependencies. To track these depen-
dencies, it is highly important to visualize them. This reduces errors in data processing,
since it makes visible both data sources that depend on a given dataset, as well as data
consumers of it. Lineage provides the overview of the dependencies different data entities
have to each other.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
192 6 Data Governance
Properties: This contains a key/value pair of different properties that are set on the entity.
Queries: Shows historic and commonly used queries on the dataset or entity.
Stats: Shows different statistics on the usage, for example when a user queried the data-
set.
Validation: Shows validations that are used on the entity, such as an assertion that
“customer_id” can’t be “null”.
Search
A very central item in DataHub is search. It is a very prominently placed on the screen,
being accessible for all users. The search runs through all data assets and provides clear
results in a Google-like search interface. The search allows easy filtering of data assets
based on the key concepts of DataHub.
igure 6.11
F
Search in DataHub
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
6.5 Technology (Data Governance Tools) 193
Technology
API
DataHub is highly extensible and provides a rich interface for developers as well as for
automation. The following APIs are available for developers:
GraphQL, a graph query processing language
OpenAPI, a standardized API definition
Timeline API for viewing version history
Rest.li API for standard APIs
Also, a native implementation is available in Java and Python. A CLI integration is also
available and offers some lightweight options to work with DataHub.
Extensible Metadata Model
DataHub’s metadata model is both powerful and comprehensive. In some cases, it is even
possible to extend this model and adjust it to a specific need within the company. To support
this use case, DataHub’s metadata model can be extended. Basically, users have two options:
either extending an existing entity or creating a new one.
In the ever-changing landscape of tools for data governance, the power of the cloud, given
the scalability, is an extension of the current trend. In this chapter, we dive into the realm
of cloud-based data governance tools, where we explore solutions offered by the giants of
the cloud computing industry: Microsoft’s Azure Purview, Google Cloud Catalog, and Glue
by AWS. These cloud data governance solutions have different capabilities, which we’ll
examine along with their features and advantages.
People
Enforcing Roles
Data stewards and data owners play an extensive role in Azure Purview. They can be set
individually on each level of the data, making it possible for users to note who is responsible
for the currently viewed data assets. They can be enabled to review changes in manual
lineage, name and notation changes within data asset and can approve or decline changes
in the metadata.
Empowering Users with Data Discovery
The Data Catalog and Metadata Management system empowers users to discover, under-
stand, and collaboratively use the data within the company. By centralizing data sources, it
enhances data accessibility and promotes understandability.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
194 6 Data Governance
Process
Data Lineage for Transparency and Governance
Azure Purview facilitates effective data governance processes by providing transparency
into the data flow. Understanding the journey of data is crucial for governance, and Azure
Purview delivers by representing data lineage and offering impact analysis tools. This em-
powers users to assess the impacts of changes at the source level, ensuring a governance
process that prioritizes transparency.
Enforcing Policies for Secure Data Management
In the realm of data security, Azure Purview plays a pivotal role in defining and enforcing
policies. Through finely grained access control mechanisms, the tool enables user permis-
sion management. By aligning with industry standards and implementing role-based access
control policies, Azure Purview ensures compliance with regulations while granting autho-
rized users the appropriate access to information. This process-centric approach reinforces
the security of the data estate.
Technology
AI-powered Data Labeling
Through features such as data classification and sensitivity labeling, organizations can clas-
sify and label data based on its sensitivity level. This aligns with compliance requirements
and ensures robust data security. The data labeling can be enforced with Purview’s AI capa-
bilities that detect similar patterns and attribute labels to that data automatically.
Improvement and Preview
It is to be acknowledged that, at the time of this book’s release, the tool is still evolving. Mul-
tiple sections remain in preview, highlighting the tool developers’ plan towards continuous
improvement. Organizations should be aware of the tool’s current state and anticipate ongo-
ing enhancements to plan their processes along the evolving capabilities of Azure Purview.
People
Roles
The Roles feature allows employees to access data while giving data managers the confi-
dence that users are only accessing and searching the right data, and that this democra
10
https://cloud.google.com/data-catalog/docs/concepts/overview
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
6.5 Technology (Data Governance Tools) 195
tization is being done securely and efficiently. This is all done through identity and access
management (IAM) roles, which allow users to search and tag resources within the Data
Catalog11.
Figure 6.13 Interface for search function within the discover function of dataplex
Empowering Users
Users are granted roles that contain certain permissions that can be customized to the
needs of the organization. Predefined roles, such as an admin, who has access to all the Data
Catalog resources, or a data steward who can add, edit, or delete text overviews for data
entries, are the more active roles, while the viewer role allows users to read the metadata
and tags of the resources.
Process
Data Lineage API
In terms of processes, the Data Catalog provides lineage and policy enforcement, which are
essential for a company’s data governance efforts. It offers lineage graph information that
can be either manually added or automatically generated within Dataplex using the Data
Lineage API. This helps organizations understand the origins of and transformations upon
their data assets.12
11
https://console.cloud.google.com/dataplex/search?project=chromatic-night-353807
12
https://cloud.google.com/data-catalog/docs/reference/data-lineage/rest
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
196 6 Data Governance
Data Protection
Additionally, Google Cloud Catalog integrates with other Google tools, such as Sensitive
Data Protection. This allows scanning of the resources for sensitive data and automatically
assigning the corresponding tags to the Data Catalog, therefore masking it from unautho-
rized users.
Technology
Support for Various Data Sources
Several Google Cloud data sources are supported by Data Catalog, while some on-premises
data sources can be integrated through the Data Catalog API. In addition, the community
contributes a large list of Python connectors for the most common on-premise data sources
which are, nevertheless, not officially supported by Google. Even unsupported data sources
can be manually integrated using Data Catalog Client Libraries or by manually building on
the API.
Integration with other Google Cloud Services
While the Data Catalog itself has limited capabilities, the data management tool it resides
within, Dataplex, allows for more complex data governance initiatives to be put in place.
One of them, Sensitive Data Protection, has been described above.
People
Roles
AWS Glue Data Catalog allows the assignment of roles such as data stewards and data own-
ers at different levels of the data estate, enabling individual responsibility for data assets.
Empowering Users
The tool simplifies the discovery of data and its associated metadata, enhancing the acces-
sibility of data assets.13
Process
Data Lineage for Transparency
AWS Glue Data Catalog facilitates effective data governance processes by providing trans-
parency into data flow and representing data lineage.
Enforcing Policies for Secure Data Management
AWS Glue plays a pivotal role in defining and enforcing policies, enabling user permission
management, and ensuring compliance with industry standards.
13
https://atlan.com/metadata-management-in-aws/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
6.6 In a Nutshell 197
Technology
Support for Various Data Sources
AWS Glue Data Catalog supports a wide variety of batch- and stream-based data sources,
including seamless support for AWS services such as S3, RDS, Kinesis, DynamoDB, and
DocumentDB.
Integration with Atlan
AWS Glue Data Catalog can be integrated with Atlan for metadata management and data
collaboration.
While AWS does not have a dedicated data governance tool, it has multiple tools and ser-
vices supporting it. These include AWS Identity and Access Management for access control,
and a dedicated tool for managing data lakes, called AWS Lake Formation.
■ 6.6 In a Nutshell
Machine Learning
Operations (ML Ops)
Zoltan C. Toth, György Móra
■ 7.1 Overview
Machine Learning Operations (MLOps) is an interdisciplinary field at the intersection of
data science, engineering, and infrastructure operations to deliver sound and performant
machine learning solutions in a reproducible, performant, and scalable way. Responsibili-
ties also include controlling the quality of models through evaluation and benchmarking,
providing stakeholders insights about how the model works with the help of model under-
standing, and monitoring the quality of deployed solutions. Systems performing these tasks
are called ML pipelines. The main difference between ML pipelines and other data pipelines
is that ML pipelines are specifically built to perform ML model-related tasks. It can be one
complex 3rd party system or custom-made components glued together by custom code
according to the needs of the task.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
200 7 Machine Learning Operations (ML Ops)
The main areas of MLOps are aligned with the machine learning lifecycle. To train ML mod-
els, feature engineering should turn the collected data into a form interpretable by the ML
method of choice. Regarding the data flow, data collection is the border between MLOps and
DataOps, and its responsibilities overlap as cleaning, aggregating, and transforming the
data might happen outside of the model training pipeline. Continuous monitoring and eval-
uation of the deployed models provide insight for business processes to refine goals and
requirements the ML models should meet, and these requirements are fed into the next
iteration of the ML cycle (Figure 7.1) to improve the model or to adapt to changing environ-
ments. While setting the goal of an ML project and evaluating at the end are important parts
of the cycle, those are more business processes, and MLOps can only indirectly support
them in the previous steps.
Goal Setting
Performance Data
Evaluation Collection
Model Feature
Monitoring Engineering
Model Model
Deployment Training
Model
Evaluation
The central element is the ML model, and the exact methods employed will depend on the
type of models deployed and how the production system will use the deployed model. With
the broader adaption of large pre-trained models that perform well on raw data, the em
phasis in MLOps is shifting from feature engineering to model adaptation, fine-tuning, and
prompt engineering, and the importance of monitoring and governance also increases as
the behavior of more complex models is less predictable. The inference of sizeable neural
network models in a performant way also needs special attention. Classical machine learn-
ing methods are still widely used for a variety of tasks. Preparing training data, governance
of the data through the training process, and extracting and storing features are still essen-
tial parts of the ML pipelines used to train those models.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
7.1 Overview 201
While data exploration is not directly part of MLOps, the ML pipeline should support data
scientists in doing exploratory data analysis and, based on the findings, implement the
feature extraction or select and tune ML methods to train the models. Data analysis is usu-
ally done in notebooks. MLOps can significantly improve the Data Scientists’ experience by
providing data access and standardizing the tools across the experimental and production
frameworks.
Data Scientists do most of their original work in the feature engineering layer of any ML
pipeline. The goal is to, based on the findings during the data exploration phase, Data Sci-
entists encode the raw data for the ML system into an interpretable format. Based on the
type of the method, feature extraction might need parameters stored or even statistics,
which need to be calculated based on the data. MLOps should provide a way to transform the
ideas into executable code in the production system or to use the experimental code directly
to generate the features. To have a good quality ML system, the data pipeline and the pro-
duction system must generate the same features for the same input in a reproducible way.
Besides reproducibility, MLOps should tackle the problem of dynamically changing feature
definitions along with their proper versioning and change management.
During training, the ML model is fitted to the training data. This optimization ensures the
ML model can emulate the processes and patterns in the data with the highest possible
fidelity. Training often involves iterative steps, refining the model in each round, and mea-
suring performance and fit on a dedicated dataset. This iterative training-evaluation-tun-
ing-training cycle is repeated until the model cannot be further improved or a specific cycle
count is reached. ML pipelines can automate and extract valuable information about the
process and intermediate data for the data scientists.
ML algorithms usually have several parameters that influence the ML model’s training. The
best parameter setting depends on the task and the training data. Finding the right param-
eter combination might involve running the training, evaluating the results multiple times,
and observing performance indicators. This process can be significantly improved with
automation by MLOps.
From the above, it is clear that training a model produces a lot of data and involves decisions
that need to be recorded. Reproducibility, transparency, and trackability can be achieved by
automatically collecting intermediate statistics, metrics, and results and storing them with
the model. MLOps solutions for storing models with metadata can store metadata about the
models and contain information about the data, features, and data used to train a model for
further reference.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
202 7 Machine Learning Operations (ML Ops)
Model training is a data and compute-intensive task. When training on large datasets or
when the number of iterations and experiments is large, the process might have to be dis-
tributed across multiple computers. Some ML training, unlike the other parts of the data
processing pipeline, dramatically benefits from using GPUs in training. Training can also be
a significant cost for classical ML methods. However, training Deep Neural Networks or
similar structures involves complex and expensive infrastructure where the proper hard-
ware selection and architecture are crucial.
Depending on how an ML model is used, the production system can access the model in
different ways. If the workload allows it, the model can be simply serialized and then loaded
by the target system. Instead of serialization, the ml pipeline can use model formats like
ONNX to store the model. This way, the training and production systems can be written in
different languages and separated into independent systems. However, not all model types
are supported on all platforms, as ONNX was originally a neural model storage format.
If the model cannot be exported into a standard format and loading it directly to the target
system is not feasible, a prediction server can wrap the models and stay in the environment
they were trained in. The microservice providing the predictions is a completely decoupled
system this way. Still, it may increase the complexity of the production system and directly
introduce less performant and stable software into critical paths in production. The predic-
tion code can be installed in a suitable environment, or the code providing the predictions
can be containerized. Directly containerizing the code written by the research team might
save time but will introduce software components to the production system that might not
designed with the necessary durability and resilience.
Classical machine learning models are computationally expensive to train but relatively
cheap to use for predictions; therefore, the performance of such systems is usually limited
to data transfer, not computation. Most deep neural network models, on the other hand,
might require expensive inference to provide output, primarily when they are used for gen-
eration instead of discrimination. The use of expensive GPU-equipped specialized hardware
for the inference might force us to use a prediction server to utilize the specialized hard-
ware for the predictions only.
Like any other part of the product, the machine learning model should perform how it was
designed. Data scientists and the product define the metrics and the criteria, and the MLOps
system’s job is to execute the evaluations and store them with the model. The evaluation can
be complex, with many experiments on different datasets, so automation is vital to exclude
human error.
In some applications, the models are continuously updated. To be able to frequently train
and deploy new models, evaluation and deployment should work similarly to CI/CD systems
in software engineering. The role of testing in an ML pipeline is fulfilled by the evaluation
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
7.1 Overview 203
of the model. If the model can meet certain evaluation criteria the system might automati-
cally deploy it to production. Some systems might shadow-deploy the model and continue to
monitor it on production traffic until it is considered safe to switch.
ML models can be as simple as linear regression, where the coefficients deliver information
about the model and individual predictions. More complex models cannot be evaluated and
verified by analyzing the learned internal parameters. We can understand and assess com-
plex classical and deep learning models by observing the model’s interaction with data.
This practically means that predictions are generated on test datasets, the features might
get perturbed, and the model’s output is correlated with the input feature values. Model
explainability can be interpreted as exploring the structure and the decision boundaries
of a model or providing insights about individual predictions. Depending on the use-case,
MLOps supports both needs.
From an MLOps standpoint, this step is similar to the evaluation. The model should be eval-
uated, data collected, statistics and graphs generated and stored.
An often overlooked part of ML productization comes after the first model is deployed. There
is a model in production with a feature extraction framework in place, but as the environ-
ment changes and more new training data becomes available, the models should be up-
dated. From a monitoring and compliance standpoint, systems should be able to track the
model used for each individual prediction. Models should have a unique identifier, and all
evaluation and training information should be available.
Since features and models are tightly coupled, the feature pipeline or storage should also be
versioned. It is often a requirement that when releasing the model, all the pipelines extract-
ing and maintaining features should not disrupt the running system. Because the model
and all the other related systems are not single units, it might be necessary to have multiple
model versions deployed simultaneously. When all components of the ML system are ready
and tested, the new version can be activated.
As features are derived from certain datasets, versioning the data and maintaining its trace-
ability through the ML pipeline is a requirement for all modern applications. Data engineer-
ing can provide provenance for the data; the ML pipeline’s responsibility is to properly
maintain the lineage throughout the system until it becomes part of the ML model meta-
data.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
204 7 Machine Learning Operations (ML Ops)
The model’s environment might be stationary; for example, industrial systems might not
change over time, but most of the data an ML model will interact with will evolve. Model
evaluation can measure the model’s performance on a specific dataset. Still, to ensure that
the parameters do not degrade in the outside world, the models need constant monitoring
and, in some cases, alerting.
The model’s predictive power is not always observable directly because of the lack of label
data. Still, other parameters, like the output distribution, can be easily collected. The MLOps
task calculates and stores the signals referenced to the models. The data can be fed into
the organization’s standard monitoring systems, or analysts can access them to do periodic
reports.
Reproducible, standardized model training lowers the burden on Data Scientists and ML
Engineers because they can rely on the ML pipeline to correctly carry out the defined tasks
and focus on solving business problems instead of dealing with the boilerplate code and
manual training and testing.
Pipelines help to track data and satisfy data governance requirements, and automatic eval-
uation, experiment, and performance tracking can provide transparency and lower the
administrative burden on data scientists. Version control and tracking eliminate costly mis-
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
7.3 Several Common Scenarios in the MLOps Space 205
takes and automate manual model deployment. Monitoring offers ease of mind and feedback
regarding model performance in production.
Introducing proper techniques and processes can lower the ambiguity of machine learning
models and present a sense of control and quality so ML components can be trusted build-
ing blocks of applications similar to other software products.
MLOps systems also help increase project predictability and can shorten model develop-
ment cycles through automation.
Data scientists use notebooks for prototyping, exploration, experimentation, and to generate
reports. The flexibility and great selection of tools and libraries make these tasks very effi-
cient. When notebooks are used for feature and model development, the final code should
eventually be transformed into a production system. This could mean the complete reimple-
mentation of everything the data scientist wrote.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
206 7 Machine Learning Operations (ML Ops)
For the correct operation of ML models, the input data must be transformed to features the
same way during prediction in the production system as it was during training.
Data scientists usually train, tune, and test models on finite datasets, but predictions in
productions are often carried out in an interactive system. The main difference is that cer-
tain calculations cannot be carried out the same way as during the training in a limited
time. The production features will be developed by ML engineers using different tools and
paradigms, which might lead to training-serving skew. Product cycles will be longer be-
cause every feature has to be implemented twice. Also, clear communication and MLOps
processes are required between the data scientists and engineers to make the development
efficient.
When the production and the experimentation systems share the same language and data
platform, the code used to perform the feature development can be directly used in the
production systems after the necessary tests and verifications are added. This method does
not need reimplementation or complicated systems. Still, it can only be used if the features
do not need statistical parameters based on the data or the production system can calculate
them identically. This limits its usability to either constant parameter feature extraction or
batch systems.
Problem: When the ML pipeline used to train the model has a different architecture than
the prediction environment or is written in another programming language, it can be chal-
lenging to maintain feature extraction code and ensure they generate identical features.
Solution 1: Feature stores
One way to decouple the production and experimentation systems is to use feature stores or
databases. The system calculating the features is separated and calculates the feature val-
ues independently. However, feature stores for low-latency applications might not solve the
problem of prediction time feature calculation.
Some more modern feature stores provide a computational platform to define and calculate
the feature values, making them similar to the next solution.
Solution 2: Feature descriptions instead of code
Sharing feature definitions only across different systems can decouple them and enable
other computational platforms to implement the same feature most efficiently. Each imple-
mentation of the feature primitives can be tested in each system separately to ensure con-
sistency and correctness. This requires a complex, often custom-built system, and when
features rely on data aggregations, it is still complicated to implement those scenarios.
Problem: Features calculated by aggregations or using external data not directly accessible
at prediction time need special treatment. The exact architecture of the production aggre-
gate feature-extraction system depends on the requirements.
Solution 1: Feature store with batch calculation
Aggregation-type features or parameters might be calculated daily in the background and
stored in a feature store or cache. This way, prediction can be fast, but the feature value will
be half a day old.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
208 7 Machine Learning Operations (ML Ops)
7.3.2.1 Tradeoffs
The above solutions require coordination among data science and engineering and involve
the product to evaluate the tradeoffs of each choice.
The first tradeoff is between the calculation’s complexity (therefore, the ML model’s preci-
sion) and the prediction’s latency. Using a shorter history for time-series data or limiting the
amount of data aggregated will speed up calculations but erode the feature’s predictive value.
Using heuristics, sacrificing the all-time consistency of the feature values, and settling for
only eventually consistent feature values can provide low latency using few resources. The
tradeoff is that the feature values are only estimated until the next batch update. For exam-
ple, counting unique values can be done by a daily batch job for the whole history. The re-
al-time system might only track the unique values since the last batch update and assume
that the two sets are distinct and the feature value will be the sum of the batch and real-time
calculated value. This way, the feature value approximates the actual value from above, and
the absolute error is maximized and not aggregated over time.
There is an aspect that is less technical but can save technical complexity. Does the feature
value make sense to calculate in particular (and usually complex) cases? Outliers can clog
any data system. Joining two tables with a few high-cardinality items might cause skew be-
cause one item has millions of items in the other table. For example, calculating the sum of
the order values in real time for the last 90 days for a few outlier customer who generates
too many transactions to keep in memory might not be valuable information because the
sum order value does not change significantly in a short time. Calculating the value fresh
does not add additional information to the feature.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
7.3 Several Common Scenarios in the MLOps Space 209
Deploying the model itself is a manageable challenge. Most modern ML frameworks have
some solution to export the model in a standard format or even generate native prediction
code for specific programming languages. H2O, for example, can create POJOs, which are
plain Java objects encapsulating the model.
While the feature extraction pipeline is not part of the strictly defined ML model, the feature
definitions and the model should be deployed together. There were attempts to make stan-
dardized model formats where feature extraction could be bundled into the model. PFA de-
livers predictions and offers a safe computational language to preprocess the data before
scoring. Feature encoding and some feature engineering can be implemented in the prepro-
cessing layer; this way, every model component is bundled together in one format.
Classical ML models consist of different kinds of parameters, and the data structures stor-
ing them are very different. For example, linear and logistic regressions have simple nu-
meric parameters, but decision trees and forests also have structures. However, the various
neural networks are similarly structured, and defining a standard format is more manage-
able. PMML used XML to describe multiple models, and each model family had to be added
to the description language, and its coverage was low.
Problem: How do we transfer the model between the ML pipeline and the system, making
the predictions?
Solution 1: ONNX
The most common system-independent model format is ONNX, and while it is designed for
neural networks, it supports other model types on specific platforms. The coverage for clas-
sical models is not perfect and varies between platforms.
Solution 2: Use the native model format
Almost all ML frameworks can export their models in a particular format. Still, it is usually
not executable on a different platform and only some pack computations with the models.
TensorFlow saves models together with the computations, which makes it easy to deploy the
models for prediction into another Python environment.
Solution 3: Use proprietary model format and predictor code
Calculating predictions is usually significantly less complex than training the models.
Sometimes, the model should be embedded in a different system than it was trained on, and
there is no suitable format. Or the model export is language-dependent and cannot be inter-
preted in another language. In this case, the model parameters can be exported into a pro-
prietary format. Because only one model should be supported, the model interpreter will be
relatively simple and could be implemented in any language.
Solution 4: Prediction server
In a real-time system, predictions should be delivered with low latency. Embedding the
model directly into the production system might give the best results. The model format
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
210 7 Machine Learning Operations (ML Ops)
should be selected to support the target platform, or the language of the production system
should be picked according to the ML model.
The microservice architecture makes integrating components written in different languages
easy and provides a solution for model deployment, too. The models can be deployed to a
prediction server, and the rest of the system can access the model via a language-indepen-
dent API. Cloud providers also employ this solution.
7.4.1 MLflow
7.4.1.1 Installation
The MLflow project is hosted on mlflow.org. It is an open-source Python package available
through pip and the Python Package index. Once installed, the MLflow tracking server,
REST API, and Web UI can be started by executing the mlflow command. With a proper con-
figuration, MLflow is capable of integrating into cloud environments, such as using cloud
objects stores like S3 or the Azure Blob Storage for storing models and third-party databases
for tracking experiments
■ 7.5 In a Nutshell
MLOps covers a wide area and requires different fields to work in unison
Delivering models to production is crucial for any ML project’s success
Features and feature extraction are tightly coupled to the ML models
Solutions to a problem might not be purely technical, and good coordination
with data science and product is essential
The tooling landscape for MLOps changes at a high speed
To get started with MLOps, take a look at MLflow. For more extensive
MLOps applications, explore Kubeflow
8
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
Machine Learning
Security
Manuel Pasieka
When discussing cybersecurity and machine learning, people usually concentrate on using
machine learning to enhance cybersecurity, such as improving intrusion detection systems,
detecting malware, and monitoring emails. They also consider the cybersecurity aspects of
systems that incorporate machine learning as a component, such as data access and control
or software supply chain security.
However, this chapter focuses on machine learning security (MLSec), which deals with the
security aspects of training and inference of machine learning models. It also explores
methods to protect the confidentiality of training data that prevents an attacker to steal and
know about the training data. Maintain the integrity of models for their intended use and
ensure their availability of the model to serve requests.
To achieve this, we will first introduce core concepts and terminology used to describe how
machine learning models can be attacked at various life-cycle stages, such as data collec-
tion, training, and inference. We will present common attack mechanisms against machine
learning models in general and, in addition, put a special emphasis on large language mod-
els (LLM). LLMs are addressed separately due to their recent widespread adoption and inte-
gration into numerous applications by development teams that often do so without consid-
ering potential security aspects.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
216 8 Machine Learning Security
The aim of this chapter is to provide an introduction to machine learning security and to
offer an overview that helps readers incorporate this awareness into their future ML proj-
ects. This knowledge can help prevent the misuse of systems in production; a threat which
may have significant legal and reputational consequences.
1
Barreno et al., (2006), “Can Machine Learning Be Secure?”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
8.2 Attack Surface 217
2
Sven Herbing, (2019), “Part 1: The attack surface of machine learning and its implications”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
218 8 Machine Learning Security
The goal of a model stealing attack is for the attacker to obtain a copy of, or information
about, the target model, such as its architecture and hyperparameters or weights, which can
be used to reproduce its decision boundaries. With the decision boundaries being the result
of the training process of a model and enable the application of a model for its intended use
case. Model stealing can be used in black-box access scenarios where an attacker has access
to query the target model (through an API, for example) or has request and response pairs.
In context of CIA, this attack breaches the system’s confidentiality, revealing information
that should not be available to a user under normal circumstances. Imagine a scenario
where the attacker is be a competitor replicating a company’s paid image classification ser-
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
8.3 Attack Methods 219
vice through a publicly available API without the owner’s knowledge, effectively stealing
the company’s intellectual property and avoiding the costs of having to develop such a
model on their own.
As we will see in the following chapters, this method is often used as a first step for an ad-
versary to gain more access to a system and enable other attack methods that would not be
possible without white-box access. The attacker achieves this by using model stealing to
train a local shadow model that behaves like the target model and serves as a proxy. With
white-box access to the shadow model, the attacker can then use methods not available in a
black-box setting to build an attack model or extract information about the training data.
Figure 8.2 illustrates the abstract process of stealing a model in a black-box access scenario,
where an attacker can query the target model in order to create a shadow model3.
In this scenario, a data owner has trained a machine learning model and created a service
that processes requests , using internal databases (DB) to generate replies . The
attacker uses the available API service and analyzes these requests and responses in order
to derive a shadow model.
This attack can also be applied when the attacker has no means to query the target model
directly, but instead has access to request-response-pairs (derived from logs, for example),
which can be used to build the shadow model.
Figure 8.3 by Papernot et al.4 demonstrates the stages of a model stealing attack. The at-
tacker begins by selecting an initial training dataset and the architecture of the shadow
model (steps 1 and 2). In practice, this means the attacker uses her knowledge about the
system to select a shadow model architecture and a training dataset that resembles the
target system close as possible. In an iterative process (steps 3-5), a new data point is
added to the training dataset (step 3), and the shadow model is trained for another iteration
(step 4) based on this new pair of data point and the response from the target model.
Based on the difference between the shadow model’s prediction and the target model’s
prediction , a new training data point is added to the training dataset (step 5), and the
process is repeated. Thus, step by step, the attacker trains a shadow model that is able to
imitate the target model.
3
Tramèr, (2016), “Stealing Machine Learning Models via Prediction APIs”
4
Papernot et al., (2017), “Practical Black-box Attacks against Machine Learning”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
220 8 Machine Learning Security
The number of training data points needed to reproduce the target model depends both on
the complexity of the task and the initial dataset and architecture selected. An attacker will
use their prior knowledge about the target model to try to replicate the initial architecture
and training dataset as closely as possible. But they will have to make a trade off between
the number of queries to the target model that cost time and resources and the quality of
the reconstruction, measured by how well the shadow model replicates the target model.
Depending on the architecture and the quality of the reconstruction, the created shadow
model will serve as a proxy of the target model, allowing the attacker to create attacks
against the shadow model that translate to the target model.
For a comprehensive survey of current research on model stealing attacks and defenses, the
reader is invited to look at Oliynyk et al.5
In summary, after a successful model stealing attack, an attacker will have recreated the
model behavior in a shadow model under her control. Limiting direct access to the target
model’s prediction is the only way to prevent this attack, which makes them especially
dangerous for public facing ML services.
5
Oliynyk et al., (2023), “I Know What You Trained Last Summer: A Survey on Stealing Machine Learning Models and
Defences”
6
Shokri et al., (2017), “Membership Inference Attacks Against Machine Learning Models”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
8.3 Attack Methods 221
One possible attack scenario is a bank that performs fraud detection based on individual
transactions containing information like credit card numbers, bank account numbers.
These transactions are classified as fraudulent or non-fraudulent by a machine learning
system. In a membership inference attack, an attacker analyzes the model’s predictions to
identify which individual transactions are in the training dataset. If successful, the attacker
gains “quasi access” to the original training data, which is highly sensitive information that
the bank otherwise invests significant resources to protect and keep secret.
In short, a successful execution enables the attacker to verify if individual datapoints are
part of the training dataset and to reconstruct at least parts of the training data.
This type of attack can be performed in a black-box access scenario or on a shadow model
extracted from the target model. It relies on a model’s tendency to change its prediction
based on the similarity of a request to a data point in the training data. In an extreme
scenario where a model has been trained to overfit heavily on the training data, a request
identical to one of the training points will produce a prediction with a very high confidence
compared to a new data point.
Hu et al.7 provide a survey of different membership attack methods, distinguishing between
white-box and black-box scenarios where an attacker can either study the target model in
detail (or a model obtained through model stealing) or access the request-response pairs
from the target model. In both cases, it is assumed that the attacker has knowledge about
the training data distribution, but not about the presence or absence of individual records.
Figure 8.5 illustrates the process of training an attack model that can determine if an indi-
vidual data point (the target data set) was contained in the training dataset (private data) or
not. In this case, an ensemble of shadow models (shadow model 1 to k) is trained on disjoint
shadow training datasets created from the same training dataset distribution as the private
dataset by sampling or synthetic generation. The different shadow models are then trained
on the same task as the target model, and a label is attached to their predictions indicating
if they have been part of the shadow training dataset. This attack training set consists of
pairs of “in” (data points that are part of the shadow training set) and “out” (points that are
not contained in the shadow training set) training points, that are used to train an attack
7
Hu et al. (2022) “Membership inference attacks on machine learning: A survey.” ACM Computing Surveys (CSUR)
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
222 8 Machine Learning Security
model. A successfully trained attack model is then able to identify if individual data points
in the target dataset are part of the private data by evaluating the target model’s predictions
on those individual data points.
So far, we have discussed methods that attack a system’s confidentiality by exposing infor-
mation about either the target model or the training data. Next, we will focus on a method
that attacks the target model’s integrity or availability by modifying the training data, caus-
ing a shift in the model’s decision boundary8.
An example of such an attack would be a fraud classification use case, where an attacker
is able to introduce “poisonous” data points into the training data, manipulating the fraud
classification model in a way that allows fraudulent transactions to pass undetected.
An attacker achieves this by introducing poisonous data points either through an already
poisoned dataset that is included in the training data, the data collection or updates. Up-
dates are particularly relevant in the case of a closed-loop system where the training data is
continuously extended with curated requests, by sending them masqueraded as normal
8
Wei Koh et al., (2021), “Stronger Data Poisoning Attacks Break Data Sanitization Defenses”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
8.3 Attack Methods 223
requests. This general attack can be targeted towards affecting a specific output class or out-
put range of a prediction or indiscriminate, affecting all predictions.
In other words, if successful an attacker is able to introduce data points into the training set,
that will confuse the model to such an extent, that it will not be able to perform the desired
task.
Figure 8.6 illustrates an attack against a binary classification algorithm that was trained on
(clean data) and, by introducing a new poisonous cluster of data points (poisonous
data), shifts the decision boundary so drastically, that it effectively “breaks” the model
integrity.
At this point, it is worth to briefly discuss how data poisoning can differ between breaking
the integrity and the availability of a system. A model’s integrity that ensures its trustwor-
thiness and correctness is broken if through data poisoning if its predictions are less cor-
rect. The availability of a model is compromised if malicious crafted input causes the system
to fail in a way that it is unable to perform its task at all.
In an image classification task that tries to identify digits from 0–9, such as the famous
MNIST dataset, a system is retaining its availability if it can answers requests with a re-
sponse. Taking “mistakes” out of the equation, if the model responds with the correct value
for a digit, it maintains its integrity. Through data poisoning of the training data, an attacker
could shift the decision boundaries in such a way that the model would always respond with
the prediction “7”, independent of the input. Such a system would be available, but that its
integrity is compromised.
Attacks on the integrity of a model through data poisoning, are therefore, causative attacks
that modify the model “permanently”. Data poisoning attacks that on the other hand are
forcing some part of the system to fail and stop to respond to requests are exploratory at-
tacks.
It is not the case, that integrity is limited to the training and availability to the operation of
a system. If an attacker successfully introduces specially crafted input data into the training
pool, that causes the training process to fail, the availability of a system is compromised.
Such an attack is difficult to detect as many systems implement automatic error recovery
and as training data loading is often randomized, the training process can fail at seemingly
random points. This is especially problematic for systems that implement machine learning
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
224 8 Machine Learning Security
operations strategies that offer automatic retraining and evaluation of models with continu-
ous deployment. A booby-trapped data point that is introduced through an update for exam-
ple will prevents these automatic retraining from succeeding, preventing any future auto-
matic model updates.
A systems integrity on the other hand can be at risk during runtime if malicious user data
is “shared” between requests. One such an example are Retrieval-Augmented Generation
(RAG) systems that are discussed later in this chapter, where an attacker can introduce
malicious input that is used only when processing new requests, but not during training by
large language models. This poisoned data can cause the system to produce unreliable and
even malicious output.
As shown in Figure 8.6, data poisoning can sometimes be prevented by data validation and
sanitation, as demonstrated in the third panel with the sanitized data , which recon-
structs the original decision boundary. However, depending on the use case, this might be
challenging to achieve in practice.
Figure 8.7 illustrates a difficult scenario where a federated learning system is under a coor-
dinated attack9. In a federated learning scenario, a model is trained by combining distribu-
tions from multiple nodes. These contributions can be in the form of data points, weights or
gradients, or complete models. It is an intrinsic property of a federated system that the
contributions of the nodes differ from each other, making it difficult to detect tampering or
malicious contributions. The coordinated attack of multiple nodes can have a cumulative
effect and, like the binary classification example in Figure 8.6, compromise the model’s in-
tegrity. A similar situation is true for the introduction of malicious data points over a longer
period.
9
Sun et al., (2020), “Data Poisoning Attacks on Federated Machine Learning”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
8.3 Attack Methods 225
Systems like recommender systems or spam filters try to leverage social credit mechanisms,
such as the reputation a user builds over time, in order to filter input from possible mali-
cious sources with a bad or unknown reputation. However, even this can be circumvented if
an attacker captures highly credible users and submits data in their name or by building up
social credit over time and then attacking the system.
In short, the only way to protect a system from data poisoning is to apply filtering mecha-
nisms to ensure that only clean data enters the training process. Distinguishing between
clean and poisoned data can be challenging, especially when data is collected over a longer
period from multiple sources. To prevent data poisoning attacks, AI software should employ
measures such as robust data validation and data source authentication.
In this chapter, we discuss adversarial attacks, which have become very prominent in the
field of machine learning security. This is partly because these attacks are very easy to
visualize and communicate to a non-technical audience.
Adversarial attacks are exploratory attacks that compromise a system’s integrity by using
specially crafted requests. These requests make a model respond in a way that is difficult
to understand for a human observer and appear like obvious prediction errors. Adversarial
attacks gained popularity through work in computer vision10, although they were initially
developed for NLP use cases in the domain of spam filters11.
An iconic example of an adversarial attack is tampering with traffic signs to deceive self-
driving cars into detecting a stop sign as a speed limit sign. In this attack, a stop sign is
modified by adding specific image patches that cause the image classification algorithm in
the car to misidentify the stop sign as a speed limit sign, breaking the system’s integrity
and causing serious safety issues.
In short, adversarial attacks provide a model with specially crafted input data that makes
the model fail in unexpected ways.
Figure 8.8 illustrates the components of an adversarial attack on an image classification
task. The original image is correctly identified as a panda with moderate confidence. After
an adversarial modification or perturbation has been added to the image, the model detects
a completely different animal (a gibbon) with very high certainty. Paradoxically, the modi-
fied image looks to a human observer almost identical to the original image.
10
Goodfellow et al., (2014), “Explaining and harnessing adversarial examples”
11
Biggio et al., 2018 “Wild Patterns: Ten Years After the Rise of Adversarial Machine Learning”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
226 8 Machine Learning Security
Beyond image classification, adversarial attacks have been successfully applied not only to
digital systems but also to real-world scenarios12, such as fooling surveillance systems and
attacking voice-controlled systems13 as demonstrated in Figure 8.9.
Figure 8.9 Two examples of adversarial attacks in other tasks and modalities
An attacker needs white-box access to the target model or a shadow model to perform an
adversarial attack. If they succeed, they will have an attack model that is able to transform
an input data point in such a way, that it fools the target model. This can happen as a tar-
geted attack in which the target model is fooled to predict a specific class, or as an indis-
criminate attack in which the target model will predict any other class but the true label.
Figure 8.10 illustrates the iterative process of such an attack against a digit classification
task, where perturbations of an input image are evaluated based on the shifts they produce
towards the target class, resulting in a specially crafted attack image that differs from the
original image by only a few pixels14.
12
Thys and Ranst, (2019), “Fooling automated surveillance cameras: adversarial patches to attack person detection”
13
Carlini and Wagner, (2018), “Audio Adversarial Examples: Targeted Attacks on Speech-to-Text”
14
Papernot et al., (2016), “Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
8.3 Attack Methods 227
This attack starts by taking an unmodified input image X and evaluating how perturbations
of X affect the predictions of the target model F(X). In a second step, a specific perturbation
is combined with the original image and its effect on changing the classification result
of the target model is tested. Steps 1 and 2 are repeated until the attacker is
successful in producing a new image X* that can fool the target model.
In case the attacker only has black-box access to the target, they can perform a model steal-
ing attack first, to create a shadow model to which they have white-box access instead of the
target model.
Research shows that adversarial attacks are surprisingly robust15, allowing them to be often
transferred between very different models. This transferability of adversarial examples be-
tween different models is possible because of “non-robust features”16 that are contained in
the data and are used equally by different models. The non-robust features have no semantic
value to a human observer but are highly predictive of the tasks the model tries to solve.
In the image domain for example, non-robust features can be textures, background colors or
high frequency image components. These features are normally ignored by humans, for
example to classify an animal, and instead we focus on physical traits.
In summary, adversarial attacks make a model fail by providing it with specially crafted
input data. Researchers are still uncertain if adversarial attacks are an unavoidable prop-
erty of deep neural networks and therefore a potential category of attacks that will remain
relevant.
The final method in our series are backdoor attacks, that similar to adversarial attacks,
compromises the integrity of the target model if specific malicious input is provided. A
compromised model will operate normally until, a specific trigger pattern is present. The
15
Waseda et al., (2022), “Closer look at the transferability of adversarial examples: how they fool different models
differently”
16
Ilyas et al., (2019), “Adversarial Examples Are Not Bugs, They Are Features”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
228 8 Machine Learning Security
trigger pattern will cause the model to switch into an alternative operation mode that is
defined by the attacker.
While adversarial attacks are exploratory and work on an instance-by-instance bases where
each individual image needs to be specially prepared, backdoor attacks are causative and
work with any input that contains a specific trigger pattern.
An example of this attack is a scenario where an attacker modifies an image classification
model which a company uses as a base model for their intrusion detection system. The com-
pany uses the base model and fine-tunes it to detect people entering a restricted area. Unbe-
knownst to the company, the attacker has included a backdoor in the model that disables its
people detection mechanism if a specific image patch is present in its input. This modified
model contains a backdoor that breaks the integrity of the intrusion detection system, giv-
ing control to the attacker.
In short, backdoor attacks modify the training process or training data of a model in a way
that enables the attacker to control the task the model is performing through the presence
or absence of a trigger.
Figure 8.11 illustrates the concept of backdoors as separate neural networks that are trig-
gered by specific input patterns (middle panel in the figure). In practice, an attacker would
try to incorporate the backdoor network into the target model so its existence remains hid-
den17, and the network is not active until triggered (right panel in the same figure).
A backdoor can be installed by using specially crafted training data or by modifying the
training process.
The case of a backdoor installation with modified training data18 is conceptually similar to
the combination of adversarial attacks and data poisoning. Adversarial examples that
17
Gu et al., (2019), “BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain”
18
Saha et al., (2019), “Hidden Trigger Backdoor Attacks”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
8.3 Attack Methods 229
a ppear like normal data points contain a special trigger pattern used to teach a model when
to activate the backdoor. During inference without the presence of the trigger pattern, the
model behaves normally but changes its prediction once the trigger pattern is present.
Figure 8.12 demonstrates such modified data points, where the attacker creates a patched
source image (third panel from the left, containing a colorful trigger pattern) with an “at-
tack class” and combines it with a clean target image. This modified image is then labeled
with the target image class, which appears unmodified to any human observer. During in-
ference, if the trigger pattern is absent, the model will predict the correct class, but once the
trigger pattern is present in the input, the model will predict the attack class, “ignoring” the
rest of the image. This results in poisoned target data points that appear completely normal
to a human observer, but to a machine learning model look like the patched source images.
This type of attack is difficult to detect because the poisoned training data looks completely
normal to a human observer, and the backdoor is activated by a trigger through a mecha-
nism of “non-robust features” as described in the section on adversarial attacks.
The second option to install a backdoor is the modification of the training process of a
model19. By applying a malicious training loss calculation and poisoned training data,
researchers were able to create backdoors that operate with multiple trigger patterns, each
of them triggering another type of “operation”.
Figure 8.13 demonstrates a multi-purpose backdoor. In the left panel a digital classification
algorithm is backdoored in a way to perform the addition or multiplication, and on the right
panel an image analysis model that detects the number of people in an image can be trig-
gered to detect the presence of a specific person. In these cases, an attacker is misusing the
model for a purpose other than its intended use. They highlight the fine-grained control
backdoors offer to an attacker compared to adversarial examples.
In summary, a successful backdoor attack introduces specially crafted training data or a
modified training process, that enable the attacker to control the task a model is perform-
19
Bagdasaryan et al.,(2021), “Blind Backdoors in Deep Learning Models”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
230 8 Machine Learning Security
ing. As the newest type of attack presented here, the attack scenario as well as mitigation
strategies are still unclear.
■ 8.4 M
achine Learning Security of Large
Language Models
In the previous section, we discussed various attacks on machine learning models through-
out their life cycle. Although many examples were from the image domain, these attacks
can also be applied to other modalities, such as audio or text.
This chapter aims to emphasize how these attacks can be applied to natural language pro-
cessing (NLP) and large language models (LLMs) specifically.
The focus on LLMs is due to two main reasons. Firstly, LLMs have recently been a popular
research topic, and many new attack vectors are emerging. Secondly, LLMs have been
widely adopted in various domains and applications. In most cases, LLMs are integrated
through service provider APIs or self-hosted open-source models20 like Llama, Falcon or
OPT. In both cases, very little consideration is given to the security aspects of LLMs, and it
is the objective of this section to raise awareness of the security concerns of using LLMs in
customer-facing applications.
For an introduction into the topic of data extraction, please refer to the previous section
8.3.2 under Attack Methods.
20
https://explodinggradients.com/the-rise-of-open-source-large-language-models
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
8.4 Machine Learning Security of Large Language Models 231
Figure 8.14 illustrates a membership inference attack in which a language model is used to
complete a task or continue a prompt with the goal of exposing confidential information
contained in the pre-training data.
Researchers have succeeded in recovering parts of the training data in early generations of
LLMs21 like GPT-2 and found bigger networks to be even more susceptible to memorization
and therefore membership inference attacks. These attacks exploit the highly overparam
eterized model’s capacity to memorize individual training sequences, potentially exposing
critical information such as secrets or personal identifiers. Follow-up research22 focused on
reducing false positive rates, making such an attack more practical.
More recent research focuses on evaluating how membership inference attacks can be ap-
plied to extract information from fine-tuning and in-context learning data. Fine-tuning is the
process of taking a pre-trained base model and adjusting it based on a new dataset, while
in-context learning is a method that queries trained models, providing instructions and
examples for a new task as part of the input.
This distinction between pre-training and fine-tuning is important because, in many appli-
cations, fine-tuning is performed on sensitive or private data, while a base model is pre-
trained on public data that does not expose secret information. This is the case regardless
of whether the base model is provided by an LLM service provider like OpenAI and Antropic
or a self-hosted open-source model is used.
Research23 clearly shows that data used for fine-tuning can be extracted by querying the
model, exposing this information to an attacker. Fortunately, practitioners can have a strong
impact on how much information an attacker can extract from the model, by choosing the
specific fine-tuning method.
For in-context learning, where information is provided as part of the user prompt to control
the model’s responses, research has shown24 that attackers can extract parts of the prompt
easily.
21
Carline et al., (2021), “Extracting Training Data from Large Language Models”
22
Mireshghallah1 et al., (2022), “Quantifying Privacy Risks of Masked Language Models Using Membership Inference
Attacks”
23
Mireshghallah et al., (2022), “Memorization in NLP Fine-tuning Methods”
24
Duan et al., (2023), “On the Privacy Risk of In-context Learning”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
232 8 Machine Learning Security
In summary, one has to be aware that information can be extracted about the training data
as well as operation instructions.
8.4.2 Jailbreaking
Modern user-facing large language models (LLM) systems contain instructions, called sys-
tem messages, on how to interact with the user, avoid discrimination against specific groups
or genders, and prevent the spread of misinformation or criminal content.
Jailbreaking refers to the process of manipulating an LLM to act against these instructions25.
For example, a company uses an LLM to moderate public discussions on its website or mar-
keting channels about its products. This service could be targeted by an attacker who jail-
breaks the LLM, causing it to respond inappropriately and damage the company’s reputa-
tion, as the LLM’s misbehavior is attributed to the company.
Although research on jailbreaking is relatively recent, it has been shown26,27 that the de
velopment of new jailbreaks and defenses resembles an arms race between attackers and
defenders, similar to the development of antivirus software or software vulnerabilities in
cybersecurity. How difficult it is to prevent jailbreaking can be seen in recent works28 where
systems where compromised depside applying mitigation strategies that are out of the
question for most normal use cases. As discovered jailbreaks can often be applied to multi-
ple models and LLM services, it is crucial to stay up-to-date and continuously adjust system
defenses against new jailbreak versions. While systems like Nvidia’s NeMo-Guardrails29
have been developed to act as “AI-Firewalls” and protect against jailbreaks and LLM abuse,
they have also been shown30 to be imperfect and likely can be circumvented.
One of the new frontiers of LLM systems is their multi-modal use, which adds the ability to
understand and create images for conversational agents. Like text-only systems, guardrails
must be applied to restrict the system’s use and the content it generates. Recent research31
in jailbreaking multi-modal conversational agents has shown how images provided during
a conversation can be used to bypass a model’s defenses and make it act contrary to its in-
structions. Other research32 has extended jailbreaks to modalities like audio, and it can be
assumed that any additional modality and input source has the potential to be used as a new
channel for jailbreaking LLMs.
Figure 8.15 illustrates how a specially crafted image can be used to disable a model’s de-
fenses, causing it to obey harmful instructions.
25
Kang et al., (2023), “Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks”
26
Deng et al., (2023), “JAILBREAKER: Automated Jailbreak Across Multiple Large Language Model Chatbots”
27
Zou et al., (2023), “Universal and Transferable Adversarial Attacks on Aligned Language Models”
28
Schulhoff et al., (2023), “Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a
Global Scale Prompt Hacking Competition”
29
https://github.com/NVIDIA/NeMo-Guardrails
30
https://www.robustintelligence.com/blog-posts/nemo-guardrails-early-look-what-you-need-to-know-before-
deploying-part-1
31
Qi et al., (2023) “Visual adversarial examples jailbreak aligned large language models”
32
Bagdasaryan et al., (2023), “(Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
8.4 Machine Learning Security of Large Language Models 233
In summary, a successful jailbreak attack is using specially crafted instructions (in multiple
modalities) to convince an LLM to expose information or react against its operation instruc-
tions.
Another way to attack systems that apply LLMs to process user input is prompt injection.
Prompt injection is a combination of data poisoning (see section 8.3.3, Data Poisoning) with
an optional jailbreak. An attacker can use a jailbreak to circumvent defenses provided as
system instructions or input filters and combines them with harmful instructions that can
lead to a compromise the systems integrity or confidentiality.
An example attack scenario involves an LLM-enhanced enterprise search system that works
with various documents and media within a company. Users can ask questions and engage
in conversations about these documents. An attacker injects malicious prompts in the form
of documents into the system, which are then used to answer related user requests. In cases
where the poisoned documents are used, the malicious prompts modify the LLM’s behavior,
causing the answers to be useless or, in a worse case, tricking a user into executing steps
that benefit the attacker.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
234 8 Machine Learning Security
In short, as part of a successful prompt injection attacks, the attacker introduces malicious
content that makes the system respond under specific conditions, which might be con-
trolled by the attacker, in unintended ways.
Figure 8.16 illustrates a scenario in which an attacker introduces malicious content into the
system indirectly. In such a document-based question answering system, a model makes
use of documents and media to answer user requests. This exposes the system to an at-
tacker that is able to insert malicious documents into the system, compromising the sys-
tems integrity.
33
Greshake et al., (2023), “Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications
with Indirect Prompt Injection”
34
Cohen et al., (2024), “Here Comes The AI Worm: Unleashing Zero-click Worms that Target GenAI-Powered
Applications”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
236 8 Machine Learning Security
with sufficient expertise about machine learning security a company should consider host-
ing its own models.
Based on the AI threat model, the individual risk for a project can be evaluated, and deci-
sions can be made about which risks a project is willing to take and what needs to be ad-
dressed. This can help answer questions about architectural decisions, such as using LLM
service providers or hosting your own models, or more fine-grained features about how
much information and control is exposed to the user. Additionally, an AI threat model helps
decide where to invest more resources, for example, in machine learning security training
or in buying expertise in the form of consultancy.
One important consideration when thinking about AI threat modeling is the level of
exposure and access an attacker has to the systems. When describing the attack surface in
Figure 8.1, we highlighted the exposure of the ML system during inference to the public.
This will not be the case in all scenarios.
Excluding very recent developments accelerated by the integration of LLM services in cus-
tomer-facing applications, most ML systems have been used for internal services and appli-
cations. This, of course, enables their abuse and access to an attacker who has gained access
to the internal network of a company, but the result of such an attack will, in many cases, be
covered by existing cybersecurity protocols. For companies that have their first commercial
ML services exposed to the public, machine learning security is particularly important, as
the abuse and interruption of these services can have significant reputational and legal
consequences.
Once there is a clear understanding of who would be attacking what parts of a system for
which reason, a more formal thread modeling can be performed that follows well estab-
lished best practices from cybersecurity. The reader is encouraged to have a look at methods
like STRIDE35 that can be used to create thread models for different systems and processes
as well as tools like OWASP’s threat-dragon36 that can be used to create a threat report.
■ 8.6 Regulations
Before closing the chapter, we want to have a quick look at upcoming European regulations
relevant to the topic of machine learning security.
In particular, we will focus on two related acts:
EU Artificial Intelligence Act (AIA)37: This act regulates the use of AI based on a “risk-
based” approach, imposing regulatory burdens when an AI system is likely to pose high
risks to fundamental rights and safety.
EU Cyber Resilience Act (CRA)38: This act describes the cybersecurity requirements for
hardware and software products with digital elements placed on the market of the Euro-
35
https://learn.microsoft.com/en-us/archive/msdn-magazine/2006/november/uncover-security-design-flaws-
using-the-stride-approach
36
https://github.com/OWASP/threat-dragon
37
https://www.euaiact.com/
38
https://www.european-cyber-resilience-act.com/ and https://digital-strategy.ec.europa.eu/en/policies/
cyber-resilience-act
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
238 8 Machine Learning Security
pean Union, with a focus on increasing the resilience and security of products and soft-
ware.
Both acts have recently39,40 been agreed upon, with both coming into effect by 2025. The AIA
applies strong regulations to high-risk systems. Important from a machine learning security
perspective are the obligations to ensure the cybersecurity best practices as described in
Article 1541:
“4. High-risk AI systems shall be resilient as regards attempts by unauthorised third parties to
alter their use or performance by exploiting the system vulnerabilities. The technical solutions
aimed at ensuring the cybersecurity of high-risk AI systems shall be appropriate to the
relevant circumstances and the risks. The technical solutions to address AI specific
vulnerabilities shall include, where appropriate, measures to prevent and control for
attacks trying to manipulate the training dataset (‘data poisoning’), inputs designed to
cause the model to make a mistake (‘adversarial examples’), or model flaws.”
European Artificial Intelligence Act
What the AIA defines as high-risk systems is described in the Annex III42 which includes
use cases of AI in critical (digital) infrastructure, education and vocational training, employ-
ment, workers management, and others.
Unfortunately, a recent report43 on the cybersecurity aspects of the AIA has found that cur-
rent machine learning security methods might not be sufficient to protect AI models and
ensure complete compliance for high-risk AI systems.
“Although the state of the art for securing AI models has limitations, AI systems may still
achieve compliance with the AI Act’s cybersecurity requirement as long as their cybersecurity
risks are effectively mitigated through other measures not exclusively deployed at
AI model level. However, this may not always be possible, and indeed for some high-risk
AI systems using emerging AI technologies, it may not be feasible to achieve compliance
with the cybersecurity requirement of the AI Act unless in their design these systems additionally
introduce new cybersecurity controls and mitigation measures of proven effectiveness.”
Cybersecurity of Artificial Intelligence in the AI Act
The report highlights the importance of seeing machine learning security and the protec-
tion of AI models not as an isolated topic, but as part of the cybersecurity of the complete AI
system.
To increase the cybersecurity resilience and the safety of software, including AI systems is
the goal of the Cyber Resilience Act (CRA), which is why (in the context of machine learning
security) the AIA and the CRA come hand in hand.
The CRA in Annex I44 defines essential cybersecurity requirements for software that falls
into Class I, II or high-risk AI systems, as defined in Annex III45.
39
https://ec.europa.eu/commission/presscorner/detail/en/ip_23_6473
40
https://www.consilium.europa.eu/en/press/press-releases/2023/11/30/cyber-resilience-act-council-and-
parliament-strike-a-deal-on-security-requirements-for-digital-products/
41
https://www.euaiact.com/article/15
42
https://www.euaiact.com/annex/3
43
https://publications.jrc.ec.europa.eu/repository/handle/JRC134461
44
https://www.european-cyber-resilience-act.com/Cyber_Resilience_Act_Annex_1.html
45
https://www.european-cyber-resilience-act.com/Cyber_Resilience_Act_Annex_3.html
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
8.7 Where to go from here 239
The reader is encouraged to read Annex I in detail, but with the limited space available in
this chapter, one can summarize the requirements as applying and verifying the effect of
state of the art cybersecurity best practices by internal and third-party (i. e., penetration
tests).
Performing regular penetration tests by red teams is an established process and the stan-
dard as well as required in many industries like finance, health or safety-critical applica-
tions. Based on the AIA, CRA and the explosion of AI related security research, one can
expect that penetration tests will in the future include AI-focused attacks of the kind ex-
plained in this chapter.
In summary, regulations require the operators of ML systems to take security seriously, and
current attack and defense mechanisms demand a holistic approach that is ensuring the
cybersecurity of the complete AI system. The importance of machine learning security will
continue to grow hand in hand with other cybersecurity affords to protect modern computer
systems.
46
https://atlas.mitre.org/
47
https://mltop10.info/
48
https://llmtop10.com/
49
https://owasp.org/
50
https://owasp.org/www-project-top-ten/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
240 8 Machine Learning Security
For readers interested in research in the field, we recommend the curated publication lists
on machine learning security51 and privacy52 and the conference SaTML53.
To engage with a community about machine learning security, we recommend the MLSecOps
community54, which offers regular updates and news in the form of podcast interviews and
blog posts.
Companies with high exposure to business-critical ML systems and the budget to invest in
multiple teams for their cybersecurity efforts often create three teams (red, blue, yellow)
focused on different aspects of machine learning security and cybersecurity55. In this
configuration, the red teams focus on developing and understanding ways to attack their
systems, the blue teams focus on developing defense mechanisms, and the yellow teams
focus on incorporating the findings of the red and blue teams into the products and ser-
vices. A good starting point for such efforts is the Adversarial Robustness Toolbox56, an
open-source Python-based machine learning security library developed by IBM that pro-
vides attacks ad defenses for a wide range of methods.
■ 8.8 Conclusion
Like cybersecurity in the past, machine learning security has been an underappreciated
fringe topic in the field of artificial intelligence.
However, with the growing adoption and widespread use of large language models (LLMs)
this is changing, increasing the necessity for ML engineers and developers to understand
the basic workings and threats of attacks against machine learning models.
ML systems can be attacked at any stage of the model life-cycle such as data collection,
training, and inference. These attacks can compromise the confidentiality, integrity, and
availability of the system, exposing sensitive training data, manipulating model behavior
with specially crafted input data, or rendering the system unusable for its intended pur-
pose.
Using LLMs in production is particularly challenging at the moment, as there is a high
market pressure to include them in existing products and services despite many unsolved
security issues like jailbreaks and prompt injection. For companies hosting their own LLMs
it is essential to stay up to date with new attack methods and defenses.
Decades of arms race between attackers and defenders in cybersecurity have shown that it
is impossible to defend a system completely. It is therefore essential that a system is con
tinuously monitored and checked against tempering in order to detect and defend against
intruders.
51
https://github.com/gnipping/Awesome-ML-SP-Papers
52
https://github.com/stratosphereips/awesome-ml-privacy-attacks
53
https://satml.org/
54
https://mlsecops.com/
55
https://danielmiessler.com/p/red-blue-purple-teams/
56
https://github.com/Trusted-AI/adversarial-robustness-toolbox
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
8.9 In a Nutshell 241
AI threat modeling is one of the first steps to improve a projects security, as it serves to as-
sess a project’s individual risk exposure and informs decisions on how to build and operate
a system.
The European Artificial Intelligence and the Cyber Reliance Act require the application of
state of the art cybersecurity best practices, including regular penetration tests to high-risk
AI systems, that will result in an increase in relevance of machine learning security for
compliance reasons.
■ 8.9 In a Nutshell
Mathematics
Annalisa Cadonna
“We will always have STEM with us. Some things will drop out of the public eye
and go away, but there will always be science, engineering and technology. And
there will always, always be mathematics”
Katherine Johnson
What are the topics in mathematics that are needed to understand machine
learning models and algorithms?
What are matrices and vectors?
How does matrix decomposition work and how is it used in data science?
What is gradient descent and how does it work?
What is constrained optimization?
How can we incorporate uncertainty in machine learning?
What is a random variable?
What is the difference between continuous and discrete random variables?
When are two variables independent?
Machine learning algorithms have become readily accessible in the last few years thanks to
easy-to-use software libraries, where low-level details are not shown to the practitioner.
Such accessibility allows a growing number of applications to exploit the potential of ma-
chine learning and artificial intelligence. However, it also poses a risk: the practitioners
might not understand fully the inner workings and limitations of the algorithms they use.
While skilled in programming languages and computation, data scientists with a back-
ground in computer science and software engineering might lack the mathematical founda-
tions to deeply understand some machine learning models and algorithms.
The goal of this chapter is to make the reader familiar with the four mathematical fields
which constitute the basis of machine learning and artificial intelligence: linear algebra,
calculus, optimization, and probability theory. The first three fields are deterministic, mean-
ing they do not leave room for uncertainty. Probability theory, on the other hand, provides
us with tools to account for uncertainty, which is key in data science.
Of course, a chapter is not enough to explain all the mathematics at the base of data science!
We have omitted the topics that are not crucial for the understanding of the next chapters of
this book. If you are familiar with any of the topics covered in this chapter, feel free to skip
the relative section. If you are new to the topics, this chapter will equip you with the skills
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
244 9 Mathematics
to understand the foundations of most models and algorithms employed in data science. You
should be able to follow along even without a quantitative background. However, we assume
that you are familiar with the set of real numbers and its properties.
In Section 9.1 we outline the foundations of linear algebra; Section 9.2 presents a few topics
in Calculus and Optimization. Section 9.3 introduces the reader to probability theory. Fi-
nally, Section 9.4 concludes by summarizing the key concepts presented in the chapter.
In the next pages, you will often see the notation R2, R3, . . . Rn. What do these symbols
mean? R2 is the set of all ordered pairs of real numbers, or 2-tuple of elements of R. This
concept can be generalized to Rn, which is the set of all n-tuples of elements of R. Such
n-tuples of elements of R are called vectors in linear algebra. Another way to think of vec-
tors is as arrays of numbers.
We indicate vectors with a lowercase bold letter. When we want to write a generic vector,
each element is written lowercase and is identified by one single index. Moreover, we use by
default the so-called column vectors format, in which the elements of the vector are placed
under each other in a column. As an example, we write:
Vectors live in a structured space called (finite) vector space. Loosely speaking, a vector
space is any space which contains mathematical objects that can be multiplied by scalars
and added together, resulting in an object of the same type. In linear algebra, a scalar is an
object that scales the elements of a vector space: in our case, a scalar is simply a number.
According to this definition, is a vector space.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
9.1 Linear Algebra 245
As mentioned, adding two vectors of the same length results in another vector, which is
obtained by element-wise addition. When we multiply a vector by a scalar coefficient we
obtain a scaled vector, where each element of the vector is multiplied by the scalar.
The second type of object at the foundation of linear algebra is a matrix (plural ‘matrices’).
A matrix is a rectangular object with rows and columns. We usually denote a matrix with a
bold capital letter and denote each element of the matrix with a capital letter with two in-
dexes, one for the row number and one for the column number. For example,
is a 2-by-3 matrix. Of course, when doing practical calculations, we replace the generic ele-
ments with real numbers.
As we did with vectors, we can define the addition of matrices as element-wise addition and
the multiplication by a scalar coefficient as the operation that returns a matrix with every
element multiplied by a scalar. To add two matrices, they must have the same dimension,
that is, the same number of rows and columns.
When doing calculations that involve matrices and vectors, we often encounter the concept
of the transpose of a matrix. The transpose of a matrix is the matrix with its rows and col-
umns inverted. This means that the number of rows becomes the number of columns in the
transpose, and the number of columns becomes the number of rows. The transpose of is
denoted with .
An important consideration to make before we proceed is that a vector is a matrix. Specifi-
cally, a column vector of length is a matrix with rows and one column, and a row
vector of length is a vector with one row and n columns. Hence, the transpose of a column
vector is a row vector with the same elements, and vice versa.
An important class of matrices in linear algebra are the so-called square matrices, which
are matrices where the number of rows is the same as the number of columns. The most
popular square matrix is the identity matrix, which has ones on the diagonal and zero
elsewhere. We denote the n-by-n identity matrix with . As an example, we write the 3-by-3
identity matrix as
The identity matrix plays the role of the number one in linear algebra: each square matrix
multiplied by the identity matrix results in the original matrix itself: that is, . This
will become clear once you know how matrix multiplication works. We will get there in
Section 9.1.2.
The identity matrix allows us to define the inverse of a matrix. Specifically, if there exists a
matrix for which it holds that , is called the inverse of and denoted as
. When the inverse of a matrix exists, the matrix is called ‘invertible’ or ‘nonsingular’.
When the inverse of a matrix does not exist, the matrix is called ‘noninvertible’ or ‘singular’.
If a matrix is invertible, its inverse is unique, which only one inverse exists. When ex-
ists, there are several algorithms to calculate it analytically. However, in data science, we
rarely calculate the inverse directly, mostly because of computational limitations. Nonethe-
less, the inverse is important for two main reasons:
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
9.1 Linear Algebra 247
is a 3-by-3 diagonal matrix. While the most popular diagonal matrices are square, we can
have also rectangular matrices that are diagonal, if the only elements different than zero are
the ones with the same column and row index. Other important square matrices are trian-
gular matrices. A square matrix is lower triangular if all the entries above the diagonal are
zero, while it is upper triangular if all the entries below the main diagonal are zero.
Another linear algebra object worth knowing, which could make your calculations easier, is
the trace. The trace is the sum of all the diagonal entries of a square matrix. For an n-by-n
matrix the trace is denoted as and defined as
The trace operator allows us to manipulate operations involving matrices by using two use-
ful properties:
the trace operator is invariant to the transpose, , and
the trace operator is invariant to multiplication obtained by taking the last element of the
product and placing it in the first position. This means that .
Before proceeding to describe the operations between matrices and vectors, we want to
make the reader aware of the existence of tensors, which can be seen as a generalization of
matrices to a higher dimension. Tensors play an important role in fields where structured
data has more than two dimensions. For example, a red-green-blue image can be repre-
sented as a three-dimensional array. The first and second dimension indicate the pixel, and
the third dimension contains the red, green, and blue values for each pixel.
In Section 9.1.1, we saw how both matrices and vectors can be multiplied by a scalar and
added to each other. Now, we explain a few other operations involving matrices and/or vec-
tors.
Matrix Multiplication
Let us consider an m-by-n matrix and an n-by-p matrix . The product is defined to be
the m-by-p matrix , with entries in row i and column j given by
Equation 9.1
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
248 9 Mathematics
In words, Equation 9.1 says that the entry in row i and column j of is computed by taking
the i-th row of and the j-th column of , multiplying the corresponding entries, and then
summing up.
Vector Multiplication
When we deal with vectors, we encounter two types of products.
1. The inner product, also called dot product or scalar product, is obtained by summing the
products of corresponding entries in the two vectors. The inner product results in a sca-
lar and is only possible between two vectors of the same length. We denote the inner
product between and as or .
2. The outer product between two vectors is obtained by multiplying each element of the
first vector by each element of the second vector, and it results in a matrix. If the two
vectors have dimensions n and m, the resulting matrix is an n-by-m matrix. We denote the
outer product between and as or .
Matrix-vector Multiplication
To define matrix-vector multiplication, it is useful to think of a vector as a matrix. As we saw
before in the case of matrix multiplication, the number of columns of the first element
needs to equal the number of rows of the second element. Hence, we can multiply a row by
a matrix, or a matrix by a column vector.
Equation 9.2
Eigenvalues and eigenvectors reveal important information about a matrix. The eigenvec-
tors of a matrix are the directions along which applying the linear transformation induced
by the matrix results simply in scaling (and/or flipping the sign). The factors the eigenvec-
tors are scaled by are called ‘eigenvalues’. In mathematical terms, this concept is repre-
sented via the eigenvalue equation: a non-zero vector is an eigenvalue of if it satisfies
Equation 9.4
for some scalar .
In general, when we find one eigenvector of a matrix, any vector that is a multiple of the
eigenvector satisfies Equation 9.4 and hence is an eigenvector itself. To simplify things, one
usually considers the unit eigenvectors, which is simply an eigenvector whose Euclidean
norm is one.
A matrix whose eigenvalues are all positive is called ‘positive definite’, and a matrix whose
all eigenvalues are negative is called ‘negative definite’. If there are some zeros amongst the
eigenvalues, we talk about ‘positive semidefinite’ and ‘negative semidefinite’. The concept of
a positive definite matrix is central in linear algebra, as positive definite matrices are usu-
ally “nice” matrices to work with.
The product of the eigenvalues of a matrix is the determinant, denoted as . The de-
terminant is a number, and its absolute value can be interpreted as the scaled factor by
which applying the linear transformation induced by expands or contracts volumes. If the
determinant is one, then preserves the volume. If the determinant is zero, the volume is
completely lost. When the determinant of a matrix is zero, the matrix is not invertible. A
useful property of the determinant is that
Let us assume that a matrix has n eigenvectors with corresponding eigenvalues. Let us
consider a matrix , whose columns are the eigenvectors of , and a diagonal matrix ,
whose diagonal contains the eigenvalues of . The eigendecomposition of is given by
Equation 9.5
We also say that a matrix is diagonalizable if it is similar to a diagonal matrix. In mathe-
matical terms, being similar to a diagonal matrix means that there is an invertible matrix
such that is diagonal.
Not every matrix can be decomposed using eigenvalue decomposition, and for some matri-
ces the eigenvalues and eigenvectors can be complex. In data science, we usually have to
decompose real symmetric matrices, such as the covariance matrix. Real symmetric matri-
ces are very likable, as they have real-valued eigenvalues and eigenvectors. Moreover, their
eigenvector matrix is orthogonal, which means that . The decomposition for real
symmetric matrices always exists and Equation 9.5 becomes
We arrange the eigenvalues in the diagonal in descending order. Under this convention,
if the eigenvalues are unique, then the eigendecomposition is unique. The eigenvectors
corresponding to unique eigenvectors are linearly independent.
Principal Component Analysis (PCA) is a very well-known machine learning algorithm
which is based exclusively on linear algebra and specifically on the eigendecomposition. PCA
is largely used in data science for dimensionality reduction. Often, PCA is a step used to re-
duce the dimension of the data before processing with classification, regression, or clustering.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
252 9 Mathematics
The goal of PCA is to find a low-dimensional representation of the data that maintains as
much information as possible. This is equivalent to capturing the largest amount of vari-
ance in the data. It can be shown, by following an iterative approach, that the eigenvectors
of the covariance matrix are the directions of maximum variance. The variance of the data
projected on each eigenvector is then equal to the corresponding eigenvalue.
Matrix decomposition is useful in data science for various reasons. First, it allows us to look
at the data from a different perspective. Second, it enables dimensionality reduction, while
retaining most of the signal present in the data. Third, decomposing a matrix can facilitate
computation. In fact, it allows to decompose matrices into simpler parts, making it easier to
carry out complex matrix operations.
In this section we look at three additional matrix decomposition methods: LU decomposi-
tion, singular value decomposition (SVD) and Cholesky decomposition.
LU Decomposition
LU decomposition is often used to efficiently solve systems of equations without having to
calculate the inverse. It consists in decomposing a matrix in the product of a lower triangu-
lar matrix, and an upper triangular matrix, That is, we can write
Let’s see how can solve the system in Equation 9.2 using LU decomposition. First, we re-
place with : the system becomes Now, we can solve this system in two steps:
first, we solve for ,
second, we solve for
Each system is straightforward to solve because the matrices involved are triangular and we
can use forward substitution and backward substitution, respectively.
Equation 9.6
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
9.2 Calculus and Optimization 253
Equation 9.7
where and are the matrices in the singular value decomposition of . is obtained by
the diagonal matrix in the singular value decomposition by replacing the non-zero ele-
ments with their reciprocal and then taking the transpose.
Going back to our system of equations in Equation 9.2 with a non-square matrix :
If has more columns than rows, through the Moore-Penrose pseudo-inverse we obtain
one of the infinitely many solutions. Specifically, we obtain the solution with the smallest
Euclidean norm.
If has fewer columns than rows, there are no solutions. In this case, the Moore-Penrose
pseudoinverse provides a solution such that is the closest to in the Euclidean norm.
Cholesky Decomposition
The Cholesky decomposition is considered the square root for matrices. As the square root
of a number allows us to decompose the number into two identical components, the Chole-
sky decomposition splits a matrix into two components, which are identical except that one
is the transpose of the other. The Cholesky decomposition can be applied only to symmetric,
positive definite matrices, and looks like where is a lower diagonal matrix with
positive diagonal elements. is unique and is called the ‘Cholesky factor’ of .
Since the covariance matrix is a symmetric positive definite matrix, the Cholesky decompo-
sition is often used in statistics and data science. For example, it is used for efficiently gen-
erating samples from a multivariate Gaussian distribution, and to perform linear transfor-
mations of random variables. The Cholesky decomposition is also useful to efficiently
calculate the determinant of a matrix. In fact, triangular matrices have the property that the
determinant is the product of the diagonal elements. This, together with the fact that
, makes it straightforward to calculate the determinant.
Equation 9.8
where is the input and the output. The input and output of a function can be numbers,
vectors, or matrices. In many applications in machine learning, is a vector in and
is a real number. In this case, we say that the function maps to .
9.2.1 Derivatives
If you have worked with machine learning algorithms, it is likely that you have heard mul-
tiple times the term ‘gradient’. The gradient is a generalization of the derivative to a multi-
variate setting. Let’s quickly review the concept of derivative before moving to the gradient
and other more interesting objects. The derivative of a function measures how much the
output is sensitive to a change in the input. Consider a univariate function , which is
well-defined in the neighborhood of a number . The derivative of in a is defined as the
following limit:
Equation 9.9
Example: Derivative
Consider the function . We want to find the derivative in , we
have that
We can apply the derivative transformation for each possible value of the input x, in which
case we obtain another function, called the ‘derivative function’ or simply ‘derivative’.
Using the definition in Equation 9.9, we can calculate most derivatives. However, it is useful
remember a few cases and rules that make our life easier:
the derivative of a polynomial is ,
the derivative of the exponential function is the exponential function itself, that
is, .
There are then a few rules for derivation, including the famous chain rule. The main rules
of derivation are the following:
product rule:
fxgx'=f'xgx+fxg'x,
quotient rule:
sum rule:
chain rule:
The geometric interpretation of the derivative is that of the tangent to a function. When we
calculate the derivative at a specific point :
if the derivative is greater than zero, the function is increasing in
if the derivative is smaller than zero, the function is decreasing in ,
if the derivative in a is equal to zero, is considered a stationary point. A stationary point
can be a maximum, minimum or a saddle point.
When we talk about maximum and minimum, we need to distinguish between local and
global:
a local maximum is a point where the function is larger than all its neighboring points,
but is not the largest globally,
a local minimum is a point where the function is smaller than all its neighboring points,
but is not the smallest globally,
the point in which the function is the largest is called the ‘global maximum’, and
the point in which the function is the smallest is called the ‘global minimum’.
Sometimes we are also interested in a derivative of a derivative, which we call the ‘second
derivative’. The second derivative gives us information on how the derivative changes as we
change the input to our function. The second derivative measures curvature:
if the second derivative is positive, the function curves upwards,
if the second derivative is negative, the function curves downwards,
if the second derivative equals zero, there is no curvature.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
256 9 Mathematics
In Figure 9.1 we can see an example of a function with a local maximum, a local minimum
and saddle point.
Figure 9.1 Example of a function with local maximum, local minimum, and saddle point
In machine learning, we often deal with functions that take multiple inputs and return a
univariate output. For simplicity, let us consider the case in which f takes in input two vari-
ables, that is . We can find the derivative of the function with respect to each of
the two variables, while keeping the other constant: each one of these derivatives is called
partial derivative. The gradient, denoted as , is the vector that contains the two partial
derivatives.
Example: Gradient
We want to find the gradient of the function .
We start by finding the two partial derivatives using the chain rule for each
derivative:
Instead of considering two inputs, we can consider n inputs and write them in compact form
as a vector : we have then the function which maps to We can think about the
gradient as the derivative of a function with univariate output with respect to a vector. The
product rule, sum rule and chain rule apply not only to each partial derivative separately,
but also to the gradient itself.
Important to remember is that in the case of a function with multiple inputs, the stationary
points are the points where all the elements of the gradient are equal to zero.
In Section 9.2.1 we learned of the importance of the second derivative. As the gradient is the
generalization of the first derivative, the generalization of the second derivative is the Hes-
sian matrix, denoted with . The Hessian matrix contains all the second order derivatives.
We can have:
the derivative with respect to of the derivative with respect to , keeping all the other
x’s constant, which we denote as ,
the second derivative with respect to , keeping all the other input variables constant,
which we call the ‘partial second derivative’ and denote as
The Hessian matrix looks like
How can we use the Hessian to determine what type of stationary point we are looking at?
We can check the following:
if the Hessian is positive-definite (all eigenvalues are positive) at the stationary point,
then the stationary point is a (local) minimum,
if the Hessian is negative-definite at the stationary point (all eigenvalues are negative),
then the stationary point is a (local) maximum,
if the Hessian has both positive and negative eigenvalues, then the stationary point is a
saddle point,
in all the other cases, we cannot say.
Gradient descent is the most well-known algorithm used in machine learning to find the
optimal configuration of parameters of a model, given the data. It is very popular and has
been extended to be computationally efficient in specific applications. Combined with the
backpropagation algorithm, it is widely applied to train neural networks. The goal of this
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
258 9 Mathematics
chapter is to lay the foundations so the reader is equipped with the tools to understand
machine learning algorithms on her own, so we will only explain the simplest versions of
gradient descent. Refer to other chapters
Again, consider our function which takes as input a vector and returns a univariate
output, and assume that our goal is to minimize it. The goal could also be to maximize the
function but in this case, we can simply change the sign and minimize. In machine
learning problems, is often the loss function of our model and is the set of model
parameters. Notice that usually the set of parameters is denoted with , so as not to confuse
it with the dependent variables. The loss function is denoted with L. Hence, you will have to
find the minimum of but the procedure does not change.
We indicate with the input vector which minimizes How can we find In some
cases, we can find in closed form, that is, analytically. However, in most machine learning
algorithms, the closed form solution either does not exist, or it exists but is too computation-
ally expensive to calculate, or when we replace the solution formula with numbers, we en-
counter numerical problems.
Imagine a valley: our goal is to get to the lowest point in that valley. Intuitively, when we
start, we will want to move downhill. If we want to move downhill fast, we will take the
steepest route. We might stop every few meters, re-evaluate what the steepest direction is
and adapt our path. When we cannot go downhill anymore, we stop, and we say we have
arrived. The distance between re-evaluations is an important aspect: if we stop to re-evalu-
ate too often, it will take us a long time to get to the bottom. If we don’t stop often enough,
we will not adapt to the ever-changing slope. Not only that, we will risk overshooting the
lowest point. This is the concept behind gradient descent: the goal is to minimize a function
by iteratively moving in the direction of steepest descent. The step size done before re-eval-
uating the gradient is called the ‘learning rate’.
We outline here a simple gradient descent algorithm to find the minimum of . The gra-
dient descent consists of many iterations. We denote with the value taken by the inputs
at iteration i. We do the following steps:
we start with a guess , and then
we iterate according to where is the step size or learning rate,
and at the current step,
we stop when we are not really moving anymore, or we have completed a predefined max-
imum number of steps.
For appropriate step sizes, the algorithm converges to a local minimum. Choosing an appro-
priate step-size is thus very important. If the step size is too small, convergence might be too
slow. If the step size is too large, gradient descent can overshoot and even diverge. Some
extensions of gradient descent adapt the step size at each iteration. The convergence of the
gradient descent, in general, might be slow and have a zig-zag behavior. An extension to
gradient descent called ‘gradient descent with momentum’ introduces a term that remem-
bers what happened in the previous iteration. This makes the gradient updates smoother, in
a moving average fashion.
Computing the gradient can be computationally very expensive. For this reason, it is some-
times preferable to calculate an approximate gradient and use the approximate gradient to
decide which direction to take next. Stochastic gradient descent is a stochastic approxima-
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
9.2 Calculus and Optimization 259
tion of gradient descent which is widely used in machine learning. The loss function is
constructed as the sum of many terms, such as one term for each data point. The loss func-
tion in this scenario looks something like
Equation 9.10
Remember that the gradient is a linear operator, which means the gradient of a sum of
functions equals the sum of the gradients of each function. So, the gradient of the function
in Equation 9.10 is
Equation 9.11
Equation 9.12
The KKT multipliers are contained in the vectors and . Instead of the constrained optimi-
zation problem in Equation 9.11, we can now solve the following unconstrained optimiza-
tion problem: where is the Lagragian dual func-
tion, which is always convex.
If there are no inequality constraints, the method is called method of the Lagrange multipliers.
An inequality constraint is said to be active when the solution is on the boundary of the area
defined by the constraint, that is, is active if . If an inequality has no influ-
ence on the solution, the relative constraint is said to be inactive, and its corresponding KKT
multiplier is set to zero.
Some machine learning techniques that use constrained optimization and the trick to move
between the primal and the dual formulation are support vector machines (SVMs) and reg-
ularization methods, such as Lasso and Ridge regression.
The concept of probability for simple and repeatable tasks is straightforward. If you toss a
coin many times, you expect to obtain a head about half of those times. Hence, you can say
that the probability of flipping a head is 50 %. According to this definition, called a ‘frequen-
tist’ definition, the probability is the number of times you observe an outcome (in our exam-
ple, obtaining a head) divided by the number of times you execute an experiment (tossing
the coin).
This kind of definition, however, is not always applicable. Imagine you hear your friend Jane
say that with 80 % probability your friend John will not come to the party. Did Jane come to
her conclusion after repeating many times the same party and counting how many times
John showed up? Clearly not. Jane used a probability to express her belief about John’s be-
havior. This interpretation of probability is called ‘Bayesian probability’.
While there is an ongoing debate between frequentists and Bayesians, we will not wade into
it here. In data science, you will encounter some problems where a frequentist approach is
the most appropriate, and others in which a Bayesian approach is required in order to be
able to estimate the parameters of your model.
Uncertainty is introduced in machine learning models through objects called random vari-
ables. Imagine you are rolling a die. The number you roll is a random variable: it can be
“randomly” any of the numbers between one and six. Now, imagine you are watching a
football match and a player kicks a ball; the distance the ball will cover is also a random
variable. What is the difference between the two random variables? The number we obtain
when rolling a die is a discrete random variable as it can take only discrete values. How far
a football player kicks a ball, expressed in meters, is a continue random variable, as it will
take a real value. Clearly, the number we obtain by rolling a die and the distance covered by
a ball follow different behaviors. How do we express the behavior of a random variable?
Let’s go back to the example of flipping a coin. The outcome of the experiment can be either
head or tail. When we have a random variable which has only two possible values, such a
variable is called ‘binary’. We typically code the potential values as one or zero and we call
arbitrarily a success, and a failure. The behavior of a binary random variable is
described by the Bernoulli distribution.
We denote the probability of success with p, and write the pmf as
The flipping the coin experiment can be modeled though a Bernoulli distribution, with pa-
rameter p = 0.5.
Now, imagine that you flip 100 coins and count the number of heads. You repeat this exper-
iment many times. Each experiment consists of 100 flips, which are called trials. What is
the distribution of the number of heads? The number of heads follows the so-called Bino-
mial distribution. The parameters of the binomial distribution are the probability of suc-
cess and the number of trials in each experiment. In this specific example, the parameters
of the Binomial distribution are 0.5, which is the probability of flipping a head for each coin,
and 100, which is the number of flips for each experiment. Generalizing to a probability of
success and trials, we can write the pmf of a Binomial random variable as
where is called the ‘Binomial coefficient’. The Binomial coefficient comes from a branch
where is the rate parameter, is the Euler number and ! is the factorial operator. The
Poisson distribution is used to model the number of occurrences in a unit of time. For exam-
ple, it can be used to model the call per hour in a call center, so that staffing can be planned.
We describe the distribution of continuous random variables through the so-called ‘proba-
bility density function’ (pdf). Let us consider a variable which follows a uniform distri-
bution, that is it can take values uniformly in the interval Its pdf will be
If we are interested in the probability that the random variable will have a value between c
and d, with c larger than a and d smaller than d, we have that . This is
illustrated in Figure 9.2.
Equation 9.13
The Gaussian distribution has two parameters, and , which are the mean and the stan-
dard deviation, respectively (we will see a general definition of mean and standard deviation
in Section 9.3.2). When and , we say that the variable is a standard normal vari-
able.
The importance of the Gaussian distribution in statistics and machine learning is due to
many reasons, but two stand out:
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
264 9 Mathematics
The first reason is the Central Limit Theorem (CLT). The CLT states that the sum of in-
dependent and identically distributed random variables converges to a Gaussian distribu-
tion as the number of variables involved in the sum increases. This is valid regardless of
the distribution of such variables, and it holds also for discrete variables.
The second reason is that most of the time the errors in statistics and machine learning
models are assumed to follow a Gaussian distribution. Placing a distributional assump-
tion on the errors allows us to test the significance of the parameters, and to build predic-
tion intervals.
Figure 9.3 depicts the popular bell curve, specifically the pdf of a standard normal variable.
Shaded in blue, you can see 95 % of the area under the curve. If you are familiar with testing,
you will know that this corresponds, on the x-axis, to the interval [-1.96, 1.96]. This means
that a standard normal variable will take values in such an interval with a probability of
95 %.
Expected value and variance are very useful for describing properties random variables.
The expected value, or mean, of a random variable is the value that we expect the variable
to take on average. This, of course, depends on the distribution that the random variable
follows.
The expected value of a function of a discrete random variable with pmf is
given by
Equation 9.14
where is the set of possible values that the random variable can take. When we con-
sider the identity function , we obtain the expected value of the variable itself.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
9.3 Probability Theory 265
For continuous random variables the sum in Equation 9.14 is replaced by an object called
an ‘integral’. We will not explain what an integral is here, but you can find it in any intro-
duction to calculus book. For now, you should just be aware that the integral is “like the
sum”, but for probability density functions.
While the expected value gives us information on the center of the distribution, the vari-
ance gives us information on the spread. The variance of a random variable is
Equation 9.15
The unit of the covariance is the product of the unit of and that of The variance and
covariance also depend on the size of the values that the variables can assume. It would be
nice to have a quantity that expressed the association between two random variables and
did not depend on the units or the size. Luckily, this quantity exists and is called correla-
tion. The correlation between two random variables and is defined as
Equation 9.16
The correlation is a number between minus one and one. The correlation is one when the
two variables are perfectly positively correlated, which means is always equal to . The
correlation is equal to minus one if the two variables are perfectly negative correlated,
which, similarly, means is always equal to minus . If the two variables are independent,
then the correlation between them is zero, while the opposite is not always true. To get a
better understanding of correlation, you can play a few rounds on http://guessthecorrelation.
com. It is important to say again that covariance and correlation capture linear associations
between variables, and no other types of association. To be exhaustive, we need to mention
that the correlation in Equation 9.16 is called Pearson’s correlation, and other types of cor-
relation exist which we will not cover here, such as Kendall rank correlation, Spearman
correlation, and Point-Biserial correlation.
While we have introduced the definition of mean, variance and covariance for random vari-
ables, in real life we do not observe the variables themselves, but a sample of their realiza-
tions. Sample mean, sample variance and sample covariance are key in statistics, and are
the ‘realized’ versions of the quantities described in this chapter.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
266 9 Mathematics
Equation 9.17
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
9.4 In a Nutshell 267
While you may not have seen Bayes’ theorem expressed as in Equation 9.17, you might have
seen its version for events, that is
Let’s use Bayes’ theorem with an example. Consider a test for a disease that is 90 % sensitive:
if a person with the disease takes the test, there is a 90 % chance the test is positive. The test
has a lower specificity, only 80 %, meaning if it is taken by someone without the disease, the
test will be positive with a probability of 80 %. Assuming 0.5 % of the people who take the
test have the disease, what is the probability that a person has the disease if they test posi-
tive? We apply Bayes’ theorem and find that
The denominator can be found with the so-called law of total probability:
This means that the probability of having a rare disease is still low even after taking the test
and getting a positive result. Such a situation is common in medical tests because it is cru-
cial to identify people with the disease. Giving a positive result to someone without the
disease has in fact a lower cost than giving a negative result to someone with the disease.
■ 9.4 In a Nutshell
Loss Function
The goal of most machine learning algorithms is to minimize a loss function.
It is important to know what the loss function looks like. To minimize a loss
function we need to know calculus and optimization.
Matrices and Vectors
Being familiar with matrices, vectors, and the operations amongst them is
key to understanding how machine learning algorithms work.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
268 9 Mathematics
Matrix Decomposition
Matrix decomposition is key for various reasons. It can be used for dimen-
sionality reduction, and also helps make some algorithms computationally
more efficient and numerically feasible.
Gradient Descent
Gradient descent is an iterative algorithm based on a simple concept: at
every iteration, we follow the direction in which the gradient of the loss
function is steeper.
Constrained Optimization
Sometimes we need to minimize a loss function subject to some constrains.
To do so, we can move to a dual representation of our optimization problem.
Probability Theory
Uncertainty is crucial in machine learning, as it allows us to express how
certain we are about a conclusion. To be able to understand uncertainty,
one needs to be familiar with the basics of probability theory.
Random Variables
Random variables are variables which can take different values according to
a specific probability function.
Bayes’ Theorem
Bayes’ theorem allows us to update our predictions when we receive new
information.
10
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
Statistics – Basics
Rania Wazir, Georg Langs, Annalisa Cadonna
A question often arises when one approaches data science for the first time: what is the
difference between statistics and machine learning? You will hear a multitude of different
answers to this question, which can be encapsulated in one statement: the difference
between statistics and machine learning is their purpose.
Statistics focuses on doing inference about relationships between variables. Statistical
models allow us to predict new observations, but this is not its focal point. The purpose of
machine learning, on the other hand, is to make predictions about new observations as
precisely as possible.
Simple models like linear regression or logistic regression are often considered statistical
models. This is particularly the case if we are testing to understand whether a predictor has
a significant effect on the response, or if one model is better than another one. Statistical
models are interpretable, meaning we can understand the role of each parameter in the
model. On the other hand, when machine learning models are used, one often does not care
about interpretation.
Assume a business problem was defined and all the necessary data collected and organized.
Data scientists usually go through three phases before they decide which model should be
put into production.
1
Box, G. E. P.; Luceño, A.; del Carmen Paniagua-Quiñones, M. (2009), Statistical Control By Monitoring and
Adjustment
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
270 10 Statistics – Basics
In the first phase, the data is explored. We ask ourselves questions such as: are there
missing or wrong data points? Are there outliers? What kind of attributes are there, and
how many of them?
In the second phase, the data must be prepared. We ask ourselves the following ques-
tions: how should we handle missing data? How and how do we recognize erroneous data?
How do we deal with outliers? Are they important to the decision process or should they
be removed? Depending on the models we will try out, the data may also need to be
normalized.
Finally, the third phase consists in experimenting with various statistical and machine
learning models.
The three phases described above are not executed once in a data science project, but
multiple times. Why is this the case? For example, we might observe that the model re-
sults depend strongly on the way we handle outliers and missing values, and choose to
prepare data in a different way. Or we might realize that some features could be better engi-
neered.
This chapter is organized as follows. Section 10.1 describes different types of data. In
Section 10.2, Section 10.3, and Section 10.4 we look at simple linear regression, multiple
linear regression, and logistic regression. Finally, Section 10.5 introduces a few models to
evaluate how accurate a model is.
■ 10.1 Data
Data can come in various forms depending on the application and the collection mecha-
nism. The first distinction we can make is between structured and unstructured data. A
simple example of structured data is a spreadsheet: each row of data denotes an observa-
tion; each column refers to a variable. Unstructured data, on the other hand, are for example
text or image data. In statistics, we mainly deal with structured data.
In the case of structured data, it is important to understand what type of variables we are
dealing with. Variables can be divided into quantitative and qualitative variables:
Quantitative variables, also called numeric variables, can be measured and take n umeric
values. Examples of quantitative variables are age, income, value of a piece of property,
and temperature. Quantitative variables can be discrete or continuous. Usually, we are
dealing with a discrete variable if its value answers the question “how many?”, as in,
“How many students are in a class?” or “How many rooms does a house have?” Discrete
continuous variables can have decimals in their value. Usually, the number of digits
depends on the precision of the measurement instrument. For example, one thermometer
could record 37.1°C, another one 37.12°C. A trick to understand whether a variable is
discrete or continuous is to ask if adding a decimal makes sense. Can we say we have 5.5
(alive) elephants? No? Then the number of elephants is discrete.
Qualitative variables, also called categorical, take values in a limited number of catego-
ries: satisfaction ratings in a survey (“very”, “enough” “not at all”), age groups, or the
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
10.2 Simple Linear Regression 271
brand of a product purchased are just some examples. Qualitative variables can be ordinal
or nominal. Ordinal variables are variables with a “logical”, well understood order. If we
consider the results of a competition, for example, we can say that a second place is better
than a third place. If we consider t-shirt sizes, there is a well understood order: S – M – L
– XL. Nominal variables, on the other hand, do not have an order. Eye color is an example
of a nominal categorical variable. One might like brown eyes more than green eyes, but
we cannot say that brown eyes are better, or higher, or bigger than green eyes.
Statistical machine learning methods can be of two types, depending on the available data:
Supervised learning methods, which are used in cases when we can observe both the
independent and the dependent variables. We can classify such a model these methods
further in
Regression models which are used when the dependent variable is quantitative, and
whose goal is to predict a quantitative value.
Classification models which are used when the dependent variable is qualitative, and
the goal is to predict a class.
Unsupervised learning methods, which are used when we do not have a dependent vari-
able, or our data are not labeled. The most popular method in this context is clustering.
Clustering algorithms look for similarity groups in a data set without knowing exactly
what the underlying “true” groups are. This is useful, for example, in a market segmen
tation problem in which we want to group customers and rank them according to similar
behavior patterns, without having a categorization in advance. If you would like to see an
example, in Chapter 17, Section 17.4.2, you can find an explanation of document cluster-
ing using the K-Means algorithm.
Equation 10.1
where is called the “slope”, is called the “intercept”, and is an error term, centered at
zero, whose variance does not depend on . Introducing the error term means we assume
that our data will not follow a perfect straight line, and will instead be spread around it.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
272 10 Statistics – Basics
Equation 10.2
The loss function in Equation 10.2 is called quadratic loss function. The values of n and m,
which minimize the quadratic loss function, are denoted with m and n and are defined as
In the above equations, and are the sample mean of the -values and of the -values,
respectively, is the sample covariance matrix, and is the covariance matrix of the
independent variable. These quantities are calculated as follows:
With the tools covered in Chapter 9, you should be able to analytically derive the formulas
for the optimal and .
Once the parameters have been estimated, we can evaluate the model performance using a
metric. In regression, the Mean Squared Error (MSE) is often used. The MSE is calculated
as
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
10.2 Simple Linear Regression 273
Alternatively, instead of the MSE, the Root Mean Squared Error (RMSE) may be used. As the
name suggests, the RMSE is the square root of the MSE. One advantage of the RMSE is that
it is expressed in the same units at the dependent variable. For example, if Y is measured in
meters, then the MSE is measured in square meters, but the RMSE is measured in meters.
Let’s now implement a simple linear regression. The Python library Scikit-learn [1] provides
many statistics and machine learning algorithms. Scikit-learn also contains some datasets.
To run the following lines of code, install Python 3, for instance from https://www.python.
org and then then run: pip install -U scikit-learn. Below, we import the Boston Hous-
ing dataset, which contains data relative to housing in the Boston area. We use MEDV, the
median value of a house in $10000, as dependent variable.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# load the Boston Housing dataset, which is part of the sklearn datasets
boston = ds.load_boston()
print("Type of imported object: " + str(type(boston)) + "\n")
print("A bunch is a special kind of Python dictionary. \n The key values of the
boston record are: " + str(boston.keys()) + "\n")
print("The short description of the record: " + str(boston.DESCR))
Xarray = boston.data
Y = boston.target
X = pd.DataFrame(Xarray, columns = boston.feature_names)
With the head(), describe() and info() methods we get a better overview of our dataset and
how it is structured. The method head() will show us the first few lines, describe() will give
us some summary information, such as max, min and sample mean, and info() will tell us
which type the variables in the dataset are.
Please run Listing 10.1 and Listing 10.2. We have 506 observations in total, and all our vari-
ables are numeric. Let’s quickly see how many missing values are in the data set.
There are no missing values in this dataset. This will rarely happen in real life applications.
Before we go any further, let’s visualize our data. We use INDUS, the proportion of non-retail
business acres per town, as the independent variable.
Listing 10.4 Simple linear regression 4: Data visualization with two variables
plt.scatter(x = X.INDUS, y = Y, alpha = 0.3)
plt.xlabel("INDUS (in %)")
plt.ylabel("MEDV (in $10000)")
plt.show()
The scatter plot can be seen in Figure 10.1. The relationship between the two variables does
not appear to be completely linear. However, a decreasing trend can be seen in the data.
Taking a close look at the graph in Figure 10.1, we notice that it shows a few peculiarities:
First, the vertical line at 18 % INDUS catches our eye. There are many observations with
INDUS = 18 %, and at that value of INDUS, MEDV takes many different values. One should
investigate the reason behind this with the relevant experts.
Second, we note two outliers at 50 MEDV and 18 % and 20 % INDUS. The y-value of these
data points differs strongly from the y-value of other points with the same or similar
x-value. Are these two outliers due to randomness or is there a story behind it? We usually
need to dig deeper into outliers to know how to deal with them, but we will not investigate
this further here. However, we need to keep in mind that the outliers will affect our loss
function, which treats them as any other data point.
A final observation is that at MEDV = 50 the points form an almost horizontal line. Could
it be that this data set was cut at MEDV=50? How to deal with these values would also
need further exploration.
Before moving on, we split our dataset into training data and test data. This is an essential
part of any data science project and will help us to evaluate the performance of our model.
We set the size of the test set to 20 % of the entire sample size.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
10.2 Simple Linear Regression 275
Listing 10.5 Simple linear regression 4: Split into training and test data
# Now, let's split the data into train and test sets:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_
state = 18)
oneX_train = X_train.INDUS.values.reshape(-1,1)
oneX_test = X_test.INDUS.values.reshape(-1,1)
And this is how our linear line looks now: y = -0.65x + 29.65. This means that when we in-
crease INDUS by 1 %, we expect a decrease in MEDV of $6500. Let’s lay the regression line
over the data to obtain the plot in Figure 10.2. Notice that this is an absolute change in the
dependent variable, and the percentage is the unit.
plt.xlabel("INDUS")
plt.ylabel("MEDV in $10000")
plt.title("Training Data Results")
plt.show()
Now that we have calculated our simple linear model, we also need to check how well it is
performing.
First, we evaluate it on the training data.
Listing 10.8 Simple Linear Regression 8: Evaluating the Model on Training Data
# Check the results:
lin_mse = mean_squared_error(Y_train, Y_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_mae = mean_absolute_error(Y_train, Y_predictions)
print("The linear regression with one dependent variable has a training RMSE of " +
str(lin_rmse) + ",")
print("and a training MAE of " + str(lin_mae) + ".")
print("Compare this to a Y median value of " + str(medY) + " and a Y maximum value of
" + str(maxY) + ".")
The linear regression with two variables has achieved a training RMSE of 7.98. Compare
that to a median of 21.20 and a maximum of 50.00 of the dependent variable. This doesn’t
look very good, as the RMSE makes up just under 40 % of the median value. Notice that the
model almost always performs better on the training set than on the test set, as it has al-
ready seen the data it is trying to predict. So, how is the performance on the test set?
Listing 10.9 Simple Linear Regression 9: Evaluating the Model on Test Data
Y_predictions = lin_reg.predict(oneX_test)
The linear regression with two predictor variables achieved a test RMSE of 8.28. Anyone
who has followed the preceding calculations to some extent will notice that we calculated
not only the RMSE, but also the MAE. MAE stands for Mean Absolute Error, and it is one of
the metrics we can use to evaluate the performance of our model. The MAE is calculated as
The MAE calculates the sum of the absolute distance of the predictions from the real values
and is less sensitive to outliers than the RMSE. The test MAE for this model is 5.82.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
10.2 Simple Linear Regression 277
We also calculated the coefficient of determination, also called the R-squared. The R-squared
tells us how well the predicted values of y explain the variation in the observed values. In
contrast to MSE and MAE, a model is “better” if the test R-squared is larger. In this case, the
R-squared is 0.23. This will be relevant in Section 10.3, when we talk about multiple linear
regression Figure 10.3 shows the regression line overlaid on the test data.
One last exploration method we can use is to look at plot of the residuals in Figure 10.4. The
residual plot is a graphical representation of the deviation of the predicted data from the
observed data. The vertical axis always shows the difference between the predicted and
observed y-values. The horizontal axis indicates the x-value for a simple linear regression,
and the predicted y-value for a multiple linear regression (see Section 10.3). The residuals
plot shows us that we have outliers in the residuals, but no other clear pattern is visible.
If you would like to view more statistics for evaluating datasets and models, such as p-values
for the coefficients or the F-statistic, you should use the Python statsmodels [2] package.
It should be noted that the linear regression has not totally failed in this case, but we are
still far from a good result. We could try in further runs to remove the outliers and possibly
also the observations where MEDV is close to $500000. We could also add in the other vari-
ables to do a multiple regression, and we might want to consider using other methods
besides regression. We will now try to build a model that predicts MEDV based on more
than one independent variable.
Equation 10.3
In In Equation 10.3 is the intercept and are the parameters relative to the
predictors .
How do we interpret the parameters in a multiple regression setting? Since we have more
than one regressor, we have a slightly different interpretation: the parameter relative to a
regressor is the expected absolute change in the response variable when we change that
regressor by one, while keeping all the other regressors constant. Notice that this interpre-
tation is valid only for dependent variables that appear in a linear fashion. If we have second
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
10.3 Multiple Linear Regression 279
order or higher order terms, such as a squared predictor, or an interaction, this interpreta-
tion does not hold. To understand why the interpretation does not hold in the presence of
nonlinearities, imagine we take, for example, In such case, we cannot change
without changing .
For each data point we have a set of one dependent variable and independent variables.
That is, for the i-th data point we have . The loss function is defined as
As we did in the simple linear regression, we can find the values of the parameters that
minimize the loss function. The solution can be found analytically using differential calcu-
lus. However, data scientists usually rely on software like python to find the estimates of the
parameters. Let’s now fit a multiple linear regression model to our Boston housing dataset.
How sure are we that some parameters are not equal to zero? To answer this question, we
need to assume a distribution on the error, usually a Gaussian distribution, and perform
classical hypothesis testing. In the case of multiple linear regression, we must also consider
the F-statistic. We will not go into detail here, but the interested reader will find the appro-
priate formulas for this in any statistics textbook.
We now want to compare the simple linear regression model with the multiple linear regres-
sion model. To do so we could, for example, compare the R-squared or R-square adjusted,
which is the R-squared penalized for the number of predictor variables. Another way is to
evaluate how well the models are doing in making predictions on the test set.
Listing 10.13 Multiple Linear Regression 2: Evaluating the Model on Test Data
# Check the results:
lin_mse_mult = mean_squared_error(Y_test, Y_mult_predictions)
lin_rmse_mult = np.sqrt(lin_mse)
lin_mae_mult = mean_absolute_error(Y_test, Y_mult_predictions)
r2_mult = r2_score(Y_test, Y_mult_predictions)
print("The multiple linear regression has a test RMSE of " + str(lin_rmse_mult) +
",")
print("and a test MAE of " + str(lin_mae_mult) + ".")
print("The coefficient of determination is " + str(r2_mult) + ".")
The multiple linear regression has a test RMSE of 5.26, and a test MAE of 3.63. Both the
RMSE and the MAE are lower for the multiple regression model!
The coefficient of determination (R2) is 0.69, which is much larger than in the simple re-
gression model. Yet this metric should be used with caution, because if the number of inde-
pendent variables in the regression is increased, the coefficient of determination automati-
cally increases as well. To counter this, the adjusted coefficient of determination is often
used, which penalizes for the number of independent variables in the model. To calculate
the adjusted coefficient of determination, have a look at the statsmodels [2] documentation.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
280 10 Statistics – Basics
How sure are we that some parameters are not equal to zero? To answer this question, we
need to assume a distribution on the error e (usually, a Gaussian distribution is assumed)
and perform classical hypothesis testing. In the case of multiple linear regression, we must
also consider the F-statistic. We will not go into detail here, but the interested reader will
find the appropriate formulas for this in any statistics textbook.
Since linear regression was so easy, why don’t we try modeling our data with a straight line
again? What it looks like for our iris dataset can be seen in Figure 10.5.
Figure 10.5 Linear Regression with the Iris dataset: Petal width vs Species
A probability should have a value in the interval [0,1] and the values of a straight line
clearly do not.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
282 10 Statistics – Basics
Ideally, we would want our model to predict the probability that a flower is an iris setosa.
Then, one can say that the flower is predicted to be an iris setosa if this probability is greater
than 0.5, for example.
To achieve our goal, we use a logistic transformation. The logistic function should be famil-
iar to anyone who wants to work in data science. It is also used as an activation function in
neural networks. The logistic function looks like
Equation 10.4
The interpretation of the parameters in Equation 10.4 is not as immediate as in linear re-
gression. If m is positive, the probability of increases with and if is negative, the
probability decreases with increasing . Moreover, the strength of this influence depends
on the size of . When is binary, one way of looking at the parameter m is that of log-odds
ratio. The odds ratio is defined as the ratio of the odds of a success when the binary variable
is one and the odds of success when the binary variable is zero.
Since we are now working with probabilities, the MSE is no longer as useful as a loss func-
tion. Instead, we want to maximize something called the log-likelihood function. Doing so,
we find the parameter values that, when plugged into the model, are most likely to have
generated the data. Considering the observations , the log-likelihood func-
tion is defined as
How to optimize this function to the parameters m and n cannot be explained in detail in
this chapter, but we can at least try to understand the result. So let’s fit a logistic regression.
We obtain a coefficient of -4.53. This can be interpreted as: the larger the petal width, the
lower the probability that the flower is an iris setosa.
The interpretation of a parameter relative to a predictor is the same as in the simple logistic
regression model when the other predictors are kept constant. Like in logistic regression,
this is not valid in the presence of nonlinearities in the independent variables.
Listing 10.16 Logistic Regression 3: Import and Summarize U.S. Census Data
# Import the libraries needed
import os
import pandas as pd
import numpy as np
# Other libraries we will need
import seaborn as sns # for creating a heatmap of correlated variables
import matplotlib.pyplot as plt # for creating plots of the data
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score
import itertools
# Because the CSV doesn't contain any header, we add colum names
# using the description from the original dataset website
# Get to know the data with head, describe, and info methods
censusData.columns = ["Age", "WorkClass", "FnlWgt",
"Education", "EducationNum", "MaritalStatus",
"Occupation", "Relationship", "Race", "Gender",
"CapitalGain", "CapitalLoss", "HoursPerWeek",
"NativeCountry", "Income"]
print(censusData.head())
print(censusData.describe())
print(censusData.info())
Logistic regression can work with categorical attributes, but to do so, these must first be
converted into “dummy variables”. Here, we will consider only the numerical variables.
Next, we plot at the correlation matrix. You can see this matrix in Figure 10.6. The correla-
tion matrix simply gives us an idea on the correlation between variables. Looking at the
correlation matrix is important to identify if some independent variables are highly cor-
related with each other. However, the correlation matrix will not show more when one vari-
able is a linear combination of two or more other variables. Here, we will consider only the
numerical predictor variables.
And now we are ready to go. We start by splitting the data into train and test sets and fit the
multiple logistic regression model.
The code has calculated for us the coefficients of the independent variables. From these
coefficients, we can see how great the influence of each feature is, and whether it has a
magnifying or reducing influence on the target variable.
Evaluation
How can classification model be evaluated? The starting point for the valuation is the so-
called confusion matrix. Let us define as positive the observations for which the dependent
variable is one and negative the ones for which the dependent variable is zero. True Posi-
tives (TN) are the positive observations which are correctly predicted as positive, and True
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
10.4 Logistic Regression 285
Negatives (TN) are negative observations which are correctly predicted as negative. On the
other hand, False Positives (FP) are negative observations that are predicted as positive, and
a False Negatives (FN) are positive observations that are falsely predicted as negative. Keep-
ing this in mind, we can build the confusion matrix as follow.
PREDICTED LABELS
Positive Negative
ACTUAL LABELS Positive
Negative
With the help of the confusion matrix, we can define the following metrics to evaluate a
classifier:
The success rate or accuracy is calculated as the number of observations correctly iden-
tifies as positive or negative, that is .
Sensitivity is the percent of positives that were also predicted to be positives, that is
. It is also called True Positive Rate or Recall.
Specificity is percent of true negatives detected as such by the algorithm, that is
. It is also called True Negative Rate.
Precision is the percentage of true positives out of all observations predicted to be
positive, that is .
And finally, the F-score, which seeks to combine the information from Precision and
Recall into one number: .
We now predict the class of the observations in the test set and plot the confusion matrix in
Figure 10.7.
Listing 10.19 Regression 6: U.S. Census Data Prediction and Confusion Matrix
Y_pred = log_reg.predict(X_test)
cnf_matrix = confusion_matrix(Y_test, Y_pred)
sns.heatmap(cnf_matrix, annot=True, fmt="d")
igure 10.7
F
Logistic regression, confusion matrix
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
286 10 Statistics – Basics
Notice that by using the predict() method, we will directly obtain the predicted class. When
we estimate the parameters of the logistic regression, however, we have access to even more
detailed information: the probability that the sample belongs to the positive class. If we
want to obtain this probability, we will need to use predict_probs(). The predict() method by
default uses a threshold of 0.5, which means that every observation with a probability above
50 % of belonging to the positive class will be assigned to that class.
In many applications, not all errors are to be valued equally, however: for example, in a
medical diagnostic procedure, one might prefer to have one more false positive than a false
negative, as the latter means missing a true positive diagnosis. In such cases, one could
reduce the “cost” of a false positive. For classifiers that calculate probabilities, as for logistic
regression, this can be done by moving the classification threshold to account for the cost of
false positives and false negatives. Yet we can never really minimize false positives and false
negatives; there is always a tradeoff. To understand why, perform the following thought ex-
periment. Imagine an algorithm that classifies all observations as positives. You would have
100 % true positives, but also 100 % false positives. On the other hand, imagine that you
classify all events as negative. This would result in 0 % false positives, but also 0 % true
positives.
The Receiver Operating Characteristic (ROC) curve is quite useful to illustrate the com-
promise between True Positives and False Positives. A ROC curve is a graph with the True
Positive Rate on the y-axis, and False Positive Rate on the x-axis. Going back to the two algo-
rithms above, the ROC curve connects the two points (0,0) and (1, 1). A good model has a
ROC curve that comes as close as possible to the left corner, that is the ideal situation (100 %
True Positives, 0 % False Positives). We can move along the ROC curve by changing the
threshold that we use to assign observation to the positive class. We can then choose the
threshold based on what error costs more in our application, and how much more.
Let’s calculate the ROC curve of our logistic model and plot it. The ROC curve can be seen in
Figure 10.8.
# The roc_curve metric returns the False Positive Rate (FPR), the True Positive Rate
(TPR),
# and the Thresholds (the cutoff probability for determining the class), which we
don't need here.
We can calculate one more metric for the logistic regression model: the Area Under the
Curve (AUC). The area under a curve is a measure of how well a classifier is performing
compared to classifications which were generated randomly. If the AUC is close to the 45°
line (dark blue dashed), the AUC is only 0.5 and the binary classifier is not better than sim-
ple guessing. The more the curve tends to the top left corner, the closer the AUC is to 1. The
AUC for our model is 0.58.
But how can we decide how well an algorithm has worked on the data at hand? If we use the
value of the loss function on the training data, we run the risk of evaluating our model too
optimistically. After all, the loss function was optimized specifically for the selected data!
We need to test our model with other data that we’ve never seen before to get a feel for how
well our model will then perform “live”. And that’s why we split our dataset into training
and testing data. In machine learning, the data is often split into training, validation, and
test sets. The validation set is used during development to tune the hyperparameters of the
model, which are parameters that are set by us and not estimated by the algorithm. For
example, in binary classification problems, the threshold can be considered a hyperpa
rameter. Hence, we fit the model using the training data, fine tune the threshold using the
validation data, and then obtain the final performance metric on the test set.
Often, we want to use a more robust approach. The most popular approaches are cross-val-
idation and bootstrapping. There are different possibilities of cross-validation, but we will
concentrate on the k-fold cross-validation and set for convenience k = 5. Thus, the training
dataset is divided by random sampling into five subgroups of approximately equal size.
Imagine a large pizza divided into five pieces, and mark the top piece as 1, and further
clockwise as 2, 3, 4, and 5. In the first pass, take data group 1 as validation data and train
with the remaining data (groups 2, 3, 4, 5). In the second pass, the second data group is set
aside for validation and the algorithm trains with the other data groups (1, 3, 4, 5). One
continues in such a way until all five data groups have served exactly once as validation
data. Then you have five test metrics (for example AUC for classification and RMSE for re-
gression) which can be used to compare the different algorithms. In bootstrapping, you take
the training data via random sampling with replacement. Thus, the same observation can
be drawn several times, and other observations may not appear in the training data at all.
This statistical fact is exploited by bootstrapping: We continually draw samples from the
training data until we have a new training data set of the same size. The observations which
were never drawn in this procedure go into the validation data set. The validation results are
used to compare the different algorithms. To reduce the variance of the validation results,
cross-validation or bootstrapping is often performed several times, and the mean of the
performance metrics over each trial is used to evaluate the model. Regardless of whether
bootstrapping or cross-validation is used, after the best algorithm has been selected, train-
ing is again performed on the entire training data. The model is then tested on the test data
to make a statement about how well it works in general.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
10.6 In a Nutshell 289
■ 10.6 In a Nutshell
Types of Data
Data can be categorized as structured (tables) or unstructured (test, images,
etc.). Structured data contain variables that can be quantitative or qualitative.
Linear Regression
In linear regression, the dependent variable is quantitative. The interpretation
of the parameter relative to an independent variable is the expected absolute
change in the dependent variable when we change the independent variable
by one unit. The MSE, RMSE and MAE are metrics used to evaluate how good
a linear regression model is.
Logistic Regression
In logistic regression, the dependent variable is binary, that is takes only two
values, often coded with zero (negative) and one (positive). The interpretation
of the parameters is not as immediate as in linear regression but gives an
indication of the effect of the independent variables on the response. AUC
gives an idea of how good a model is. Depending on the importance of the
classification errors in the specific applications, other metrics are used.
Cross-validation and Bootstrapping
Cross-validation and bootstrapping are two techniques to obtain a more
robust measurement of the performance of a model. Both consist in
calculating many metrics on subsets of the training data. The final
performance metric is then calculated on the test set.
References
[1] Pedregosa et al. Scikit-learn: Machine Learning in Python (2011), JMLR 12, pp. 2825-2830
[2] Seabold, Skipper, and Josef Perktold. statsmodels: Econometric and statistical modeling with
python. Proceedings of the 9th Python in Science Conference (2010)
[3] Fisher, R.A. The use of multiple measurements in taxonomic problems, Annual Eugenics (1936), 7,
Part II, 179-188
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
11
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
Business Intelligence
(BI)
Christian Mata
“The real value of data is not in the data itself but in the
insights and decisions that can be made from it.”
Jeff Bezos, CEO of Amazon.com
What is business intelligence (BI) and what role does it play in organizations?
How has BI developed over time and what is it like today?
How does BI enable decision-making and why is business context essential
to BI?
How do BI, data management, data integration, and data warehousing fit
together?
What is ETL and what is ELT and which approach is relevant to BI?
What do you need to know about data modelling, its abstraction layers, and
different approaches for designing data structures for decision support?
What is the role of reporting in BI?
What types of reporting are common and what are their differences?
What do the different types of data analysis focus on and what is visual
analysis useful for?
What technologies and tools are relevant to BI and are commonly used to
manage an organization’s information, including some examples?
What are the differences and similarities between BI and data science, and
how do they complement each other to increase the value of data to the
business?
What can be expected from the evolution of BI and how do data science
and AI fit in?
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
292 11 Business Intelligence (BI)
In this chapter we will explore the discipline of business intelligence (BI), which is well es-
tablished in many organizations. We will build an understanding of what BI is, the role it
plays in modern companies, and how it intersects with the field of data science.
For data scientists, understanding BI is not just about knowing its tools and techniques, but
also about appreciating its important operational and strategic role in the business. Recog-
nizing how their data science skills can complement and enhance BI can have a big impact
on an organization compared to working independently (of an existing BI team).
Improved collaboration between BI and data science professionals will be a driving force in
moving businesses forward through data in the context of the necessary transformation and
change that organizations are facing today.
Data science: In contrast, DS extends its reach to both structured and un-
structured data, covering a broader spectrum of data sources, including text,
email, documents, images, audio, and video. This allows DS to extract insights
from complex and diverse datasets using advanced algorithms, machine
learning, and statistical methods. Data science also uses unstructured data
to uncover patterns and insights that are not readily apparent in structured
datasets alone. As an indication of the volume of data involved, it is commonly
accepted that most of the data stored in organisations is unstructured. Estima-
tions say unstructured data may account for 80-90% of stored company data.
Temporal, time perspective Ô mainly descriptive analysis with past and
current data vs. mainly predictive analysis about future data:
Business intelligence: BI is primarily concerned with reporting on past and
present states, providing a retrospective view of business performance. It
tracks key performance indicators (KPIs), monitors trends over time and pro-
vides pre-defined information based on data that has happened. These facts
are often put into context using business plans, budgets, pre-defined forecasts,
or various comparisons based on time periods, benchmarks, or competitor
performance. This retrospective and comparative approach is essential for
understanding what has happened and what is happening now in the business
landscape. It gives businesspeople proactive information for business planning
and allows them to monitor if it is going to be a good day or if urgent reactions
or decisions need to be made.
Data science: Data science, on the other hand, is generally concerned with
a forward-looking perspective, using predictive models and algorithms to fore-
cast future trends and outcomes. Beyond prediction, DS also deals with pre-
scriptive analysis, providing recommendations on possible courses of action to
achieve goals or mitigate future risks. This predictive and prescriptive capability
positions DS as a tool for strategic planning, risk management and innovation,
guiding businesses toward a future shaped by data-informed decisions and
actions.
A classic definition of the term “business intelligence” comes from Howard Dresner, who
described BI in 1989 as “a set of concepts and methods to improve business decision-mak-
ing by using fact-based support systems”.1 In this definition he broadened the scope of BI
beyond just reporting to encompass a wide range of tools and processes intended to analyze
and manage business data. This definition reflects a current understanding of BI as a tech-
nology-driven process, including data mining, online analytical processing (OLAP), query-
ing, and reporting.
A newer well-known definition from market research firm Forrester Research says: “Busi-
ness intelligence is a set of methodologies, processes, architectures, and technologies that
transform raw data into meaningful and useful information used to enable more effective
strategic, tactical, and operational insights and decision-making.”2
Decision support
BI plays a central role in providing historical and current views of business operations and
supporting better decision-making across different levels of an organization. Examples in-
clude decisions about attracting and supporting customers, selecting suppliers, making
sustainable choices in materials sourcing, securing energy contracts in line with demand
forecasts from a production program, planning investments and calculating the ROI for
power generation assets, or quantifying carbon emissions that can be avoided or reduced—
data ready for analysis which answers relevant business questions is needed for many deci-
sions.
Operational improvement
The analysis of data provided by BI helps to identify areas for improving business perfor-
mance, such as revenue growth, cost reduction or process improvement. Examples include
improving revenue by analyzing customer and sales data for targeted sales initiatives and
personalized marketing campaigns, improving costs by optimizing inventory and balancing
supplies, finding areas to save on energy consumption, or reducing downtime of production
plants and assets by planning outages and predictive maintenance.
Business planning
BI reports and insights derived from it are important for business planning, providing a
basis for formulating future strategies and goals. Examples include data on market trends,
competitor performance, customer feedback, behavior, and price sensitivity to guide prod-
uct development, marketing, and production quantities. It also includes historical sales pat-
1
D. J. Power (2007): A Brief History of Decision Support Systems, https://dssresources.com/history/dsshistory.
html
2
Boris Evelson (2008): Topic Overview: Business Intelligence, Forrester Research Inc., https://www.forrester.com/
report/Topic-Overview-Business-Intelligence/RES39218
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
11.1 Introduction to Business Intelligence 295
terns, forecasts from salespeople, and assessments of environmental impact and sustain-
ability objectives.
The concept of BI has been around for decades, evolving from simple data collection and
reporting to much more accessible and sophisticated data analysis and visualization wher-
ever you go. In the history of business intelligence, a series of developments contributed to
the growth and sophistication of BI.
Here are major milestones that highlight the evolution of BI:
Figure 11.1 Business intelligence’s development phases (from 1958 to the 2020’s)
1958 – The concept “Business Intelligence” The concept of business intelligence as we un-
derstand it today, which involves using technology to gather and analyze data, was first
described in 1958 by Hans Peter Luhn, an IBM researcher, in an article titled “A Business
Intelligence System”.3
1960s and 1970s – Emergence of Decision Support Systems (DSS) Decision Support Systems
emerged as a response to the growing need for computer-based systems that could assist
managers and business professionals in decision-making processes. The development and
evolution of DSS started to become notable during the 1970s as advancements in computer
technology and information systems expanded their capabilities and applications in various
industries. These systems marked a shift from simple data processing to more analytical
approaches with data analysis, modelling, and simulation.
1980s – Executive Information Systems (EIS) EIS emerged as a form of DSS, focusing on
providing high-level executives with easy access to internal and external information rele-
vant to their critical success factors.
Late 1980s to Early 1990s – Relational databases The introduction of relational databases by
companies like Oracle, IBM, and Microsoft allowed for easier access and management of
data, enhancing BI capabilities.
1990s – Data Warehousing and OLAP The development of data warehousing and online an-
alytical processing (OLAP) enabled companies to store larger volumes of data for use in
analytical queries and analyses. During this era, well-known vendors such as SAP, Oracle,
and IBM grew strongly.
3
Luhn, H. P. (1958): A Business Intelligence System, https://ieeexplore.ieee.org/document/5392644
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
296 11 Business Intelligence (BI)
Early 2000s – Business Performance Management (BPM) The era of Business Performance
Management introduced the concepts of monitoring and managing an organization’s perfor-
mance according to key performance indicators (KPIs).
Mid-2000s – Proliferation of Self-Service BI Tools such as Tableau and QlikView improved
data access by enabling end users to perform data analysis without extensive technical
skills, enabling more self-service BI in organizations.
2010s – Big Data and Advanced Analytics The big data revolution brought new challenges
and opportunities in handling vast amounts of data. Technologies like Apache Hadoop and
Spark facilitated the processing of big data, and advanced analytics began to incorporate
predictive modelling and machine learning.
Late 2010s to Present – AI and Machine Learning Integration Modern BI tools have started
integrating AI and machine learning into their systems, offering more sophisticated and
augmented analytics capabilities, including predictive and prescriptive analytics.
Cloud Computing and BI – The use of cloud computing in BI has enabled more scalable and
flexible BI solutions, as well as accessibility benefits, as more companies move to cloud-
based BI platforms.
Generative AI emergence
The advent of generative artificial intelligence (GenAI) and its increasing integration into BI
and data science workflows will transform the field of business intelligence, particularly the
activities performed, and the way professionals will work. It will improve the accessibility
of relevant facts to different stakeholders through natural language, speed up the day-to-day
data analysis and assist in the automation of routine tasks such as data cleansing and
preparation. This opens the potential for analysts and data scientists to focus on more com-
plex business scenarios and spend more time on higher-level work of strategic value to the
organization and supporting business users in their use of data for decisions and actions.
The relationship between data science (DS)/AI and business intelligence (BI)
can be looked at from (at least) two seemingly opposing perspectives.
Depending on the perspective, the other discipline may be “just a part” of
the “bigger picture” of the “main” one. The purpose of this aside is not to find
or claim to have “the” answer, or to prove one side or the other right or
wrong. The aim is to raise awareness and improve mutual understanding of
the viewpoints of people from the other data discipline, so that the discussion
can move on to finding the right way forward, whatever the current situation,
initial viewpoint, and organization.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
298 11 Business Intelligence (BI)
Insight translation: BI tools and practices play a key role in making data
science (and AI) insights accessible to non-technical stakeholders, bridging
the gap between their complex analyses and practical business use.
So, which of these topics do you recognize in your work environment?
I will deliberately not add my conclusion on this topic as it would be biased
due to my predominant experience in BI. However, I do believe that a dis-
course on the best way to approach the interplay of BI and DS, considering
the different viewpoints mentioned, the history of the disciplines within the
organization, and certainly the future business and data strategy, is needed.
This will enable management to position the data team organization for future
challenges and opportunities at this exciting point in time. From this unique
position developed, a competitive advantage through data can be achieved.
To avoid false conclusions, I would like to add that I personally tend towards a
convergence of the data disciplines, namely BI, AI and DS, which I will expand
on in section 11.4.
The ability to consistently make informed decisions based on data is a success factor in to-
day’s business environment and business intelligence (BI), if done right, enables people to
do this.
BI as an enabler of data-driven decisions
Insight generation BI tools and processes help to turn raw data into relevant, contextual in-
formation that provides a clear understanding of the different aspects of the business. This
information can then be distilled to generate new insights. Examples include the buying
patterns of customer segments, a comprehensive view of the supply chain with its opportu-
nities, risks and delays, and the monitoring of sales and customer feedback to inform prod-
uct management.
Timely information By offering reports, dashboards, and analyses with up-to-date informa-
tion at the required frequency, BI helps companies respond quickly to changes in the mar-
ket, the value chain, or within the company. Examples include up-to-date views of key per-
formance indicators (KPIs), sales transactions and orders, production output, inventory
levels, supply chain issues, competitor activities such as product launches and pricing
moves, as well as continuous monitoring of revenue, actual costs to budget, and current and
expected cash flow.
Democratization of data BI tools make data increasingly accessible and understandable to
non-technical business users. This means more people can use the information they need to
make their daily decisions. Examples include BI capabilities built directly into business
applications; interactive dashboards that visualize information relevant to their role; collab-
orative BI platforms that allow users to share information, comments, and insights; and
mobile BI applications that give decision-makers access to critical information anytime,
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
300 11 Business Intelligence (BI)
anywhere. In addition, self-service BI enables non-technical staff to create their own reports
and analyses without relying on IT, fostering a culture of data-driven decision-making
across the business.
Predictive analysis Advanced BI systems integrate predictive analysis to forecast future
events, enabling business users to become more proactive in their business choices. Exam-
ples of predictive analytics include sales forecasting to help companies plan production,
inventory, staffing and marketing; customer churn prediction based on customer data to
enable targeted retention strategies; demand forecasting to predict future demand for en-
ergy, products and services; or predictive maintenance to avoid failures and the resulting
unavailability of equipment or infrastructure.
Next, we will look at the key performance indicators (KPIs) and metrics that measure busi-
ness performance and form the basis for strategic and operational decisions.
The real power of BI is not in the data itself, but in the data in its context, where it is inter-
preted and utilized. The importance of understanding business context and domain knowl-
edge cannot be overstated. Its effective use transforms BI from a simple data processing
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
11.1 Introduction to Business Intelligence 301
activity, producing numbers to report on, into a strategic tool for decision makers that con-
tributes significantly to business objectives. Therefore, a good grasp on the business context
is just as important for BI professionals as their technical data handling and analysis skills—
another parallel with data scientists. It should also be added that the insights generated are
still not creating business value. In most organizations that don’t monetize their data, the
value of data is only realized when it is used to change the decisions and actions that result.
Contextualizing data
A key part of business intelligence is to transform (mostly structured) data from various
sources into meaningful findings and insights for the business. For instance, analyzing
customer feedback, survey responses, and support interactions can reveal customer satis-
faction levels and areas for product or service improvement. The significance of the signals
initially identified in data analysis is highly dependent on the business context, that is, the
specific conditions and circumstances in which a company operates. Examples include in-
dustry and market trends, economic conditions, and the competitive landscape; characteris-
tics of the customer base, such as age, location, and buying preferences; operational con-
straints, like production capacity, supply chain logistics, and resource availability; and the
regulatory environment of the industry, to ensure that data projects and their products
comply with applicable laws and regulations.
Decision-making background
The business context provides a framework for decision-making, helping to prioritize what
data to analyze and where to look for new insights. Examples of such background informa-
tion include regulatory compliance requirements, market trends and competitor activities,
benchmarking with industry peers, risk assessments and mitigation strategies, or customer
preferences and behaviors.
relevant and focused on the needs of the business users. Professionals with strong domain
knowledge can bridge the gap between original data, actionable information, and decisions.
They can translate often complicated datasets into clear products that reveal the right
course of action and can be presented in a compelling way to promote decisions.
11.1.6.3 Insight
An “insight” refers to a valuable piece of information or deep understanding that is derived
from analyzing data that provides clarity, reveals underlying patterns, or uncovers hidden
relationships and correlations that were not previously apparent.
Insights are the result of processing and analyzing data sets of varying size and complexity
to identify trends, anomalies, relationships, and performance metrics in its business con-
text, often with reference to relevant comparative figures.
Unlike the usual information found in reports, an insight is like finding a hidden treasure
in the data that can reveal a new way to improve or solve a problem. It provides valuable
information and conclusions that can guide key decisions and actions, influence strategy
and lead to business improvements. For example, insights can help companies to optimize
their operations, improve the customer experience, identify new business opportunities, or
mitigate risks.
Business intelligence professionals help their organizations and business users bridge the
gap between data and decisions. They look at the whole lifecycle of data, from collection and
processing to visualization, reporting and communication, to ensure that business issues
can be addressed with accurate and timely information.
To that extent BI professionals engage in a variety of tasks and activities to ensure effective
data management and utilization. The following list gives an overview of typical activities,
while not intended to be exhaustive. These activities can be broadly categorized into project
setup, platform provisioning, development of data integration, reporting and dashboarding,
ongoing operations and communicating the data to others.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
11.1 Introduction to Business Intelligence 303
Project setup
Needs analysis: Understanding business objectives and requirements of what is needed.
Solution design: Defining how BI solutions and products should be delivered, including
data architecture and data models, data integration processes, and the design of resulting
data products (e. g., reports, dashboards, data services).
Project planning: Setting timelines, allocating resources and budgets, breaking down the
work to be done into manageable chunks, and defining the tasks, iterations, or phases to
complete them.
BI platform provisioning
Platform selection: Choosing appropriate BI tools and platforms based on organizational
needs.
Configuration and customization: Setting up and customizing BI platforms as the founda-
tion for delivering data products within the specific organizational setting and related
standards.
Security and compliance: Ensuring data security, privacy, and compliance with relevant
regulations.
Data analysis
Data analysis: Gaining a thorough understanding of how business activities are repre-
sented in the company’s data, including evaluating business performance for reporting
purposes. It focuses on answering concrete business questions, discovering new informa-
tion, and narrating it into insights that need to be communicated to key stakeholders.
Data analysis also provides input for data modelling and solution design, as well as data
integration activities.
Operational activities
Data refresh and maintenance: Regularly updating data sets and data products such as
reports and dashboards, in addition to maintaining the overall health of BI systems.
User support and training: Providing support to users for the BI solutions deployed, in-
cluding the data provisioned, training on the tools leveraged and conducting training
workshops to enhance data literacy. These efforts also support self-service BI initiatives.
Continuous improvement: Regularly review and improve BI processes and systems for
efficiency, accuracy, and usability, including monitoring of data quality, usage of the data
products (check for published content that doesn’t get used for a significant period of time),
data integration processes and load performance, and data query response and display
times (e. g., on dashboards).
Data storytelling
Data storytelling is the process of creating compelling stories around data findings, using
a mix of narrative techniques, data visualization and context to highlight key insights. It
is not just about presenting data, but about making it relevant and relatable to stakehold-
ers. It recommends not talking about the data itself, but rather the people in the context
of the data, focusing on the impact on business strategy and operations. Data storytelling
ensures that the insights derived from data analysis are communicated in a way that
increases the likelihood of positive decisions and resulting actions that lead to tangible
business results. Presenters should pay attention to how the human brain processes new
facts from a report or presentation. To effectively process and retain new information
while conserving energy, the brain engages different regions in a specific sequence. Ini-
tially, the brainstem plays a crucial role in regulating overall alertness, ensuring that we
are in an optimal state of arousal for engaging with new data, without becoming over-
whelmed. This sets the stage for more complex processing. The limbic system, with the
amygdala playing a key role, then adds emotional significance to the information, which
enhances its memorability and potential impact on our decision-making processes. Fi-
nally, the cerebrum, responsible for higher-order functions such as planning and reason-
ing, thoroughly integrates this information, applying complex cognitive processes. By
presenting data clearly, logically structured, and emotionally engaging, we can greatly
enhance understanding, memory retention, and the practical application of insights, lead-
ing to more effective decision-making.
relevant, and of sufficient quality for downstream use in analysis, reporting and business
decision-making.
11.2.1 W
hat is Data Management, Data Integration and
Data Warehousing?
Data management
Data management covers the practices, architectural techniques, and tools used to achieve
consistent access to and delivery of data across the spectrum of data creation, preservation,
and archiving. It also ensures data quality, accessibility, and security. Robust data manage-
ment is the foundation for reporting, compliance, and efficient analysis to support deci-
sion-making across the organization.
Data integration
Data integration involves collecting, consolidating, and harmonizing data from disparate
sources to provide a complete and accurate view of the business for analysis and reporting.
It also includes tasks such as data cleaning and cataloging to make sure that the data is
accurate, consistent, and readily accessible. Data integration involves a variety of tech-
niques, including data loading via ETL (Extract, Transform, Load) or ELT (Extract, Load,
Transform), data virtualization, replication, and federation, to integrate data in real-time or
batch modes, depending on the business requirements.
4
Inmon, W. H. (1996): Building the Data Warehouse. John Wiley & Sons
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
306 11 Business Intelligence (BI)
Figure 11.2 Data Warehouse simplified—from data sources to the DW for different data uses
The objective of data warehousing is the provision of reliable, historical, and current infor-
mation. Information from the DW is typically being used within the organization, however
it is not uncommon that a DW is additionally serving key stakeholders in the value chain,
such as suppliers, business partners or even customers (through dedicated portals).
Extract: Data is collected from various source systems, such as ERP and CRM systems,
financial applications, other operational systems, and external sources. How this data is
extracted depends on the established interfaces between the systems and the agreed data
access methods (e. g., read from a database or API, or file transfer).
Transform: The extracted data is cleansed, enriched, transformed, and loaded into a format
suitable for analysis and querying.
Load: The transformed data is then loaded into the target system, such as a data warehouse.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
11.2 Data Management Fundamentals 307
ETL and ELT—as shown in Figure 11.3—are data integration approaches that enable orga
nizations to consolidate data from multiple sources into a dedicated system for further anal-
ysis. The order of the data processing steps can be different based on the preferred data
integration method, platform, or tool in use. In recent years, the ELT paradigm has become
more common, particularly due to the trend of data lakes, data lakehouses, and cloud data
warehouses. In this paradigm, data is extracted and loaded directly into the target system,
where it is stored and further transformed into a data model for analysis. The transform step
(T) of the ELT process uses powerful computing resources. The ELT approach is predomi-
nant in organizations using cloud infrastructure.
It is worth noting that ELT is not a new approach. In the early 2000s, Oracle-based data
warehouses commonly used this solution architecture for data integration. These data ware-
houses (DWs) used SQL operations within the database to take advantage of strong query
performance and stored procedure functionality. However, the increasing need to better
distribute data loads and query workloads on DW systems that run 24/7, as well as faster
development, has made the ETL approach, using tools such as Informatica or Data Stage,
more popular.
The emergence of big data and its open-source technologies, coupled with more affordable
storage options, has expanded the capabilities to meet the increasing demands for data. It
has become possible to add large volumes of unstructured data, such as text and image data,
enabling new use cases for businesses. To address the three Vs of big data (volume, velocity,
variety), the approach has shifted away from “schema-on-write”, where the data structure is
determined before the data is loaded, to the “schema-on-read” model. The latter approach
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
308 11 Business Intelligence (BI)
aims to efficiently process raw data with minimal effort to ingest data and make it available.
The subsequent use of this data involves separate (cloud) infrastructure components for
data storage and data processing (compute), enabling parallel workloads in the data pipe-
line.
Figure 11.4 Data modeling perspectives: conceptual (CDM), logical (LDM) and physical (PDM) level
Figure 11.5 Conceptual data model for sales orders—ER diagram (CDM)
Figure 11.6 Logical data model for sales orders—ER diagram (LDM)
Figure 11.7 Physical data model for sales orders—ER diagram (PDM)
Dimensional modeling
Dimensional modelling is an approach to modelling the data structures of a data warehouse
(DW) introduced by Ralph Kimball. It emphasizes simplicity and ease of understanding by
organizing data into fact and dimensional tables in the so-called Star Schema (see Figure
11.8). This approach is widely used in BI and DW and is known to support efficient query
and analysis.
Star Schema
A popular data architecture pattern in BI, the Star Schema is a certain form of data model
used for access layers of a DW architecture, that are leveraged by BI front-end tools for flex-
ible reporting and analytics. It is designed for query efficiency in BI applications, allowing
fast data retrieval on large data volumes stored in a (relational) database. The star schema
is a data structure that is often represented in a star-like shape, with a central fact table and
its related dimension tables surrounding it. The ER diagram in Figure 11.8 shows an exam-
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
312 11 Business Intelligence (BI)
ple of sales order transactions. In a single star schema, the relationships between each
dimension table (customer, product) and the fact table (sales order) are one-to-many. This
central table holds the “facts”, that is the metrics (also called measures) such as sales
amount, order quantity, etc. The fact table is surrounded by the dimension tables, which
detail the “who”, “what”, “where”, and “when”—for example customer, product, store loca-
tion and order date. A star schema generally represents a business process. This straight
forward data structure makes the data model easier to understand and use, and faster to
analyze, optimizing data access and query performance to support efficient data usage. In
this data model, denormalization (non-redundant storage of data) is applied to the dimen-
sion tables, which generally take up much less storage space than the fact tables (e. g., facts
from transactions in the millions or more vs. dimension entries such as products or custom-
ers in the thousands). Answering business questions requires filtering data in one or more
dimensions first, to then summarize the measures from the fact table, e. g., asking for Sales
in January 2024 for customer segment Enterprise in the product category “book”.
Figure 11.8 Star schema data model for sales orders—ER diagram (LDM)
Data Vault
The Data Vault methodology of data modelling—in its current version 2.0—is suited for han-
dling large and complex data sets and building (enterprise) data warehouses. Data Vault is
designed for scalable, flexible, and maintainable data warehouses, that adapt to changing
business requirements while ensuring that data integrity and history are preserved.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
11.2 Data Management Fundamentals 313
In Data Vault the data is organized using the concepts of “hubs”, “links”, and “satellites”,
which give Data Vault data models its flexibility.
Hubs in a Data Vault model act as central reference points, capturing the unique business
concepts or entities through their key identifiers.
Links connect different hubs, representing relationships and transactions between the
core business entities.
Satellites enrich the data model by storing descriptive attributes, historical changes, and
contextual details associated with hubs and links.
In a Data Vault architecture, the Raw Vault and the Business Vault represent two distinct
layers that serve different purposes and separate the concerns of data collection and data
use.
The Raw Vault is a foundational layer that stores data in its original, unaltered form di-
rectly from source systems (without any transformations), ensuring a historical, trace-
able, and auditable record of the data changes over time.
The Business Vault builds on the Raw Vault by applying business context and rules to
the raw data. The added business logic transforms its data into an analysis-ready layer
with relevant information for business analysis and decision-making.
The entity relationship diagram in Figure 11.9 shows our modelling example in a possible
data vault setup at a conceptual level.
The entity relationship diagram in Figure 11.10 shows the data vault example in its phys-
ical data model.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
314 11 Business Intelligence (BI)
11.3.1 Reporting
Reporting is the process of collecting, organizing, and presenting data in a structured for-
mat, often as reports, to communicate the state of a business at a given point in time.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
11.3 Reporting and Data Analysis 315
Reports are used for different purposes and are produced regularly (daily, weekly, monthly)
or ad hoc. Reporting is a core part of BI, which puts data into meaningful summaries to
monitor and analyze business performance.
For different usage scenarios, different types of reporting ensure that the information is
effectively communicated to people at different levels of the organization, supporting both
day-to-day operations and long-term strategic goals.
Management reporting
The goal of management reporting is to provide a comprehensive view of an organization’s
performance to assist executives and managers in overseeing business performance, en-
abling informed operational and strategic decisions and actions, and to support business
planning. Management reports combine summaries of operational data, analytical insights,
performance metrics against targets, and KPIs. They are typically produced on a monthly,
quarterly, or annual basis and are used in board meetings, strategy sessions and manage-
ment reviews to set the overall direction of the organization, and to report on progress at
regular departmental meetings. The sales dashboard shown in Figure 11.11 provides a clear
overview of actual vs. planned revenue and profit.5 It uses the visualization concepts of the
International Business Communication Standards (IBCS), an information design notation
that helps to ensure consistent visual representation of information (e. g., actual values
presented differently from plan values, or absolute values presented differently from per-
centages), something we take for granted in other areas such as reading a map or sheet
music. IBCS is already widely used in (but not limited to) financial reporting.
Figure 11.11 Sales management report in IBCS notation, comparing actual revenue vs. plan
5
Zebra BI dashboard template, https://zebrabi.com/template/sales-dashboard-power-bi-template/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
316 11 Business Intelligence (BI)
Performance monitoring
Performance monitoring refers to the continuous tracking and analysis of an organization’s
operational and strategic performance using (BI) tools and systems to provide a foundation
for frequent decision-making, process improvements, and meeting business objectives. It is
widely used in areas such as finance to monitor financial health, sales to monitor revenue
and prospect to customer conversion rates, operations monitoring to assess efficiency and
productivity, IT to ensure system reliability and performance, and marketing to evaluate
campaign ROI, customer satisfaction and retention. In Figure 11.12, a performance monitor-
ing dashboard provides current information on flight delays by region and airline with his-
torical information for reference. For the chosen airlines with high flight volumes, the aver-
age delay per flight is compared daily for the dynamically selected time interval (5 months
out of 10 years shown, while the dashboard quickly responds to different user selections
generally in less than 1 second).6
Figure 11.12 Performance monitoring dashboard: airline flights and arrival delays
Operational reporting
Operational reporting details the ongoing activities and transactions in relevant business
processes and provides an update on the day-to-day functions of an organization, depart-
ment or business unit. Generated on a frequent basis—typically daily or weekly—these
reports help operational staff and managers to perform regular activities and make timely
decisions to keep operations running efficiently. Operational reports focus on short-term
performance and information for direct day-to-day decisions, typically at a departmental
6
“Airline flight delays”: Analysis based on Tableau sample dataset (ca. 70 million flights, 1 record per flight from
1999–2009) derived from U. S. Department of Transportation‘s Bureau of Transportation Statistic.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
11.3 Reporting and Data Analysis 317
level. They include metrics such as sales, inventory levels, production output, and customer
service indicators.
Analytical reporting
Analytical reporting goes beyond simply presenting data for a point-in-time snapshot of the
business. These reports provide in-depth analysis, findings, and insights about trends, pat-
terns, and anomalies in business performance over time to support informed, fact-based
decisions. The results cover nuances on topics such as financial performance, market anal-
ysis, customer behavior and operational efficiency. Analytical reports are—currently—being
produced by analysts through the examination of historical data, application of statistical
analysis, and use predictive models sometimes. The process uncovers patterns in business
performance and forecasts potential future scenarios, providing a detailed insight into busi-
ness operations. The results are key to helping business leaders make strategic decisions.
While analytical reports usually involve interpretation in the specific business context, pre-
sentation and advice to decision-makers, which is very much a human task, parts of the
process of compiling figures for a wide variety of business questions—often not answered in
standard reports—has increasing automation potential through the application of AI.
***
Because of these very different usage scenarios and types of reporting, it is not surprising
that they are addressed by a range of report types that deliver results. The following list of
report types represent characteristic outputs of reporting activity and different ways in
which BI transforms data into actionable information to help its stakeholders monitor per-
formance, ensure compliance, make informed decisions, and gain advantage from data.
Standard reports
Standard reports are pre-defined, regular, routine reports that provide ongoing visibility of
business operations. By addressing clearly defined information needs with established re-
porting criteria and intervals, they are important resources for managers and stakeholders
to track progress, identify trends and obtain relevant and timely information for decision-
making. Examples include monthly sales reports, marketing campaign evaluations or in-
ventory level reports.
Ad hoc reports
Ad hoc reports are created on demand to respond to specific queries or issues as they arise.
Examples include customer feedback analysis, environmental impact assessment, compet
itor pricing updates, or incident analysis reports.
tween data collection and reporting) to enable immediate and dynamic decision-making.
Examples include production line monitoring, a call center operations dashboard, an energy
trading dashboard with external and internal information, or a solution for monitoring
power generation that tracks the output of renewable energy assets.
Dashboards
Dashboard is an umbrella term for displays that contain multiple graphical information el-
ements in an interactive user interface. Because dashboards serve many different purposes,
a broad definition by Nick Desbarats that encompasses them is simply “any display with a
bunch of charts on it”, using an analogy to the general purpose meaning of documents (“any
page(s) with words and/or images on them”). Common uses for dashboards are to provide
an overview of business objectives or monitor the performance of business processes using
aggregated data points in the form of KPIs, metrics and charts. Examples include executive
dashboards for the CEO or CFO, operations dashboards to monitor customer service in a call
center or to track order fulfilment processes, marketing dashboards to track campaign per-
formance, or sales dashboards to report win rates and other information on opportunities
won and lost, and to help convert opportunities into deals.
Scorecards
Scorecards are tools used to monitor the implementation of business strategy by comparing
strategic goals with results, often including key performance indicators (KPIs) to measure
progress. Examples include balanced scorecards in general, or employee performance—or
customer service scorecards. The management instrument “Balanced Scorecard” includes
financial and non-financial performance indicators across four key perspectives to provide
a holistic view of business performance: financial, customer, internal business processes,
and learning and growth. This also helps to balance the risk of optimizing just one area and
its KPIs, while at the same time negatively impacting other parts of the business.
In the context of BI, data analysis is the systematic process of finding, exploring, summariz-
ing, interpreting, and narrating data to discover useful information, draw conclusions, and
communicate for decision support and action. It is the core task to transform raw data into
actionable information and insights. This enables businesspeople to monitor and better
understand their operations, evaluate business processes and performance, and identify
opportunities for organizational improvement. To be able to derive meaningful insights
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
11.3 Reporting and Data Analysis 319
leading to business impact, a data analyst needs analytical and critical thinking, technical
data skills, and business acumen to fine-tune this process towards valuable results.
Data analysis enables organizations to understand past and present performance or even
predict future trends and prescribe actionable strategies—the latter is typically referred to
as analytics. Depending on its focus and methodology, the following types of data analysis
can be distinguished:
Descriptive analysis
Descriptive analysis looks at historical data to understand what has happened. It involves
summarizing and reporting data about past events and making it interpretable, often
through visualizations like charts and graphs. Descriptive analysis often includes additional
information to compare actual data with benchmarks, targets, plans or forecasts, and appro-
priate reference periods to put the measured business performance into context.
Diagnostic analysis
Diagnostic analysis goes a step further by examining data to understand why something
happened. It requires deeper data exploration and analysis, such as correlation or regres-
sion analysis, to identify the causes or influences of trends and events.
Predictive analysis
Predictive analysis leverages historical data to forecast future outcomes. It relies on statisti-
cal techniques and models, including machine learning algorithms, to predict future trends,
behaviors or events based on current and historical data.
Prescriptive analysis
Prescriptive analysis recommends actions to achieve specific goals or desired outcomes. It
combines insights from all other analyses to suggest decision options and their i mplications,
often using optimization and simulation algorithms.
***
A popular way to support human data analysis in recent years has been the use of visual
analysis (before generative AI functionality being integrated into BI and analytics software).
Led by innovations from companies such as Tableau in the 2010s and early 2020s, this
functionality has differentiated modern BI software from traditional (enterprise) reporting
tools and made these products leaders in the BI software market (evident in the develop-
ment of Gartner’s Magic Quadrant for Analytics and Business Intelligence Platforms).
Visual analysis is valuable not only because it lowers the barrier to entry for business users
working with data, but also because it increases the speed and agility with which infor
mation needs can be satisfied in different parts of the organization.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
320 11 Business Intelligence (BI)
Visual analysis is an interactive and exploratory approach that focuses on the human-
centered use of computer tools for the visual exploration of data. Using this method, users
engage with their data through visual representations to gain meaningful insights. This
process makes complex data sets visible, accessible, and understandable by tapping into the
human capacity for visual perception. Visual analysis is characterized by its emphasis on
user interaction and exploration, leading to a better and deeper understanding of the data
than what is revealed by analysis using statistical analysis alone, or using traditional reports
with tables of numbers and a few charts on top. The typically iterative process in the visual
analysis cycle is illustrated in Figure 11.13.7
An example of visual analysis is shown in Figure 11.14, with each step showing the busi-
ness question asked, the analysis action performed and the resulting visual representation.8
7
Hanrahan, Stolte, Mackinlay (2007), page 9: Visual Analysis for Everyone, http://www.tableau.com/sites/default/
files/whitepapers/visual-analysis-for-everyone.pdf
8
Hanrahan, Stolte, Mackinlay (2007), page 4: Visual Analysis for Everyone, http://www.tableau.com/sites/default/
files/whitepapers/visual-analysis-for-everyone.pdf
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
11.3 Reporting and Data Analysis 321
If you have heard of visual analytics, you may now wonder if there is a difference to visual
analysis at all. Yes, there is—at least if the term hasn’t been used casually in marketing.
While visual (data) analysis focuses on the “human in the loop” as mentioned before, visual
analytics integrates automated analytical techniques using algorithms and statistical meth-
ods with data visualization and interactive analysis. The focus is on machine data process-
ing complemented by visual analysis methods for human perception and user interaction.
Now that we have discussed different forms of reporting, it is time to highlight two major
trends in BI, reporting and data analysis that involve the integration of new technologies
into leading BI and analytics software, namely augmented analytics, and generative AI.
These have been in development for years, and augmented analytics is already available in
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
322 11 Business Intelligence (BI)
some product features in leading BI platforms, with varying degrees of usefulness. The
combination of these two trends will amplify the results and save people time. In my opin-
ion, it is likely to become mainstream and generally available in leading BI software for
business users in organizations (perhaps like the rise of a well-known chatbot).
Augmented Analytics
Augmented analytics leverages artificial intelligence and machine learning to enable more
intuitive data exploration and analysis and to automate data preparation, making BI tools
more accessible to non-expert users and extending the culture of data-driven decision-mak-
ing within organizations.
Examples include two-way user interaction via spoken or typed queries through natural
language processing (NLP), generating predictions such as forecasts based on historical
data, automated anomaly detection that identifies unexpected patterns or outliers in the
data, and assisting with root cause analysis by identifying key influencing factors.
Generative AI
Generative AI will play a transformative role in reporting and data analysis, enabling new
content and insights to be created much faster, as well as narrating and summarizing rele-
vant information from complex data sets, saving time and reducing manual human effort.
Beyond traditional data analysis, I expect AI to not only be able to crunch and report data,
but also assist in building predictive models, simulations and creating synthetic data.
Finally, it will be able to provide natural language summaries which will help people under-
stand the facts better and could reduce the time it takes to make decisions, provided the
results are accurate and trustworthy.
Let’s explore the many components that can make up a full BI system or (modern) “data
stack”.
Data modeling
Data modeling software such as erwin Data Modeler, is used to define the data structure of
a DW, including schemas, fact tables and dimension tables, based on the business require-
ments and relationships in the data.
Metadata management
Metadata management comprises tools and processes for managing metadata, which is the
data about the data sources, transformations, and data structures. Metadata is an element
to enable data governance by ensuring that the data assets of an organization are well-
documented (e. g., definitions for metrics and KPIs), and making its data lineage traceable
(answering, where certain information, that is e. g. included in a report, originated from), so
that impacts of changes can be anticipated and considered accordingly. Metadata manage-
ment software such as Alation, Collibra provides data catalogues, to support data gover-
nance activities by providing a collaborative environment for capturing essential metadata
with both technical implementation information, as well as corresponding business defini-
tions.
and accessibility for analysis and decision-making. Beyond traditional ETL (Extract, Trans-
form, Load) and ELT (Extract, Load, Transform) tools and cloud-based data integration
services, this segment also includes data virtualization, data federation, API management
platforms, Middleware solutions, and real-time data processing solutions.
Cloud data integration services offer scalable, cloud-based solutions for integrating,
processing, and managing data across diverse sources, whether in the cloud or on-prem-
ise. These services facilitate the seamless data flow, enabling businesses to build and
manage complex data pipelines with features like pre-built connectors and automation to
reduce traditional infrastructure complexities and support quicker, more agile data inges-
tion. Example products include Amazon Web Services (AWS) Glue, or Microsoft Azure
Data Factory.
Data virtualization and federation technologies enhance data accessibility and manage-
ment by providing capabilities to create an abstraction layer or a virtual database, allow-
ing users to access and manipulate data from multiple sources as if working within a
single database environment and unified data model, without the need for physical data
movement or replication. Example products include Denodo, Informatica PowerCenter
Data Virtualization.
ETL tools such as Informatica, Talend, Microsoft SQL Server Integration Services (SSIS)
run the data processing workflows and perform data extraction from source systems (E),
data transformation (T), and loading (L). The last step typically writes into a database.
ELT tools and services facilitate the extraction of data (E), loading it directly into a target
data store (L), and then transforming it as needed (T), streamlining data integration and
processing for analytical readiness. Examples of cloud-based ELT tools and services in-
clude DBT, Fivetran and MuleSoft.
API management platforms facilitate the creation, management, and deployment of
APIs. These APIs provide interfaces for secure, controlled interaction and data sharing
between applications. Cloud service examples include AWS API Gateway or Microsoft
Azure API Management.
Middleware systems can act as a bridge between different (operational) applications and
databases, facilitating the seamless data flow across systems. Example products include
IBM WebSphere, Oracle Fusion Middleware, and Red Hat JBoss Middleware.
Enterprise reporting
Enterprise reporting is a software product category designed to centralize, manage, and
distribute data, primarily through reports and dashboards. Its goal is to overcome disparate
departmental information in organizations and to deliver a holistic view of business perfor-
mance by providing a centrally managed and accessible reporting platform. It integrates
with a wide range of data sources and enables access, collaboration, and secure distribution
of information. Well-known products include SAP BusinessObjects or IBM Cognos.
Advanced analytics
Advanced analytics technology provides the ability to perform sophisticated analytical tech-
niques such as machine learning, predictive analytics, and data mining to extract deep in-
sights from data. It is often implemented using the programming languages R and Python
with dedicated open-source libraries for these purposes, or commercial software packages
like SAS.
Data lakes
Data lakes are a centralized repository that allows you to store all your structured and un-
structured data at any scale. Data lakes enable the storage of raw data in its native format
until it is needed, and then tools can be used to analyze, visualize, and process the data.
Examples include Amazon S3 or Azure Data Lake Storage.
Data lakehouses
Data lakehouses are a new, open data management architecture that combines the flexibil-
ity, cost-efficiency, and scale of Data Lakes with the data management and ACID (Atomicity,
Consistency, Isolation, Durability) features of traditional databases. Data lakehouses are
designed to support both BI and ML workloads on all data types. Examples include Data-
bricks Lakehouse or Snowflake.
Let’s explore the many components that can make up a full BI system or (modern) “data
stack”.
In a large software segment like business intelligence and analytics platforms, which has
been growing steadily for decades and is all about improving business performance, it is not
surprising that there is a wide choice of tools to make people productive in their businesses,
able to access actionable information to make better decisions and turn them into a compet-
itive advantage.
Because various BI tools use different approaches and methods tailored to certain data chal-
lenges and analytical requirements, it is helpful to have some context about the capabilities
and unique features to apply appropriate solutions to meet specific business needs. Choos-
ing a particular tool or platform is a strategic decision for a significant time-period, typically
several years, with usually significant switching costs associated in case of subsequent
changes.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
11.3 Reporting and Data Analysis 327
Disclaimer: The purpose of the following list is to give you an idea of some of the widely
used BI tools that you may encounter in your work. Each tool represents only one of several
possible choices for data scientists, BI professionals, or data analysts in the respective soft-
ware product categories. Please note that there are other categories of BI software (see the
list of BI technologies above), as well as other excellent BI tools on the market. The first step
in any tool selection is always to understand your requirements, skill level and way of work-
ing. For more information on where to get a representative list of BI tools, independent
comparisons, and recommendations, see my closing remarks in this section.
Tableau by Salesforce
Tableau is an advanced business analytics tool known for its exceptional visual analytics
and data visualization capabilities. It enables users to easily see and understand their data
through visual summaries that can be used to make decisions and act. Tableau is used to
create, share, and consume interactive reports and dashboards. It stands out for its easy-to-
use interface that enables business users to quickly work with data from many different
sources and turn large and complex data into clear visual representations for faster data
analysis and communication to stakeholders.
Microsoft Power BI
Power BI is an advanced business analytics tool that connects to many different data sources,
simplifies data preparation, facilitates ad hoc analysis and the creation and sharing of inter-
active reports and dashboards. Power BI integrates well with the Microsoft Office software
services and, as a cloud platform, can conveniently deliver relevant information across an
organization.
MicroStrategy
A comprehensive business intelligence platform that provides powerful analytics and mo-
bile BI solutions. MicroStrategy is known for its high scalability and security, offering ad-
vanced analytics, mobile apps, and BI dashboards. It stands out with its in-depth analytical
capabilities and efficient data processing, especially for large data warehousing environ-
ments as the basis for standard and ad hoc reports.
KNIME Analytics
The KNIME Analytics Platform is an open-source software package offering a complete suite
for data analytics and data integration. With a user-friendly interface it simplifies analysis
and the creation of data processing workflows without extensive IT or programming skills.
It supports diverse data types and formats and enables advanced data handling, statistical
analysis, and a range of visualization options. KNIME’s extensive list of “nodes”—building
blocks for data processing—cover tasks from preprocessing and modeling to data mining
and machine learning, making it adaptable for a wide range of data projects towards data
science, automation, data integration, and capable of classic BI and reporting use cases.
***
Closing remarks: Representative information on the large BI and Analytics software mar-
ket can be found in the yearly evaluations from leading market research companies. For
example, from Gartner in its “Gartner Magic Quadrant for Business Intelligence and Analyt-
ics Platforms”, or from BARC in the “The BI & Analytics Survey”. As the software market
and the capabilities of vendors’ software products are constantly changing, it is highly rec-
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
11.4 BI and Data Science: Complementary Disciplines 329
ommended that you seek independent advice and comparisons for any tool selection and
purchase process, as well as trying them out for yourself so that you can make your own
judgment. Also, take advantage of hands-on proof-of-concept support from vendors or sys-
tem integrators to help you make your selection.
■ 11.4 B
I and Data Science:
Complementary Disciplines
The distinct business technology domains of business intelligence (BI) and data science
(DS) both use data to create value and play an important role in leveraging data for business
success. However, they do so in different ways. An initial distinction between BI and DS was
made at the beginning of this chapter. On this basis, we will now look more closely at under-
standing the differences, similarities, and complementary nature of BI and DS to help peo-
ple in both disciplines understand and work with each other and organize collaboration
properly.
11.4.1 Differences
A key difference between business intelligence (BI) and data science are in their objectives,
methodologies, and the key technologies and tools used.
Business intelligence focuses on providing actionable data to support information needs
across the business through reporting, analysis, and visualization, producing KPIs, dash-
boards and reports. BI typically looks at current and historical data to answer questions
about what has happened and what is happening, using factual input to support operational
or strategic activities and business planning. Relevant findings are often the result of put-
ting actual figures into the context of other periods, budgets, or forecasts—enabling the
PDCA cycle (Plan-Do-Check-Act), a common management method for control and continu-
ous improvement.
Data science, on the other hand, typically answers different types of questions than BI. It
draws from a broad arsenal of advanced statistical methods and algorithms, data mining,
and machine learning techniques to uncover deeper, often hidden, insights. Often, the focus
is also on future activities and outcomes, which require forecasting and the creation of pre-
dictive models. As the term science in its name suggests, exploratory and experimental
work is involved to find patterns, predict trends with high confidence, and provide in-depth
analysis of high-value business issues, while the outcome is initially uncertain. Another
difference is that data science makes use of the scientific process. This starts with formu
lating a testable hypothesis, grounded by data exploration and domain expertise. This is
followed by designing and iterating experiments using advanced analytical methods and
translating complex results into insights by rigorous testing and validation of predictive
models.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
330 11 Business Intelligence (BI)
11.4.2 Similarities
Business intelligence and data science professionals share many skills. These individuals
operate with a problem-solving mindset and require analytical and critical thinking skills.
They are passionate about deriving valuable business information and insights from data,
providing data-enabled solutions to business challenges, and facilitating decision-making to
improve business outcomes. This common purpose provides a good basis for synergies be-
tween the two disciplines.
Both professions use a variety of data handling techniques, including data collection,
pre-processing, and cleaning, to ensure the quality of the data required for appropriate re-
sults. In their day-to-day work, they perform data analysis and querying, using analytical
skills to interpret data and statistical methods to find patterns and trends. They also rely on
visualization techniques to present their findings effectively and accessible. Both roles re-
quire a foundation in data governance and ethics to ensure data security and privacy.
11.4.3 Interdependencies
Organizational integration
Integrating both BI and data science can benefit both disciplines and be very valuable for an
organization. BI’s strengths in reporting and visualization play an important role in commu-
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
11.5 Outlook for Business Intelligence 331
nicating key information to stakeholders. Data science, on the other hand, is able to find
patterns that would otherwise remain hidden for too long, as well as having predictive capa-
bilities that can guide strategic planning and innovation when communicated in an under-
standable and actionable way. Together, they can deliver a much more complete view of both
the current state of the business and future opportunities. By understanding and leveraging
the combined strengths of BI and data science, organizations can make data-driven deci-
sions more effectively and pave the way for significant progress.
Cloud-based BI
There’s a growing trend towards cloud-based BI solutions, offering scalability, cost-effective-
ness, and the ability to access data and analytics tools remotely. For example, it can simplify
the IT set-up to enable global teams to access near real-time sales, customer, and supply
chain information—from any web-enabled device.
AI and ML integration
AI and ML are increasingly being integrated into BI tools and systems, extending capabili-
ties from descriptive and diagnostic analysis to predictive and prescriptive analysis, provid-
ing deeper insights and improving decision-making.
Data democratization
Businesses will experience an increase in the democratization of data, driven by the inte-
gration of generative AI for smarter support in data analysis. Generating insights from data
will become more accessible to a wider range of business users, as asking the right ques-
tions becomes more important than ever, while technological barriers to using data are
reduced. Business users will be able to use generative AI capabilities in end-user tools such
as Microsoft Excel, Power BI or Tableau to answer more questions and perform more tasks
themselves.
Business partner
More than ever, these trends are enabling BI teams to advance from mere providers of reli-
able historical and current information, which will remain important, to strategic business
partners, that provide analytical guidance to the business, including predictive insights and
recommendations for future action.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
11.6 In a Nutshell 333
■ 11.6 In a Nutshell
Machine Learning
Georg Langs, Katherine Munro, Rania Wazir
■ 12.1 Introduction
Machine learning is a branch of artificial intelligence that follows an interesting approach:
Instead of implementing sets of rules and walking through them using static algorithms,
rules or models are learned from examples. For example, instead of designing a decision
tree by hand that determines the logical voting behavior of people based on characteristics
such as their place of residence, income, age and the make of car they own, a large number
of examples are collected, for each of which the place of residence, income, age, make of car
and voting behavior are known. This data is the training data, with the aid of which learning
algorithms generate models. Presented with a new example, such models provide a predic
tion of what the most likely voting behavior is. In this context, we call the result of the model
a “prediction”, even if it does not necessarily concern an observation in the future.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
336 12 Machine Learning
Categories (2, 3, …)
Continual value
Model ?
Image
Measurement With the help of training examples we
train a model, which assigns an output
Test fragment to each input.
Lab value
…
Description (Label)
Object
The basic building blocks of machine learning are observations such as images, measure
ments, texts, laboratory values or sound signals, and a target variable that each observation
is connected to and that a model is supposed to predict. Target variables can be discrete
categories (colors, party, diagnosis) or continuous values (temperature, size, monetary
value). In the first case we speak of a classification model that maps observations to classes,
in the second case of a regression model that maps observations to continuous values.
Research and development of models and learning algorithms that turn a large number of
examples into prediction machines has been around for a long time. The prediction engines
can get very complex, but the underlying principle is often similar: how do you compare
observations and draw conclusions about an example without a label from those examples
for which you do know the label?
If the weather report records rain in all of the surrounding districts, you don’t need to look
out the window to make a likely correct forecast that rain will fall here as well. We call this
the nearest neighbor classifier, and it is one of the simplest, but surprisingly successful,
mechanisms for using observations to make predictions. George Chen and Devavrat Shah
provide an excellent discussion about the power of this approach tracing it back to the 11th
century1 Today it is experiencing a renaissance, as it delivers surprisingly good results,
especially with very large amounts of data. It is therefore a good starting point for establish
ing models and characteristics.
1
Chen, George H., and Devavrat Shah. “Explaining the Success of Nearest Neighbor Methods in Prediction.”
Foundations and Trends in Machine Learning 10.5-6 (2018): 337–588.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
12.2 Basics: Feature Spaces 337
Model ? Model ?
Feature extraction is one of the critical components of successful prediction, and for a long
time, work on improving feature extraction has been one of the most important areas of re
search in pattern recognition and machine learning. An important part of the improvement
of prediction models occurred in this area, and one of the most influential steps — similar to
the step from manually creating rules to training, with the aid of examples — was the step
towards algorithmic selection of features (discussed below) and finally, learning algorith
mically how to extract feature based on the training data.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
338 12 Machine Learning
9.43
2.54
3.95
8.35
3.36
Feature extraction
Model ?
Prediction model
Figure 12.3 Feature extraction: An observation is converted into a feature vector. The vector is
a point in a feature space and prediction models only work on this representation. They can be
thought of as mappings from a feature space to a target space.
1. Hand crafted features: The design of feature extractors that turn obser
vations into feature vectors is carried out by experts. Features should on
the one hand contain sufficient information for the prediction model and on
the other hand do not contain any information that could have a disruptive
effect. This is especially relevant in models that are unstable if irrelevant
information is making up substantial parts of the features used as input.
In scenarios in which there are precise hypotheses regarding the relation
ship between observation and target variable, using this knowledge is an
excellent strategy to construct or select features. Hence, domain knowl
edge based feature construction continues to be relevant when there are
reasonable expectations about the predictive value of certain characteris
tics. A rule of thumb: if you know of a meaningful relationship, you should
use it.
2. Algorithmic feature selection: Instead of specifying a narrow set of
carefully selected features, selection is performed by an algorithm. It is
provided with a set of features — many of them possibly worthless or
spurious — and selects features that have predictive power. Algorithmic
feature selection can either be univariate, when each feature is assessed in
isolation from the others, or multivariate, when the features are used and
assessed as part of a larger whole. One example are bagging classifiers
such as random forests (see section 12.4.2), which as part of classification
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
12.2 Basics: Feature Spaces 339
The three dominant strategies for feature extraction were presented in the box above. The
progress from initially manual design to automatic feature selection and finally to learning
feature extractors is closely related to the development of predictive models in general.
Imagine a nearest neighbor classifier (NN classifier) that is supposed to determine whether
a car is a truck or a passenger car based on latitude, longitude and height. With a few hun
dred training examples, the classification probably works well, since the two classes, with
rare exceptions, fill characteristic regions in the feature space. If we take color instead of
length, the NN classifier collapses, since the distances are dominated by a worthless vari
able, and the height — although informative in itself — only has a comparatively small share
in the signal. The selection and scaling of the features becomes a question that we must
answer before using predictive models.
Classifiers such as nearest neighbor classifiers and, to a certain extent, Support Vector Ma
chines (SVM) suffer from sensitivity to too many non-informative features. This has resulted
in an enormous amount of effort put into the design of features for such algorithms.
A major step that improved the accuracy and reliability of classifiers and regression models
was the development of methods that can cope with a substantial number of non-informa
tive features. Examples of this are boosting models (AdaBoost)2 or bagging models (Random
Forest)3. As part of the training process, both model families assess the contribution of each
individual feature to the correct prediction and select the final features accordingly. This
can mean that different features can be used for different subsets of training examples. The
trick for both model types is to use a large number of very simple so-called “weak learners”
that only have access to part of the training data and part of the overall feature set during
training. It is a very different strategy from using a single complex model that works with
all examples and characteristics at the same time. After the training has been completed, all
weak learners can be used to predict a new observation, and their respective predictions
can be summarized as a voting result. This has a fascinating effect: weak learners who can’t
do anything with the new example because they haven’t seen anything comparable during
the training produce a comparatively even — random — distribution of votes. Those weak
2
Rätsch, G., Onoda, T., & Müller, K. R. (2001). Soft margins for AdaBoost. Machine learning, 42(3), 287–320
3
Breiman, L. (2001). Random forests. Machine learning, 45(1), 5–32
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
340 12 Machine Learning
learners who cast correct votes tend to agree with each other — which typically leads to an
often gentle but reliable win in the correct class.4
Some methods go one step further, in that they don’t even select from a set of provided fea
tures, but rather, learn to extract features themselves from the input data. Deep learning
methods such as convolutional neural networks (CNN) belong to this type of model. On a
certain class of data for which neighborhood relations are important (such as images, in
which objects are made of clusters of neighboring pixels), CNNs learn feature extractors in
the form of filters, which on the one hand are particularly well suited to capturing the vari
ability of the data and on the other hand result in filter responses that contribute to the
prediction of the target variable.
?
?
Lorries
Lorries
Height
Height
Cars Cars
Length Color
4
Bühlmann, P., & Yu, B. (2002). Analyzing bagging. The Annals of Statistics, 30(4), 927–961
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
12.3 Classification Models 341
12.3.1 K-Nearest-Neighbor-Classifier
On the one hand, the method is very simple to implement, and while the rapid retrieval of
the nearest neighbors is a challenge with large training data sets, computational power, and
effective data structures such as kd-trees have solved this to some extent. This has recently
led to a renewed increase in their relevance. However, a limitation illustrated in Figure 12.4
remains: k-NN-Classifiers depend on a meaningful distance to determine the closest neigh
bors. In feature spaces that represent diverse characteristics such as height, length, color,
or speed of cars, it is difficult to decide for a correct scaling of the different coordinate axes
in the feature space. Is speed more or less important than height when classifying between
a car and a lorry? Non-informative features can lead to a complete failure of the classifier.
Hence, the dependence of k-NN-Classifiers on informed and careful feature construction
and selection guided by expert knowledge.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
342 12 Machine Learning
Support Vector Machines (SVM), similar to k-NN-Classifiers, assume a relatively dense dis
tribution of training examples in the feature space. Instead of using all examples to classify
a new one, SVMs identify so-called support vectors (SV) during training. These are vectors
that are close to the class boundary in the feature space. Support vectors are chosen in such
a way that the distance between the points of the two classes has the maximum distance to
the separation area. The interface is a so-called “hyperplane” (it has one dimension less than
the feature space). Through their position and a kernel that determines their area of influ
ence — or the analogous distortion of the feature space — they define the class boundary.
When a new example is classified, only the coordinates of these SVs have to be used. SVMs
have been around for a long time and in some cases are still widespread in applications
ranging from the analysis of seismic signals to proteins. Again, an important limitation on
their use relates to the choice of features. SVMs are substantially affected by interfering
features and therefore careful design and choice of features is important analogously to
k-NN-Classifiers. This reduces their applicability, since it often requires resources for
feature design that are not necessary with other methods, such as the ensemble methods
described later in this chapter.
Support vectors
Figure 12.6 Support vector machines define the boundary between two classes in the feature
space using support vectors.
Decision trees initially treat each feature as a quantity that can be broken down into discrete
categories that drive the decisions at its bifurcations. During the training, a tree forms
branches, and examples travel along the branches guided by decisions made at each bifur
cation. The decision at each bifurcation is determined by a feature. If the feature is categor
ical, each category results in a branch starting at the bifurcation. If the feature is scalar, we
use a threshold value, so that an example goes one or the other way, depending on whether
the corresponding observed feature is larger or smaller than the threshold value.
During training, training examples “travel” along branches chosen based on their observed
features and the corresponding decisions made at each bifurcation. Each example will end
up in one of the branches, resulting in a set of labels in each branch. If this set only contains
a single label, we turn the branch into a leaf and assign it this very label. If the set contains
multiple labels, we attach a new bifurcation to the end of the branch, and let a different
feature further separate examples with different labels. This continues until we have either
only leaves, or until we have reached a certain allowed depth of the tree. In the latter case
the last level of branches is turned into leaves, but some may contain examples with differ
ent classes. There, the ratio of examples with a certain class can be treated as a probability
of this class being present in the leaf.
The so-called CART algorithm is a method to train decision trees using examples for which
binary labels are known.5
When we use a trained decision tree to classify a new example, for which we don’t know the
label, we let it travel along the bifurcations, routed by its features, and assign the label of the
final leaf it reaches. The category of the leaf is output as the final classification result of
the decision tree. We can use decision trees for regression tasks as well. Then instead of
assigning the example a discrete categorical label, we assign the average of the target values
observed in the training set examples that reach the particular leaf.
Decision trees are sensitive to the risk of so-called “overfitting”. This means that from a
certain branching level the decisions continue to separate the training data better and bet
ter, but this separation might work only on the training data. On new data it might actually
deteriorate the classification accuracy — the agreement between the true labels, and the
output of the classifier. This is called “overfitting”. The result is that new data are not sorted
as correctly as the training data when running through the decision tree. One means of
preventing this is so-called “pruning”, which means simply pruning the decision tree (by
removing leaves and nodes) to a branching level that is stable and generalizes well. The
success of such an approach can be tested using cross-validation methods. In Section 12.10
we will look into more details of evaluation methods. Another, usually more successful,
method is to use decision trees as elements in ensemble methods such as random forests,
which we discussed in Section 12.4.
Note that we briefly mentioned the transformation of a continuous variable (a number) to a
categorical variable (A or B) before. Let us look at this in a bit more detail. There are more
or less elegant ways to turn a continuous variable into a categorical one or vice versa. We
can threshold a scalar to turn each observation into one of two categories (smaller or larger
than the threshold). We can also line categories up, for instance A - B - C - D, and treat this
5
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Cart. Classification and Regression Trees;
Wadsworth and Brooks/Cole: Monterey, CA, USA
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
344 12 Machine Learning
as a continuous variable. While this is possible, careful attention must be paid to the im
plicit assumptions. For example, a vector space contains a structure that results from dis
tances, which is to say that the phrase “point A is between B and C” makes sense and also
means something. This is not necessarily the case with categorical variables. If you turn a
categorical into a continuous variable, it can happen that you encode properties that are not
contained in the data or their relationships. For example, if you convert red - green - blue to
1 - 2 - 3, you may be encoding a sense of order which is not present in the original data. In
the next section we will see a successful way of creating very powerful classifiers by con
verting continuous classification problems in vector spaces to categorical decisions in many
decision trees.
The development of ensemble methods such as bagging and boosting was motivated by an
effort to improve two core characteristics of classification or regression models: bias and
variance.
Bias and variance are two quantities that are relevant in the context of machine learning
and predictive models in general. They describe the capacity of a model to make repeatable
predictions on the one hand (low variance) and on the other hand to be close to the truth
with these predictions (low bias). An excellent discussion of these quantities in the context
of ensemble methods was given by Geurts et al. in 20066, and by Geman et al. in 19927.
6
Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine learning, 63(1), 3–42
7
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural
computation, 4(1), 1–58
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
346 12 Machine Learning
When training a model, the aim is on the one hand to represent the distribution of the train
ing data in the best possible way with the model by optimizing the model parameters, such
as the support vectors in a support vector machine. At the same time, we know that the
training examples are only a finite limited sample from the actual distribution. Therefore,
blindly aiming for perfect separation on the training data can lead us astray. At the same
time, we want to make sure that the model can also be generalized to new data, that is, it
also reflects their distribution, even though the new data were not available during training.
As shown in Figure 12.8, a good example of a trade-off between bias and variance can be
achieved by controlling the complexity of a model. A very complex model that is trained on
too little data can represent the distribution of this data optimally, producing a classification
error rate on the training data of 0, but the generalization to new data suffers. When the
model is used to predict new data, there is high variance in the predictions that is indepen
dent of the variance in the training data. This situation is called overfitting. It is character
ized by low bias and high variance. The opposite happens when the model is too simple. In
this case, the bias is high because the distribution of the training data is not reproduced
well, but at the same time the variance is low and the error rate for new data that was not
available during the training will be roughly the same as the error in the training data set.
This situation is called underfitting.
Random Forests (RF), introduced by Leo Breiman, are one of the most prominent examples
of an ensemble method.8 These powerful and complex classifiers, which are part of a class
of so-called “bagging classifiers” are not trained on all data at once during training. Instead,
a large number of so-called “weak learners” are trained. These are each simple: for example,
8
Breiman, L. (2001). Random forests. Machine learning, 45(1), 5–32
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
12.4 Ensemble Methods 347
they may simply make a decision based on a threshold value of a feature (such as consider
ing all vehicles longer than 4 meters to be a truck, and all vehicles up to this length a car).
The training accepts that each of these weak learners cannot provide a reliable or precise
classification result on their own: Only the combination of their predictions becomes inter
esting. To make this possible, each of the weak learners is trained on a subset of the entire
available training data set, and at the same time they can also use only a part of the char
acteristics, or even only a single characteristic, for the classification. This leads to a large
number of simple models, each of which uses only a selection of the available features.
In the case of a random forest, the weak learners are decision trees (see Figure 12.9), hence
the name. Each of the decision trees is trained on a subset of the training data set. At each
branch, a random set of features are made available to the decision tree, and the training
algorithm selects the feature that leads to the maximum gain in purity of the two data sets
resulting from the decision at this decision branch. Here, purity is used in the sense of how
many different labels are contained in a data set. A data set with examples having the same
label would be pure compared to a set of examples with many different labels. This means
that the decision — for example based on a threshold value of a feature at a branch — leads
to two sets of examples that result from the branch (cars longer or shorter than 4 meters).
These amounts are typically not varietal, but they should be more varietal than the amount
before branching. This mechanism continues until each lot (each leaf) is sorted; that is, the
training data set used for this decision tree is perfectly classified.
When training a random forest, the decision trees are each trained using a randomly se
lected subset of the training data. This means that each decision tree “sees” only part of the
data and can consequently only work well with data whose characteristics are represented
by this set (we will come a little further on in this section to an interesting effect that results
from this). At the same time, the features that are available to each of the decision nodes —
each branch — in each of the decision trees is randomly selected from all the available fea
tures. Only on this subset of features does the algorithm look for the feature that leads to
the most correct separation of the classes at this branch.
A random forest consists of a large number of decision trees that have been trained using
various examples and features. To classify a new observation, all decision trees are first
applied and thus cast their “vote”, so to speak. This can be given in the form of a vote for a
class, if the decision trees have been trained up to leaves containing only samples of a single
class. Or it is given as a probability that corresponds to the ratio of the classes in the leaf, if
the training of the tree was stopped while leaves still contain examples with different labels.
The final classification result is given by the mode of votes, the label, that obtained the ma
jority consensus of all weak learners.
Why are the two key ingredients — examples and features — chosen at random? This creates
robustness, every decision tree works well on part of the data, which improves the capacity
of the final random forest classifier. When classifying a new observation in the form of a
feature vector, trees that have already “seen” similar things tend to be correct. In contrast,
the trees for whom this observation does not come close to any of the examples they have
already seen, react more randomly. The correct class wins because the trees that tend to be
correct also tend to agree in this case, contributing to a majority, while the others produce
random noise. Random forests are among the most robust and best classifiers available, and
require very little adjustment and fine-tuning during training.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
348 12 Machine Learning
Figure 12.9 A decision tree corresponds to repeated splitting of the feature space. When a new
example with an unknown class is routed through the decision tree, it ends up in one of the leaves
that correspond to one of the regions in the feature space.
Before discussing the advantages of the random restriction of the features, let us introduce
a second very central ability of random forests: the algorithmic selection of relevant fea
tures. Each time a decision tree performs a branching, the training data is separated follow
ing the decision rule. This allows us to measure and compare the purity of the labels in the
set before the branching decision and the two sets after the decision (Figure 12.9). We
measure the purity of labels in the form of the so-called “Gini Impurity”, which essentially
records how evenly the two classes are distributed in a set. If both classes appear equally
often, this leads to a high Gini impurity. On the other hand, if there is only one class, the
Gini Impurity is zero. This gives us a tool with which to record the decrease in Gini Impurity
each feature has contributed to throughout the Random Forest. It results in the Gini Im
portance of a feature. The Gini importance is high if it has often contributed a lot to improve
class separation, and vice versa.9
9
Breiman, L. (2001). Random forests. Machine learning, 45(1), 5–32
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
12.4 Ensemble Methods 349
The simple observation that the training mechanism tends to select informative features,
while not using those that do not contribute to a correct classification, leads to the inter
esting property of random forests of being very robust against worthless features. They are
either never used, or the trees which rely on them tend to lose out in the final voting.
Gini Importance gives us the option of looking for features with a multivariate relationship
to the target variable in the case of complex classification issues. Sometimes, as in func
tional brain imaging, that’s even the primary goal. The random selection of subsets of fea
tures plays a role here insofar as it means that even closely correlated traits have high Gini
importance values, since they can occur independently of one another in different trees.
Without random restriction of the respective features, only one of the correlated character
istics would be used, which would lead to a loss of robustness and the “disappearance” of
potentially informative features.
Training
10
Freund, Yoav, and Robert E. Schapire. “A decision-theoretic generalization of on-line learning and an application to
boosting.” Journal of computer and system sciences 55.1 (1997): 119–139
11
Langs, G., Röhrich, S., Hofmanninger, J., Prayer, F., Pan, J., Herold, C., & Prosch, H. (2018). Machine learning: from
radiomics to discovery and routine. Der Radiologe, 58(1), 1–6
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
12.5 Unsupervised learning: Learning w
ithout labels 351
We will get to know a completely different level of feature creation in the next section: the
algorithmic construction of features using neural networks. Before we look into them, lets
discuss what machine learning can do if we don’t have labels, and instead, only observa
tions in the form of data points.
12.5.1 Clustering
One of the simplest unsupervised learning approaches is clustering. Here, the goal is to
identify clusters of examples that are more similar to each other than they are to examples
in other clusters. The two key ingredients of clustering are a similarity function that cap
tures relevant characteristics of the data, and an algorithm to assign examples to clusters
based on this similarity function.
For the similarity function, which enables the comparison between any pairs of examples,
it is common to use a distance measure in the feature space. Frequently, we use the Euclid
ean distance between two datapoints (more on this in the next section). The clustering algo
rithm then uses these distances to identify groups of examples, such that examples are
close by to each other within the group, while the groups themselves are far apart.
The k-means clustering method is one example in the family of so-called expectation maxi
mization methods to perform clustering12. In k-means clustering, training examples are first
randomly divided into initial clusters. This is independent of their features or distances, and
we call it a random initialization. In the second step, the mean of each cluster is formed, and
all examples are re-assigned to clusters based on their proximity to these means. Each mean
12
Do, C. B., & Batzoglou, S. (2008). What is the expectation maximization algorithm?. Nature biotechnology, 26(8),
897–899
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
352 12 Machine Learning
determines the cluster membership of the examples. In the third step, the means are recal
culated based on the new subdivision of the training data. Steps two and three are repeated
until the means and memberships “converge”: That is, if new iterations don’t change any
thing, or they change so little that we determine the clustering as being finished. For a de
tailed description of using k-means clustering for text classification, see Section 17.4.2 in
Chapter 17.
In manifold learning, we are not necessarily interested in grouping data, although that can
be a secondary step. Instead, we tackle the question of learning a meaningful metric captur
ing the relationships in the training data in the first place Just like with k-means, we might
start out without a good metric reflecting relationships between datapoints. And just like
k-means, we may have to resort to determining a similarity or affinity between a sub-set of
pairs of the overall training data. A similarity measure between all pairs is either not ini
tially available, or we might be doubtful as to whether it is meaningful.
An example is the Euclidean distance between two points on the earth’s surface. If we are
interested in travel time by plane between two points, then this Euclidean distance is only
informative for close by points. For points far apart it represents the distance of drilling
through the earth – exciting, but not a travel experience we are after. Therefore, let’s trust
that the small distances between points correspond to travel time, but that we are only in
terested in traveling within a narrow band around the surface, or on the surface on which
these points are situated.
This surface is called a “manifold”, and within it, locally the Euclidean distance makes
sense. There is a more formal definition of a manifold, but the key point to remember is that
it is a Euclidean space only locally. Manifolds are a tremendously helpful tool for thinking
about our world and representing the relationships between observations.
In the case of the earth, the solution to our problem is to realize that the manifold that cap
tures the relevant distance is a sphere. On this sphere, we can use the geodesic distance
along the surface, to plan for optimal routes. The sphere is the manifold we are traveling
along. In other learning scenarios we might not know the shape of the manifold a priori.
Here, manifold learning translates a set of pairwise distances into a metric, so that we gain
a representational space in which the Euclidean distance represents meaningful travel dis
tances within the manifold. This is particularly relevant if we are observing networks,
which can be thought of as a large number of pairwise relationships. Manifold learning
approaches include isomaps13, diffusion maps14, or methods we will discuss later in this
chapter, such as autoencoders.
To be useful for subsequent clustering and analysis, manifold learning approaches typically
transform all the points that make up the manifold into a new so-called embedding space.
In this embedding space, each point is represented by a point, but now the Euclidean metric
13
Balasubramanian, Mukund, et al. “The isomap algorithm and topological stability.” Science 295.5552 (2002): 7-7
14
Coifman, Ronald R., and Stéphane Lafon. “Diffusion maps.” Applied and computational harmonic analysis 21.1
(2006): 5–30
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
12.5 Unsupervised learning: Learning w
ithout labels 353
within the entire space represents the metric within the manifold, or an approximation
thereof. This is what geographical maps are: an embedding of the geodesic distance on the
earth’s surface, into a two-dimensional embedding space, so that we can use a ruler to mea
sure the surface level travel distance.
Generative models, such as those illustrated in Figure 12.14, are another approach in the
area of unsupervised learning. The aim is to create models that can actually generate real
istic new examples resembling those observed in the training data. Generative models are
trained on examples without labels, and assume an underlying but unobservable distribu
tion of these data. An explicit density model estimates this underlying distribution based on
the training data, and typically results in a parametric representation of this estimate, such
as a Gaussian Mixture Model (GMM). Training approaches for this type of model are, for
instance, based on maximum likelihood15. Alternatively, we may train a model that is able
to generate examples that are hard to discriminate from the observed examples stemming
from the training distribution. This would be referred to as an implicit density model.
As these models are relevant in different contexts, and a substantial amount of technolog
ical development is conducted to further advance them. This book contains a chapter specif
ically focused on generative artificial intelligence. Here, we will discuss a few specific meth
15
Le Cam, L. (1990). Maximum likelihood: an introduction. International Statistical Review/Revue Internationale de
Statistique, 153–171
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
354 12 Machine Learning
ods that form building blocks of this technology, while Chapter 16 goes into far more detail
on this topic.
■ 12.6 A
rtificial Neural Networks and
Deep Learning
In the following section we will discuss a family of models that has become a dominant ap
proach in classification, prediction, or the learning of structure from examples: artificial
neural networks. We will first discuss their primary building block – the perceptron. The
section that follows will describe how to construct, train, and use artificial neural networks
based on perceptrons.
The perceptron16 is motivated by biological neurons and works with two very simple mech
anisms: the merging of many input values, and an internal nonlinear function that maps the
resulting value to an output. The input values are scaled by weights, then summed up, and
the result is mapped to the output value by a nonlinear function, i. e., a function that is not
a linear mapping between input and output. This very simple mechanism enables both con
tinuous mapping, and classification. Figure 12.11 illustrates a perceptron.
(∑ + n+1)
1
igure 12.12
F
A perceptron processes input from
multiple other units that might be obser-
vations, or the output of other percep-
trons. The processing consists of a
n+1 weighted sum of the input values and a
subsequent nonlinear function.
16
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain.
Psychological review, 65(6), 386
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
12.6 Artificial Neural Networks and Deep Learning 355
The weights are factors with which each input is multiplied before all of them are summed
up. Another way of viewing them is as weights of the connections that scale the values
while they are transmitted from the output of one perceptron to the input of the receiving
one. In addition to the weights applied to the incoming values, each perceptron has an addi
tional additive value, the so-called bias. This is added independent of the input values.
During the learning process, the weights and the biases are changed and optimized in such
a way that with a given input feature vector, the output of the perceptron corresponds to the
desired class label of training examples or achieves this in as many cases as possible.
The perceptron serves as a building block in larger artificial neural network architectures,
when the output of one layer of neurons is used as the input of the next layer. Such networks
are also called multilayer perceptrons, a form of artificial neural networks (ANN), because
they consist of several layers. We divide these layers into an input layer with input units, a
number of hidden layers with hidden units and a final output layer with output units. In the
case of the multilayer perceptron, all of the units are perceptrons, and often we refer to them
as neurons.
When moving to higher layers, the number of perceptrons can be increased or decreased.
Later, we will see how this change of layer widths can be used to force a network to learn
about the inherent structure in a large number of observations.
In this framework, a classification process is typically modeled as a multilayer perceptron
which, in its last layer, contains a number of perceptrons equal to the number of classes. To
classify an input feature vector, its values are entering the multilayer perceptron in its first
layer. Here, the number of perceptrons is equal to the number of features. From there the
values are passed forward through the networks and its perceptrons, until they reach the last
output layer. The output neurons are used as indicator function, to determine the class the
network predicts. Each is assigned a value by the forward pass of the input values through
the networks, and the output perceptron with the highest value determines the class.
To train a multilayer perceptron we use a set of training examples with known feature vec
tors, and known labels. During training, feature vectors are the input to the network, while
labels determine the desired output. We pass the feature vector through the network, obtain
the class prediction of the network by determining which output perceptron has the highest
value, and compare this label with the desired label known in the training set. For instance,
for a five dimensional feature vector (1.3 0.3 2.5 1.2 5.1) the correct label among four
possible labels is ‘3’. The desired output of the network would therefore be (0.0 0.0 1.0 0.0),
but the actual output of a not-yet-perfect network may be (0.3 0.5 0.3 0.2). We would like
to change the weights so that ultimately the correct class “wins”, as would be represented
by an output vector such as (0.2 0.4 0.8 0.3), where the largest value is in the third position.
We can perform training using an algorithm called backpropagation17. This algorithm uses
the difference between the desired and current network output (for example, comparing
17
Hecht-Nielsen, R. (1992). Theory of the backpropagation neural network. In Neural networks for perception
(pp. 65–93). Academic Press
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
356 12 Machine Learning
(0.3 0.5 0.3 0.2) and (0.0 0.0 1.0 0.0)) to adjust the weights throughout the network. This is
achieved by starting from the output layer and translating the difference between the deliv
ered and desired vectors to gradients in the parameter space of a layer. The parameter space
consists of all the values we can change during training. Here these are the weights of the
connections and the biases of each perceptron. This is repeated until the input layer is
reached, and then repeated again with a new training example.
While an immense variety of models can be implemented with this architecture, neural
networks initially suffered from the very slow convergence of the optimization algorithms
with which they were trained, which also made it almost impossible to train very deep net
works. This even led to a time of relatively low activity in AI research: the so-called “AI
winter”18.
Figure 12.13 A multilayer perceptron consists of several layers of perceptrons. The illustration
shows the connections to the top perceptron in each layer as an example. During the training,
the weights of these connections are optimized using training examples in the form of pairs of input
and output values.
Figure 12.14 During the training, the network is given a large number of training examples in the
form of feature vectors (input vectors of the first network layer) and class memberships (coded as
values of the output layer). In this way the training algorithm learns the optimal values of the weights
of the network.
18
Schank, R. C. (1991). Where’s the AI?. AI magazine, 12(4), 38-38
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
12.6 Artificial Neural Networks and Deep Learning 357
Research on neural networks increased again around 2010 for three reasons. The amount of
available computing power had increased enormously and in particular hardware architec
tures such as those used in graphics cards turned out to be optimal computing hardware for
training neural networks. Second, some algorithmic advances had been made that were
nearer to solving the problems one had faced prior to the onset of the AI winter. For exam
ple, new methods to train networks without being hindered from the progress of the opti
mization by vanishingly small gradients during backpropagation had been developed. And
third, the availability of data, especially image data, increased dramatically. This led to the
ability to effectively train actually deep multilayer perceptrons on a large amount of data.
A rapidly growing field dealing with deep learning had emerged.
One of the first and most successful architectures of the new deep learning era was the
Convolutional Neural Network or CNN. CNNs are multilayer perceptrons with an architec
ture that leaves out a relatively large number of connections between the layers. The CNNs’
own connection pattern results in a behavior of the layers that corresponds to a filtering of
the input layer below. CNNs typically assume spatially structured inputs, such as images.
You need this property to make sure that the neighborhoods the filter is considering are
actually relevant. Here, the bottom layer can be thought of as simple edge filters. The second
layer works on the output values of the first layer and thus filters the filter responses to the
input — typically an image — according to the first filter bank. Each layer is connected with
the one below in such a way that the weights of the connections result in a series of filters
that are applied one after the other, each on the previous layer. In Chapters 17 and 18 CNNs
are discussed in the context of language processing and image processing, respectively.
The capability of CNNs to learn efficiently from large numbers of training examples rests on
several assumptions about the real world, and advances in the training strategy of artificial
neural networks. The first assumption is an observation about the visual world that sur
rounds us. CNNs and the repeated filtering – and downsampling – of input values reflect an
observation that can best be described as compositionality. That is, we are surrounded by
objects and observations made up of components that occur in different configurations. For
example, a table with a vase is not fundamentally different from a table with a plate. Both
overall objects share sub-objects (the table). The second assumption is that a filter – a com
ponent in the lowest layer of a CNN – that works on one part of an image, might work
equally well on a different part of the image. You don’t have to learn how to detect edges in
the lower left corner or in the upper right corner of an image independently. You can just
assume that you need an edge detector in your filter repertoire. Since you are adjusting
weights when training an artificial neural network such as a CNN, the strategy of learning
such a filter only once, and reusing it across the input image, amounts to weight sharing.
Finally, the decision of approaching the mapping from an input image to a label such as
“cat” via the use of filters exploits the assumption that neighborhood matters, and that the
input filters only need to be connected to a relatively small portion of the image, as long as
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
358 12 Machine Learning
we apply them analogously across the image – like a parsing of the image with a filter.
These three assumptions essentially reduce the number of connections for which weights
have to be trained, in contrast to a fully connected network. This reduces the parameters,
and accordingly makes training based on a set number of examples, in a set amount of time,
easier.
In a train of thought similar to the bagging strategy, CNNs invest the computing power not
in a single, very complex layer of perceptrons, but in the stacking of relatively simple filter
banks. While lower levels learn to respond to more primary visual properties such as con
trast, edges or light frequencies, layers above learn how to deal with compositions of these
elements. Thus, on the one hand CNNs can represent the visual variability present in the
training data, and on the other hand they become very good at mapping this representation
to the correct class affiliation.
In general the training of artificial neural networks aims at minimizing a loss function that
reflects how well the network performs. Typically, this is measured by how close the output
of the network comes to the desired output, when training examples are processed for which
input and output are known. Imagine the loss function as a landscape with hills and valleys.
The position coordinates of this landscape are the parameters of the network: the weights
we want to optimize. The height at each point of the landscape could then indicate the aver
age error the network makes, the loss. To train a model means to search for the deepest
point in the landscape, the coordinates of this point are the optimal parameters of the net
work. In real training situations we never see the entire landscape, but only a tiny neigh
borhood around the current position (which is defined by the current network parameters).
In each training step, we update the parameters based on one training example so that we
travel downhill in this tiny neighborhood, and then draw the next training example. In a
landscape, or, more formally, a function l(x) over the coordinates x, the gradient at each po
sition x is the direction in which l(x) increases most. We go the other way to move a smaller
loss value, and thus this type of optimization is called gradient descent optimization.
The training of CNNs, as it is done today, addresses a problem known as the vanishing gra
dient problem. If a training algorithm for a CNN adjusts weights by for instance backpropa
gation it compares the output of the CNN with the desired output given in the training data.
The difference between the two is used to update the weights in the last layer of the network.
Next, propagating the output back to the second-last layer again creates a difference be
tween output at that layer and desired output. This difference is again transferred into a
gradient, which is used to determine the degree and direction with which to update the
weights in this layer. This process is repeated until the first layer of weights is reached. In
theory this strategy converges at a good solution, but in practice the gradient vanishes as it
is propagated back through the layers. That is, the gradient — the weight updates — gets too
small to converge in any reasonable time. Correspondingly training in a limited time frame
becomes infeasible. Training CNNs can address this issue by initiating training in a layer-
wise fashion. In this approach, layers of the CNN are trained successively each with the
objective of representing the variability delivered by the previous layer when it processes a
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
12.6 Artificial Neural Networks and Deep Learning 359
lot of training examples. Only after layers are initiated this way, is supervised training for
the output labels performed on this pre-trained network.
The following strategy has proven to be successful in many different contexts: “first learn to
represent the variability of the training data, and then learn to link it with the desired target
variables”. It has led to an extremely useful approach called “transfer learning”, discussed
in detail in 17.4.4. In transfer learning, parts of ANNs trained to solve one problem are re
used as initialization for ANNs that are trained to solve a different problem. In image pro
cessing it has turned out that lower layers are surprisingly transferable across problems.19
A contrary problem to vanishing gradients is that of exploding gradients occurring in RNNs.
This occurs when gradients accumulate and result in very large updates to weights during
training, and a correspondingly unstable behavior of the network. Aside from revisiting the
network architecture itself, one strategy to remedy this is gradient clipping, where if a gra
dient becomes too large, it is scaled down. Exploding gradients are relevant in the training
of recurrent neural networks (see section below) where long short-term memory (LSTM) is
one of the strategies to counter them.
Overall, CNNs have led to an enormous increase in the performance of classifiers and re
gressors, which are particularly successful in areas where data with spatial structure is
processed.20
By now a wealth of CNN-based architectures exist, that enable scaling and improving this
type of models beyond the initial results. Early examples are LeNet introduced by LeCun in
199821, and AlexNet by Khriszevsky in 2012 that yielded outstanding results at the image
net competition at that time22. To achieve larger models with more and wider layers, sparse
CNN structures were approximated with a dense construction and so-called inception
modules, parallel, alternative convolution layers in architectures such as the GoogLeNet.23
ResNets were proposed to counter the vanishing gradient problem by using residual con
nections or skip connections that bypass layers, and transport the gradient to lower layers,
without getting diluted.24 DenseNets advance the concept of ResNets and connect all layers
with all other layers by skip connections, while each layer is relatively narrow, adding only
a small feature set to the resulting collective models. The final classifier then has access to
all feature maps across the entire model.25
19
Yosinski, J., Clune, J., Bengio, Y., & Lipson, H. (2014). How transferable are features in deep neural networks?
Advances in Neural Information Processing Systems, 27, 3320–3328
20
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural
networks. Advances in neural information processing systems, 25, 1097–1105
21
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11), 2278–2324
22
Krizhevsky, Alex, Ilya Sutskever, and G. E. Hinton. “ImageNet classification with deep convolutional neural
networks. Advances in Neural Information Processing Systems 25 (NIPS 2012).” (2012): 1097–105
23
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A.
(2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern
recognition (pp. 1–9)
24
He, Kaiming, Xiangyu Zhang, and Shaoqing Ren. “Deep Residual Learning.” Image Recognition (2015)
25
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In
Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4700–4708)
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
360 12 Machine Learning
Recurrent neural networks (RNN), depicted in Figure 12.15, are a family of network archi
tectures that are particularly well suited to processing sequential data such as speech, writ
ing or sound signals. Like the previously discussed architectures, an RNN consists of input
units, hidden units and output units. In the case of RNNs, each input at each timestep is
used to calculate an output for that timestep, but the output is not only based on the input
word: the hidden state of the previous calculation based on the previous input is also used.
This means that, as processing progresses, the hidden state builds up a representation of
the entire sequence processed so far. Consider the example of translating the sentence “Die
Hauptstadt Österreichs ist Wien” into “The capital of Austria is Vienna”. By the time the
RNN begins processing the input word “Vienna”, the “context” of the previous words can
already provide a good predictive contribution to the translation of that word into its Ger
man equivalent, “Wien”. Successful areas of application are language processing, transla
tion or any processing of sequential data whose individual components or points in time are
related to one another. Sequence to sequence learning is explained in the context of lan
guage processing in Section 17.4.3.
hidden-unit1 hidden-unit 2
output1 output 2
input1 input 2
Figure 12.15 Recurrent neural networks: In each step, the output is determined on the one hand
by the current input and on the other hand by the previous “hidden state” of the network. Hidden
states can be understood as a memory that contains information about previous calculations and
inputs. Since the architecture is always the same, the network can be visualized as shown on the
left, while the representation on the right is often referred to as “unfolding”, which is only possible
in full in the case of a finite RNN.
How do we represent words? The two most common approaches are to use “one hot” or “dis
tributed word vector” representations. In a one-hot representation, a vector is used with a
length corresponding to the number of words in the training vocabulary. That is, every word
from every training document has a place. A word is encoded in such a vector by marking a
1 in the slot for that word and a 0 for all other word slots. Distributed word vectors, such as
word2vec26,are an alternative and very effective representation that has become more and
more widespread recently. They are created by embedding words in what is known as an
embedding space, in which the distance between word vectors reflects their semantic rela
tionship.27 Thus, words with similar meanings will be clustered together, whereas those
26
https://www.tensorflow.org/tutorials/representation/word2vec
27
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space.
arXiv preprint arXiv:1301.3781
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
12.6 Artificial Neural Networks and Deep Learning 361
with completely unrelated meanings will be spread further apart. Apart from creating a
meaningful distance in the representational space, it has several other advantages over a
one-hot representation that are explained in Section 17.3.2 in this book.
In principle, RNNs can also be used in cases that do not contain an obvious sequence, such
as when processing images. Here, identifying the sequence becomes one of the training
tasks, like an RNN reading house numbers and learning how to sequentially control the
visual attention to accomplish this task effectively.28 A good and critical overview of RNN
architectures is given by Lipon et al. 2015.29
Standard RNNs, like other architectures, suffer from the problem of the vanishing gradient
during training. Recall that this means the corrections that change the weights in the net
work during learning, when comparing the network result with the desired target value
through backpropagation, become smaller and smaller the deeper one goes back in the
network. In the case of RNNs, this makes it more difficult to learn relationships over longer
periods of time: The influence of earlier inputs is “forgotten”, the further one progresses
through the input sequence.
One technique to improve this behavior and to use both short-term and long-term memory
in the network consists of so-called long short-term memory (LSTM) cells (Figure 12.17).30
The LSTM architecture is found today in a large number of applications that solve language
processing or translation problems. The idea here is not just to simply take over the cell
status and link it to the new observation, but to control its change and influence in a more
differentiated manner and to learn this control with the help of the training examples. The
28
Ba, J., Grosse, R., Salakhutdinov, R., & Frey, B. (2015). Learning wake-sleep recurrent attention models. arXiv
preprint arXiv:1509.06812
29
Lipton, Z. C., Berkowitz, J., & Elkan, C. (2015). A critical review of recurrent neural networks for sequence learning.
arXiv preprint arXiv:1506.00019
30
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735–1780
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
362 12 Machine Learning
cell status is carried over from the previous step and the current observation. Together with
the old output, it controls the forgetting of this status by multiplying it by a number from 0
to 1. The new observation then in turn contributes to the change in the cell status and only
then are the cell status and new observation used together in order to calculate the output
at the current point in time.
A very good explanation of the architecture, upon which Figure 12.17 is based, comes from
Christopher Olah31.
× + × +
tanh tanh
× × × ×
tanh tanh
× + × +
tanh tanh
× × × ×
tanh tanh
× +
tanh
× × 5. The output of the LSTM cell
is composed of the processed
tanh
new observation, and a contribution
of the cell state.
Figure 12.17 The components of an LSTM cell. The change in the cell status is controlled by three
aspects: by a new observation, by forgetting the old status, and by the influence of the status itself
on the new output of the cell (Illustration based on C. Olah: “Understanding LSTM Networks”, 2015).
31
Olah, C. (2015). Understanding lstm networks. http://colah.github.io/posts/2015-08-Understanding-LSTMs/ last
accessed Nov 2021
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
12.6 Artificial Neural Networks and Deep Learning 363
We have learned about CNNs that translate an input image to an output target that might be
either just a single label, or a short representative vector. A different approach is based on
the idea of autoencoders.3233 There, input data is passed through a number of encoder layers
to finally result in a low-dimensional representation of the input. The right-hand side of the
network, also depicted in the right side of the autoencoder in Figure 12.18, is composed of
decoders that map this low dimensional representation back to data resembling the input.
The network is trained based on a loss function that penalizes differences between input
and output. The constraint that the network has to pass the information through a narrow
bottleneck layer fosters an efficient representation of the variability in the training data.
Similar to CNNs, this principle can be applied to image style data, where encoders and de
coders consist of convolution and deconvolution layers.34 This architecture can serve two
purposes. First, it can be used to create a low dimensional embedding of the training data,
that itself is the desired result of the training. In many examples the need to represent vari
ability efficiently leads to an embedding in the bottleneck layer, whose neighborhoods rela
tionships are meaningful: Examples that are close by share feature characteristics. These
representations could then be used for clustering, as described in Section 12.5.1. The sec
ond use of this type of architecture is to map input images to an image representation such
as a label map. This is the kind of usage applied in image segmentation. Here, the training
is not based on single images, but on images paired with the known label maps in the train
ing set.
In practice the latter type of mapping led to workable but fuzzy mappings, as the level of
detail that was transported through the autoencoder suffers. To counter this, there are two
strategies that are now wide-spread and have led to a tremendous expansion of the role of
these image-to-image type mapping models in practice. First, the loss of detail can be coun
tered by skip connections, which connect layers in the encoder and decoder part of the
network. The input is parsed through the usual sequence of convolution: down-sampling,
deconvolution, and up-sampling layers that learn common structural elements on different
levels. At the same time the signal is also passed on directly to the opposite layer. This en
ables the preservation of detail, as the direct input from the skip connection and the input
coming from previous deconvolution and up-sampling layers are combined (Figure 12.18).
This class of networks, called U-nets due to the typical visualization of the layers, was pub
lished in the context of labeling microscopy imaging data.35
The second component of strategies to train models that map from inputs to complex out
puts such as label maps or images concerns the training, and is described in the next sec
tion.
32
Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. “Reducing the dimensionality of data with neural networks.”
science 313.5786 (2006): 504–507
33
Baldi, P. (2012, June). Autoencoders, unsupervised learning, and deep architectures. In Proceedings of ICML
workshop on unsupervised and transfer learning (pp. 37–49). JMLR Workshop and Conference Proceedings
34
Masci, J., Meier, U., Cireşan, D., & Schmidhuber, J. (2011, June). Stacked convolutional auto-encoders for
hierarchical feature extraction. In International conference on artificial neural networks (pp. 52–59). Springer,
Berlin, Heidelberg
35
Çiçek, Ö., Abdulkadir, A., Lienkamp, S. S., Brox, T., & Ronneberger, O. (2016, October). 3D U-Net: learning dense
volumetric segmentation from sparse annotation. In International conference on medical image computing and
computer-assisted intervention (pp. 424–432). Springer, Cham
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
364 12 Machine Learning
In the previous section we briefly touched on the issue that a loss function – the criterion
with which we train an artificial neural network – can have trouble capturing all that we
want the network to learn. If we penalize the sum of all differences between the input and
the output of an autoencoder, we might end up with a model that generates the coarse im
age, but that skips small details, because in terms of the loss function and the resulting
gradient they do not have a big impact. At the same time, if we create a loss function that
specifically captures small details, the training might have trouble learning anything at
the very beginning. In short, adapting the loss function while we are training is a strategy
worth looking at.
( , )
Figure 12.18 Autoencoders map images to images. They consist of an encoder and a decoder
connected by a low-dimensional bottleneck. U-nets contain skip-connections that map directly from
encoder to decoder circumventing this bottleneck. Training of this type of networks can either rely
on the difference between input and output or on an adversarial discriminator.
Adversarial training is such an approach. Instead of using the difference between the actual
output and the desired output as a means to steer the gradient during training, we train an
adversary to the mapping network. The adversary is a discriminator that determines
whether the output is an actually observed training example, or an output generated by the
model. As we train our main model, we also update the adversary, so that while the main
model becomes better at generating output that resembles real data, the adversary becomes
better at discriminating between model-generated output and real examples. Thereby, it can
provide a useful gradient throughout the training, while both model and discriminator be
come better. Figure 12.18 illustrates this scheme. The autoencoder or U-net maps from an
input image c to an output image x. The training of this model can either be driven by a loss
function capturing the difference between the output x and the desired output sampled from
the training data or by a more sophisticated mechanism based on an adversarial discrimi
nator. In adversarial training, the discriminator learns to discriminate between a generated
output x, given input c as a so-called “condition”, or a genuinely observed desired output
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
12.6 Artificial Neural Networks and Deep Learning 365
sampled from the training pairs in the form of c and x. Adversarial training has led to im
pressive models translating images to label maps, or even the other way around.36
For a long time a limitation of deep learning approaches was the lack of generative models
that could capture the variability in real world data, and at the same time work with low
dimensional representations in the generator space. Autoencoders are to some degree us
able, since, after training, the decoder in principle maps from a low dimensional generator
to the actual observation space of images.
A leap in capability came when generative models met adversarial learning in the seminal
work of Goodfellow and colleagues, who proposed Generative Adversarial Networks (GAN).37
Figure 12.19 illustrates the scheme of GAN training. For imaging data, one can imagine the
generator as a forger that is working on generating realistic images, by sampling vectors z
out of a uniform distribution in the latent space, and running these vectors through a decon
volution and upscaling network that is the generator. At the same time we train a discrimi
nator to correctly classify an image into two categories: real example, or generated example.
You can imagine this as the police, trying to catch the forger. The reward function of the
generator is determined by the discriminator, by penalizing the generation of examples that
are correctly detected as forgeries by the discriminator. While training progresses, both
generator and discriminator get better. The generator defines the model distribution implic
itly through the examples it generates. If generator and discriminator models are differen
tiable — that is, if we can formulate a gradient function as explained in Section 12.6.2—this
can be implemented in a very efficient fashion. Training progresses in two gradient steps:
updating the discriminator to get better at detecting forgeries, and updating the generator
to get better at fooling the discriminator. One can view this training as a game between
generator and discriminator, with a solution reached in the so-called Nash equilibrium
where each player wishes to minimize the cost function that depends on parameters of both
players while only having control over its own parameters.
36
Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks.
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1125–1134
37
Goodfellow, I. et al. (2014). Generative adversarial nets. Advances in neural information processing systems, 27
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
366 12 Machine Learning
Generator Discriminator
Is it real or generated?
Training examples
Figure 12.19 Generative adversarial networks (GAN) are trained to generate examples that are
indistinguishable from real world training examples. The cost function they are trained with is based
on a discriminator that learns to discriminate between generated (“fake”) examples and real
examples sampled from the training data. As training progresses, both improve, until they reach a
so-called “Nash equilibrium”.
Not unlike in the case of the autoencoder, this leads to useful properties of the representa
tions in the latent z-space. One example is that we can perform vector arithmetic such as
adding and subtracting vectors in the z-space and obtain seemingly meaningful results in
the observation space.38
GANs have turned out to be very powerful in representing rich variability in the training
data. Thus, one application of GANs is anomaly detection, where a GAN is trained to gener
ate medical imaging data of healthy anatomy, and the residual between the genuinely ob
served new images, and the best approximation generated by the GAN, is used as a marker
of anomalies potentially linked to disease.39 Another application area is the translation be
tween different observational spaces such as text and images40 or even languages41.
Training GANs can be difficult, and a number of new approaches to facilitate this training
have recently emerged. One of the possible limitations of GAN training is a situation resem
bling a catch me if you can game called mode collapse. In this case, the generator becomes
very good at generating one particular example, and then when the discriminator catches
up, the generator switches to a new example and repeats this strategy. Mode collapse is easy
to recognize, as after training the generator can only generate a few different examples, and
traveling in the latent space does not lead to a continual transition between generated im
38
Radford, Alec, Luke Metz, and Soumith Chintala. “Unsupervised Representation Learning with Deep Convolutional
Generative Adversarial Networks.” arXiv e-prints (2015): arXiv-1511
39
Schlegl, T., Seeböck, P., Waldstein, S. M., Schmidt-Erfurth, U., & Langs, G. (2017, June). Unsupervised anomaly
detection with generative adversarial networks to guide marker discovery. In International conference on
information processing in medical imaging (pp. 146–157). Springer, Cham
40
Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., & Metaxas, D. N. (2017). Stackgan: Text to photo-realistic
image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE international conference
on computer vision (pp. 5907–5915)
41
Conneau, A., Lample, G., Ranzato, M. A., Denoyer, L., & Jégou, H. (2017). Word translation without parallel data.
arXiv preprint arXiv:1710.04087
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
12.6 Artificial Neural Networks and Deep Learning 367
ages. One learning strategy which reduces the risk of mode collapse, as well as tackling
vanishing gradients in GAN training, is to replace the discriminator with a critic, that in
stead of only penalizing falsely classified cases, provides a critique, that also provides a
gradient to the training if a correctly classified image is generated. The theory for this ap
proach was proposed in Arjovsky, Martin, and Léon Bottou42, and its applications to GANs
was proposed in Arjovsky, Chintala and Bottou43. It has also been used to improve anomaly
detection in medical imaging data.44
Methods based on GANs have advanced rapidly, and here we discuss two particularly rele
vant examples. Cycle GANs are an approach to learn a mapping between domains, even if
no paired examples are available.45 Cycle GANs can learn to map between imaging charac
teristics of different scanners in medical imaging, even if no paired imaging data is avail
able. Such a situation is realistic in practice, since studies scanning patients multiple times
in different scanners for the sole purpose of model training are rare and often limited.46
Applications reach beyond imaging data to, for instance, language translation models
trained from large so-called corpora of language data but without any ground-truth sen
tence pairs (known as parallel corpora).47 The intuition of cycle GANs is that even-though no
paired examples are available, the structure in the two domains is similar enough that a
representation of their variability can be essentially aligned. Imagine two languages, and
the use of words in the context of other words: If we define neighborhood relationships
based on the co-occurrence of words in close proximity in texts, the resulting representation
of the overall structure of the language might exhibit sufficient similarity to align these
manifolds.
The second hugely influential extension of GANs are so-called Style GANs. These networks
split the generation into two steps, first a mapping from a latent space into an intermediate
space, and then a mapping from this intermediate space to the final actual observations.
The second step injects the intermediate values to different layers of the synthesis net
work, and thereby enables a more independent steering of different aspects of the generator
compared to standard GANs. Style GANs allow for the control of the generator at different
levels of detail corresponding to layers in the generator. To some extent this enables the
42
Arjovsky, M., Bottou, L. “Towards principled methods for training generative adversarial networks.” arXiv preprint
arXiv:1701.04862 (2017)
43
Arjovsky, M., Chintala, S., & Bottou, L. (2017, July). Wasserstein generative adversarial networks. In International
conference on machine learning (pp. 214–223). PMLR
44
Schlegl, T., Seeböck, P., Waldstein, S. M., Langs, G., & Schmidt-Erfurth, U. (2019). f-AnoGAN: Fast unsupervised
anomaly detection with generative adversarial networks. Medical image analysis, 54, 30–44
45
Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent
adversarial networks. In Proceedings of the IEEE international conference on computer vision (pp. 2223–2232)
46
Seeböck, P., Romo-Bucheli, D., Waldstein, S., Bogunovic, H., Orlando, J. I., Gerendas, B. S., Langs, G., & Schmidt-
Erfurth, U. (2019, April). Using cyclegans for effectively reducing image variability across oct devices and
improving retinal fluid segmentation. In 2019 IEEE 16th International Symposium on Biomedical Imaging
(ISBI 2019) (pp. 605–609). IEEE
47
Lample, G., Conneau, A., Ranzato, M. A., Denoyer, L., & Jégou, H. (2018, February). Word translation without
parallel data. In International Conference on Learning Representations
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
368 12 Machine Learning
decoupling of different characteristic levels, as long as they correspond to layers in the gen
erator.48
■ 12.7 T
ransformers and Attention
Mechanisms
Natural Language Processing (NLP) has motivated a host of machine learning models.
Among those, so-called “Transformers” have had a huge impact on the entire field of ma
chine learning, and the capability of machine learning models, in general. Initially intro
duced in the context of language processing and sequence to sequence translation, trans
former models use an “attention mechanism” to learn relationships among input data.
Using attention to guide which combinations of input data inform the output has several
advantages over architectures such as RNNs, LSTM, or even CNNs. Consequently, the ap
proach has had impact beyond language processing, such as image analysis.
This section provides an introduction into the model architecture, and an intuition about the
capabilities of attention mechanisms. For a more detailed description of their role in foun
dation models or natural language processing, please refer to Chapters 15, and 17, respec
tively.
In 2017 Vaswani et al. proposed the transformer architecture49, which drastically improved
machine learning models’ abilities to represent and exploit complex long-range relation
ships in data, and greatly scaled their effective encoding capacity. Similar to autoencoders,
Transformers are encoder-decoder models that encode an input to a latent representation
and then decode this representation to an output. However, there are important differences,
and in the following we will explain them one by one.
Transformers take a sequence of so-called tokens (think of tokens as words, parts of words,
or any element in a data sequence) and encodes them to a “latent representation”, which
represents the individual tokens, together with an embedding of the entire sequence. The
decoder takes this representation as input and produces a sequence of tokes as output. This
is similar to RNNs and LSTMs, and the intuition we have built there will help us understand
the innovation of transformers. Note that the principle of breaking sequential data into
tokens can also be used to represent other types of data, such as images, where tokens can
be pixels, or small image patches. But for now, let’s stay with the intuition of processing a
48
Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4401–4410)
49
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia
Polosukhin. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30 (June).
https://doi.org/10.48550/ARXIV.1706.03762.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
12.7 Transformers and Attention M
echanisms 369
sequence of words. Let’s think of the transformer as a model that, for instance, translates
from one language to another.
Let us first focus on the encoder. A transformer consists of many (N) encoder blocks. The
first novelty is how these encoder blocks represent the relative position of tokens. CNNs,
RNNs, or LSTMs have a sense about the relative position of different input elements (such
as pixels, letters, or words) because they are inputs of different parts of the model. The
neighborhood relationship of two adjacent pixels is known to the CNN because they are read
at adjacent input neurons of the network. The relative position of subsequent words in a
sentence is known to an RNN because its recurrence mechanism reads one word after the
other. Making these relationships part of the hard-wired architecture is efficient, but limits
the range and complexity of relationships a model can learn. An alternative is to provide
inputs together with a positional encoding. Together with a data vector, typically an initial
embedding representing the input, a second vector that represents its position forms an
information pair of token data and token position that is processed by the model.
In practice, the positional encoding of the token is simply added to the data representation
vector before the resulting combined representation vector is further processed by a stack
of encoder layers that form the encoder. Each encoder layer consists of two sub-layers. First,
a multi-head self-attention mechanism, then a position-wise fully connected feed-for-
ward network.
The first step in each encoder layer is the multiheaded self-attention mechanism. Atten
tion itself is nothing new, it is a simple weighting of input values. Here, we use it to learn to
which extent the processing of different tokens is relevant for deriving the final output. The
encoder is fed all tokens together with their positional encodings at the same time. The at
tention mechanism is trained to guide the encoder to take specific combinations of tokens –
either because of their position, or their value – into account together. That is, the attention
mechanism enables the network to appreciate relationships among input tokens that matter
for forming the correct output token sequence. It is a central contribution of Vaswani et al.’s
paper, fittingly titled “Attention is all you need”. The way it is achieved draws on the intuition
of finding similar data points, and increasing the weight of their mutual influence if this
similarity is high. That is, the network learns to recognize groups of inputs whose relation
ship is informative for forming the correct output sequence.
Let us have a brief look at the technique used to accomplish this, before discussing what it
means in practice. To train the module that determines the network’s attention to specific
groups of words at specific relative positions, each input (a position-encoded token) is pro
jected into three spaces. The resulting vectors in these three spaces are called the query,
key, and value matrices. We perform this projection h times in parallel – hence the term
“multi-head” – each time with independently learned projection matrices to produce h trip
lets for each input. Now, these values are used to calculate a “scaled dot-product attention”
between pairs of query and key, and the result in turn is used to calculate weighted averages
of the third component, the values.
The resulting output is simply added to the input of the attention sub-layer, and the result is
normalized, and forms the input to the second sub-layer, the position-wise feedforward
network. For each position independently, this network receives the weighted input of the
attention layer. Its output is again added to its input and the result is normalized, forming
the output of the encoder block. In transformers, multiple encoder blocks are stacked on
top of each other, each receiving the output of the former as input.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
370 12 Machine Learning
The decoder of a transformer uses a broadly similar architecture to the encoder. The differ
ence is that, during training, part of the data is masked, so that only past positions can be
taken into account when creating the output. This is accomplished by a masked multi-head
attention layer in each decoder block.
After this rather technical description, let us have a brief look at the intuition of the atten
tion mechanism in transformer models. How does a transformer encoder with an attention
mechanism differ from the encoding of, for instance, a CNN? In a CNN, the relationships
that matter are hard-coded in the filter-size of the convolutional layers. In CNNs, the filters
are trained by adapting the weights of perceptrons whose input comes from a patch of pix
els. An example of such a filter reads from a 5-by-5 pixel patch in the input image. Only pixel
values at positions covered by the same kernel are considered together to form the output
of the layer. The values of the filter learned during CNN training determine how values of
different pixels in a kernel are combined to form its output. One of the rationales of using
local filters in CNNs was “if it is close by it is probably related”. Transformers process a
much larger input than a single kernel, such as a long text, or an entire image. The attention
sub-layer learns to recognize groups of tokens, distributed across this input, which should
be combined to create the output sequence. It may learn this because they are close by (that
is, their positional encoding suggests that they are related), or because the token values
themselves suggest they are related, or both. By learning the projection matrix based on the
input data, the network has more capacity to identify and exploit those relationships, even
if they consist of many components, and span a wider position range in the input data.
50
Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,
Mostafa Dehghani, et al. 2020. “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.”
arXiv [cs.CV]. arXiv. http://arxiv.org/abs/2010.11929.
51
Hatamizadeh, Ali, Greg Heinrich, Hongxu Yin, Andrew Tao, Jose M. Alvarez, Jan Kautz, and Pavlo Molchanov. 2023.
“FasterViT: Fast Vision Transformers with Hierarchical Attention.” arXiv [cs.CV]. arXiv. http://arxiv.org/
abs/2306.06189.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
12.8 Reinforcement Learning 371
three-dimensional voxels. They are used for image registration52 in magnetic resonance im
aging, and results demonstrating the feasibility of predicting the response to cancer treat
ment53 have shown the utility of transformers in multi-modal data analysis.
Later in this book, several areas of transformer application are described in detail, including
in Foundation Models (Chapter 15), Generative AI (Chapter 16), and Natural Language Pro
cessing (Chapter 17).
However, despite impressive results, several limitations remain. The capacity of reasoning
by machine learning models, including transformers, is still a largely unsolved problem.
Quality control in, for example, large language models, is also hard. This can limit their
applicability with respect to critical tasks, such as medical care.
52
Xu, Junshen, Daniel Moyer, P. Ellen Grant, Polina Golland, Juan Eugenio Iglesias, and Elfar Adalsteinsson. 2022.
“SVoRT: Iterative Transformer for Slice-to-Volume Registration in Fetal Brain MRI.” arXiv [eess.IV]. arXiv. http://
arxiv.org/abs/2206.10802.
53
Vanguri, Rami S., Jia Luo, Andrew T. Aukerman, Jacklynn V. Egger, Christopher J. Fong, Natally Horvat, Andrew
Pagano, et al. 2022. “Multimodal Integration of Radiology, Pathology and Genomics for Prediction of Response to
PD-(L)1 Blockade in Patients with Non-Small Cell Lung Cancer.” Nature Cancer 3 (10): 1151–64. https://doi.
org/10.1038/s43018-022-00416-8
54
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
372 12 Machine Learning
Now that we have introduced some key terms, let us dive into reinforcement learning in
more detail. The environment in RL is often modelled in terms of Markov decision processes
(MDPs), a common framework for modeling decision making in situations where an out
come is only partly influenced by a decision maker, and partly by an external random effect:
the environment. We use S to denote the set of possible environments and states in which
the agent finds itself and A the set of the agent’s possible actions. For example, in the case
of an autonomous vehicle, the state S would be described by the position of the car relative
to the center of the lane and the distance from the car in front. The set A would represent
actions including acceleration, braking, turning left, or turning right.
The RL agent interacts with the environment in time steps. At each timestep t, the agent has
a state st, and a selection of actions at to choose from. Having chosen, the decision maker
will be moved into a new state st+1 influenced by the chosen action and the current state,
together with a random element. More formally the new state is decided by a probability
function whose distribution is determined by current state and chosen action. At the same
time it is given a reward rt+1, based on the transition (st, at, st+1). The probability of moving
into the new state, given the current one and the chosen action, is denoted by a state tran
sition function Pa(s, s’). This probability is conditionally independent of all previous states
and actions, given s and a.
As in real life, the immediate reward for an action is not the full story. Our aim is often to
train an agent so that it accumulates rewards over time, instead of only getting the biggest
possible reward right now, at the cost of possibly larger rewards later. For instance, imagine
you were an agent being trained to get from A to B in a city as fast as possible. We design a
reward function in which the reward is high for any timestep (or, more specifically, for any
move within a timestep) which reduces the distance to B. You will learn to navigate a city on
foot, using paths as straight as possible and always in the direction of B. That’s fine, but
what if, a few steps in the opposite direction to B, there was a bus stop with connections
straight to B? How can you make an agent learn to utilize such possibilities, too?
The key idea in RL is thus to optimize the policy for rewards accumulated over a number of
steps in the future, instead of only a single step, assessing the value of states and the value
of actions given a specific state.55 To venture onto paths that do not bring immediate reward,
but accumulate reward over time, training has to involve trade-offs between the exploitation
of an agent’s current policy (“go for the immediately largest reward”) with exploration of
actions that do not bring immediate optimal reward (“lets see what happens later”). While
training is running with this trade-off the agent records the accumulated gain of pairs of
states and actions — the policy — and thereby learns to take those actions with the optimal
long-term value. A more extensive explanation can be found in the book “Machine Learn
ing” by Tom Mitchell56 which is available on the authors web page at Carnegie Mellon Uni
versity 57. The topic is still an area of very active research58, and deep reinforcement learn
ing has furthered the capability of this family of approaches59.
55
Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine learning, 8(3-4), 279–292
56
Mitchell, T. (1997). Machine learning
57
http://www.cs.cmu.edu/~tom/mlbook.html (last accessed November 2021)
58
Liu, Q., Yu, T., Bai, Y., & Jin, C. (2021, July). A sharp analysis of model-based reinforcement learning with self-play.
In International Conference on Machine Learning (pp. 7001–7010). PMLR
59
François-Lavet, V., Henderson, P., Islam, R., Bellemare, M. G., & Pineau, J. (2018). An Introduction to Deep
Reinforcement Learning. Foundations and Trends in Machine Learning, 11(3-4), 219–354
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
12.8 Reinforcement Learning 373
We will finish this section with a prominent example which illustrates how reinforcement
learning overcomes the challenges of previous approaches, specifically, learning policies
via supervised learning. AlphaGO is a deep learning model that made headlines by playing
the game “Go” as well as, or better than, the best human players.60 During a game of Go, in
contrast to chess, the number of possible moves a player can make increases. RL is the only
feasible approach to train an agent to play the game. In the AlphaGo model, the state is the
current board position, and the action is the move the agent makes. You should recognize
from the introduction above that we still need a policy, which is usually a probability dis
tribution of making a certain move, given a certain state. In the case of AlphaGo, the policy
is actually a policy network that determines the next move. It was in learning this policy
network for AlphaGo that RL was able to shine, where supervised approaches failed.
Before work on reinforcement learning began, the policy network for AlphaGo was initiated
by supervised learning, resulting in a supervised learning (SL) policy network. Data for
supervised learning was generated by simply downloading 30 million board positions and
the corresponding next moves by human players from the popular KGS Go Server. The SL
policy network was trained to predict the next move of a human given a board position. It
predicted expert moves with an accuracy of slightly more than 55%, which was at that time
already better than the state-of-the-art. But then, AlphaGo’s developers began experiment
ing with a new, RL-based approach: letting different, intermediate versions of the SL policy
network play against each other, to keep the pool of policies in the training diverse. The re
ward function was surprisingly simple: 1 if the move wins the game right now, –1 if the
move loses the game right now and 0 for all other moves. The RL policy network weights
were then updated at each time step to maximize the expected outcome of the game. Finally,
a value network was trained to estimate a value function of a state (that is, a board position),
as the expected outcome of the game if the agent were to play perfectly from that point on.
The policy network was initially trained on a huge number of state-outcome pairs. However,
this did not work well when sampling from real data: the board positions were so similar to
each other the model experienced overfitting, which is to say that the number of actually
independent training examples was far lower than the samples board positions. Thus,
instead of using real games, the developers generated training data by letting the RL policy
networks play against each other, generating about 30 million different positions sampled
from different games. The result was a fairly well playing policy network, and a value net
work that could judge board positions, in terms of the probability of winning, quite well.
In the last step, these two networks were combined, to improve the policy network even
further.
The story of AlphaGo clearly demonstrates that the ability to simulate data is key, and that
often simulated data outnumbers data available in the real world, and can even supersede
its value for training, due to its diversity. Consequently, the following year the authors went
a step further: they did away completely with any initialization based on observing humans,
instead using RL for generating the entirety of their training data.61
60
Silver, David, et al. Mastering the game of Go with deep neural networks and tree search. nature, 2016, 529. Jg.,
Nr. 7587, S. 484–489
61
Silver, David, et al. Mastering the game of go without human knowledge. Nature, 2017, 550. Jg., Nr. 7676,
S. 354–359
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
374 12 Machine Learning
■ 12.9 O
ther Architectures and Learning
Strategies
In addition to RNNs, CNNs and transformers, there are a large and rapidly growing number
of different network architectures and their corresponding learning strategies. Primary
sources for keeping up to date are the Neural Information Processing System Conference
(NeurIPS)62, the International Conference of Machine Learning (ICML)63 or the International
Conference of Learning Representations (ICLR)64. Almost all publications can be accessed
with Open Access and algorithms are typically implemented very quickly in the common
libraries or made available in other ways.
Learning strategies can particularly focus on the number of examples, and the transfer of
structure from comparable domains is resulting in a diverse set of approaches gaining inter
est. What these approaches have in common is that they tackle situations of limited training
data. They include families of techniques such as few-, one-, or even zero-shot learning.65
62
www.nips.cc
63
www.icml.cc
64
www.iclr.cc
65
Kadam, S., & Vaidya, V. (2018, December). Review and analysis of zero, one and few shot learning approaches.
In International Conference on Intelligent Systems Design and Applications (pp. 100–112). Springer, Cham
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
12.11 Conclusion 375
we split the development data into five equally large subsets. We can either select them
randomly or stratify them based upon their properties, such as labels, or data sources. Then,
during cross validation we cycle through the five folds, and in each cycle, we use one of
them as validation data, and the remaining four others as training data. Thereby every
example is used for validation once, but for each run, there is no overlap between training
and validation data. The final cross validation results can be a good approximation of the
capability of the model and serve as a good source to select parameters, or perform design
decisions. Further subdivisions can be done if enough data is available. For instance, one
can keep the validation set fixed, but instead of using it for parameter optimization of a
particular model, we perform parameter optimization in a cross-validation fashion on the
data, excluding this validation data.
Finally, after design and parameter optimization is finished, the model’s capability is re
ported based on results on the test set. This is necessary since even if the validation set is
not seen during the training of a particular model variant, the observation of validation
results by the developer and subsequent design decisions will still lead to overfitting on this
data.
In the health care area, it has become common practice to require validation on an “external
validation set”, which comes from a different source than the data used for the model devel
opment. This is motivated by the insight that models can pick up particularities of individ
ual clinical centers, and therefore generalize only poorly to other centers. An excellent over
view of this topic is given by Glocker and colleagues66.
Further consideration during validation are matters such as fairness, biases and non-caus
ative factors that might enter the model training, leading to undesirable replication of
sub-optimal decisions that are present in the training data, but not linked to actual relation
ships we want to model. An example are biased treatment decisions, based on uneven
knowledge regarding the effectiveness of treatment in different sexes, which we don’t want
to perpetuate via the use of machine learning models. Here, the first step is to identify these
biases with careful validation designs, and then to develop strategies to counter them in
machine learning models.67
■ 12.11 Conclusion
This chapter gives an overview of the basic approaches and methodological areas of ma
chine learning relevant to data science. Both classification and regression can be viewed as
a mapping of input values to target variables. The two most important components of the
applicable algorithms are feature extraction and the prediction model, which maps the fea
ture vectors to a target variable.
66
Castro, Daniel C., Ian Walker, and Ben Glocker. “Causality matters in medical imaging.” Nature Communications
11.1 (2020): 1-10
67
McCradden, Melissa D., et al. “Ethical limitations of algorithmic fairness solutions in health care machine
learning.” The Lancet Digital Health 2.5 (2020): e221-e223
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
376 12 Machine Learning
■ 12.12 In a Nutshell
Machine Learning
A family of techniques that create models for prediction, classification or
estimation based on training examples instead of hard-coding a set of rules
during development
Supervised Machine Learning
Machine learning based on training examples in the form of pairs consisting
typically of an input and the corresponding output. During training the model
learns to predict the output for new inputs.
Unsupervised Machine Learning
Machine learning based on training examples without known output. Here,
the aim is to find structure in the data in the form of groups of examples,
or relationship networks.
Deep Learning and Artificial Neural Networks
Models consisting of deep artificial neural networks with a large number of
layers. These models can be designed in a wide variety of architectures.
They range from classification networks such as convolutional neural net
works, to networks that map between images such as autoencoders, and
generative models such as generative adversarial networks.
Transformers
Encoder-decoder models that use attention mechanisms for learning long-
range relationships in data. They are used in applications such as language-,
image- or multi-modal data analysis.
13
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
“We propose that a 2-month, 10-man study of artificial intelligence be carried out during the s ummer
of 1956 at Dartmouth College in Hanover, New Hampshire. The study is to proceed on the basis of the
conjecture that every aspect of learning or any other feature of intelligence can in principle be so
precisely described that a machine can be made to simulate it. An attempt will be made to find how to
make machines use language, form abstractions and concepts, solve kinds of problems now reserved
for humans, and improve themselves. We think that a significant advance can be made in one or more
of these problems if a carefully selected group of scientists work on it together for a summer.”
John McCarthy, Marvin Minsky, Nathaniel Rochester and Claude Shannon in 1955
What is AI and how is it different from simply creating machine learning models?
What does it take to create a great AI product?
What are the common traps when designing and developing an AI, and how can
you avoid those traps?
■ 13.1 H
ow AI Relates to Data Science and
Machine Learning
You may be asking yourself, what is the function of a chapter on Artificial Intelligence (AI)
in a book on data science? Often, AI is understood as just a fancy name for machine learning
models, models that data scientists build anyway as part of their job. If that were the case,
AI would simply be a part of data science and there would be no need to write a separate
chapter on AI as the rest of the book is all about that technology. Well, this is not exactly
correct. Although it is true that one of the most important – and perhaps most juicy – parts
of AI is in the machine learning models, there is a lot more to AI than just machine learning.
There are a few critical considerations that one needs to keep in mind when developing an
AI product; considerations which you will not normally find covered in a typical data sci-
ence book, or indeed, even in other chapters of this book. Critically, if you make a mistake
in one of these areas, your final product may disappoint. For example, you may run into a
situation in which everything seems fine early in the process of creation, but the final prod-
uct is underwhelming and does not satisfy the needs and expectations of the end users.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
378 13 Building Great Artificial Intelligence
Let us first see which kind of machines we consider today as being examples of AI. What
may immediately come to mind is perhaps a robot. However, not just any robot. Most robots
are not very intelligent. Robots consist of mechanical components such as arms and actua-
tors. And then there are batteries and sensors. But those alone are not enough to describe a
robot as having AI. There are many robots that are quite useful to us but are plain dumb.
Examples are vacuum cleaner robots at homes and industrial robots on factory floors. What
makes a difference to whether robots will receive the title of being “artificially intelligent”
or not is what they can do autonomously with all their hardware. Only a smart robot, one
with capabilities far exceeding the plain programming of movements, will be worthy of the
honor of being called an AI. We are here looking for a robot that can exhibit a variety of
different behaviors, or be able to find its way in a complex environment, or accomplish tasks
in a variety of novel situations. For example, think of an anthropomorphic robot capable of
clearing up a table full of dirty dishes, then manually washing these dishes and finally,
drying them and putting them into the cupboard – and all that without breaking anything!
Robots with such a level of skill do not yet exist.
To begin creating such a robot it may soon be clear that training deep learning models will
not be enough. One may choose to rely on deep learning to a high degree and yet, the robot
will need a lot more than what deep learning can offer. To foster the required intelligence in
the robot, we will need to create and use technologies much broader than what machine
learning can offer – and also, much broader than what data science covers. Still, data scien-
tists will play a critical role in developing such robots. Hence, you find yourself reading this
chapter.
One type of a robot has obtained significant attention from the industry and also a great deal
of investments: our cars. A lot of money has been poured into making cars capable of driving
by themselves and thus into turning them into intelligent robots. The problem of autono-
mous driving is not an easy one, especially not if the vehicle is driving in the “real world”
and not a controlled test environment. The variety of different situations that the vehicle
may encounter is huge. Hence, such vehicles present a great challenge for the technology.
Perhaps the autonomous driving problem is as difficult as cleaning up a table with dishes.
The pressure for the quality of the solution, that is, not making an error, is high too. While
our manual dish-washing robot may in the worst case break a few glasses or plates, a car
robot carries a much bigger responsibility; it is responsible for human lives. This is an addi-
tional reason that makes a successful self-driving car a tough goal. Nevertheless, there has
been quite some progress in this domain. Arguably, autonomous vehicles are the smartest,
most intelligent robots which mankind has built so far. And yet, there is still work to be
done. The question is then: What did it take to make those machines intelligent? And which
intelligence related problems and hurdles do these machines still face? Is it all simply the
data science of building bigger and smarter models, or is there more to it?
To address these questions, let us first establish that AI does not equal a machine learning
model. To understand that, it helps to make a distinction between a product and a critical
component necessary to build a product. A product is a lot more than just its critical compo-
nents. A knife is more than a blade, although a blade is its critical component. A monitor is
more than its critical component, the screen. A memory stick is more than an SSD chip. A
bicycle is more than a pair of wheels and pedals. In all these cases we note that a product is
more than its critical components.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
13.1 How AI Relates to Data Science and M
achine Learning 379
We can appreciate this difference in the example of a car. A car suitable to sell on the market
and thus suitable to produce value for the customer is more than an engine placed on four
wheels. For a car, one needs a steering wheel and brakes. Yet, this is still not a complete
product. A full product also requires headlights for night driving, a windshield, doors, win-
dows on those doors, wipers on the windshield. One also needs a full cabin with seats. Then
one needs a heating system, air-conditioning, and an entertainment system. All this needs
to be packed into a beautiful design which is pleasing to a human eye. Only after putting all
of this together, are we beginning to have a full product called a car.
An AI is like a full product. It is a machine that does some service for a human and in order
to get this service done in a satisfactory fashion, the machine has to be complete. One must
create a full product. So, a machine learning model may be a critical component for an AI,
maybe the equivalent of what an engine is for a car. Importantly, however, we have an AI
only after we have built a product around that (machine learning) engine.
In practice, as a bare minimum, creating a product will require putting the model into pro-
duction and establishing an interface for acquiring inputs that will go into the machine
learning model and then also generating some form of output. Often, there is a lot more
required to create a useful product. As we have seen in the case of the autonomous vehicle,
there is a lot of hardware needed to create a complete car.
But it is not only “non intelligent” components that one needs to add to machine learning
models in order to create an AI. A deeper reason why machine learning alone is not enough
for AI is that AI solutions are often a lot more complex than what could be achieved by a
single machine learning model. For example, let us consider a chat bot. Let’s assume that all
we need to create, outside of the intelligent component, is a minimal interface consisting of
text fields to enter users’ questions and print the machine’s answers. One may conclude,
then, that it should suffice to place in-between these two components a large, well-trained
machine learning model to do the chatting with a human user. Unfortunately, this is not how
it works. Every elaborate intelligent chatting assistant (think of Alexa, Siri, Cortana, etc.) is
a lot more complex than relying on a single deep learning model.
Below is the architecture of the original Watson AI solution – a machine that made history
in 2010 for winning the game of Jeopardy against the top human players in that game. It is
clear that the organization of this AI was a lot more elaborate than a single machine learn-
ing model. In fact, many of its components do not even rely on machine learning and yet,
they nevertheless contribute to the overall intelligence of Watson. It is necessary to under-
stand that only the machine as a whole is an AI; no single component alone is one. Much of
this overall intelligence comes from the architecture – how the flow of computation is orga-
nized and how it is decided which component will be executed when. Thus, it is not only the
weights in the machine learning models that contribute to the overall intelligence. There is
a lot more, including the rules by which different models mutually interact and help each
other. Only the full combination of all the parts, the Watson, is a full product and is an AI.
Something similar holds for the intelligence of autonomous vehicles. The internal architec-
tures of the algorithms driving the cars are not any simpler than that of Watson. Moreover,
over time, as cars become smarter and better drivers, the number of components and the
internal complexity of overall AI solutions tends to increase.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
380 13 Building Great Artificial Intelligence
Figure 13.1 The architecture of the original Watson AI that won the game of Jeopardy against the
best human competitors (https://en.wikipedia.org/wiki/File:DeepQA.svg)
Importantly, many of the components of such solutions are also not machine learning
models, but employ some other algorithms. These other components may involve searches
through databases, brute-force approaches to finding optimal solutions, pure scientific
calculations, rule-based decision making, and so on. Again, all those components jointly
contribute to the overall intelligence of the AI.
Finally, there is one more reason why machine learning and AI are not the same thing.
Machine learning is often used for purposes other than building intelligent machines.
Machine learning has uses that exceed what AI is meant to do. In particular, machine learn-
ing is often used as a tool for data analysis. The author of this chapter has extensively used
machine learning tools as a means for analyzing how the brain stores sensory information.
We trained machine learning models to read information from brain signals. Critically, what
interested us was not to build a product. Rather, we asked questions about the brain, for
example, how long does the brain hold information about an image that we briefly pre-
sented on the screen? Or, how fast can this information be erased by a newly presented
stimulus? In this way, we generated numerous insights on how the brain maintains sensory
information [1-3]. For pure engineers, such a use of machine learning may come as a sur-
prise. However, for a data scientist, this should not be so unexpected. No scientist should
hesitate from using machine learning algorithms as analytics tools. There are great benefits
from such uses of machine learning, especially in situations in which the data are complex
and insights are difficult to achieve with traditional analytics methods.
To understand the relationship between machine learning and AI, it is common to draw
Venn diagrams, like those depicted in Figure 13.2. The Venn diagram on the left is the one
which can often be seen in AI literature. But the one on the right is more correct, as it also
takes into account the fact that machine learning can be used for purposes other than AI.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
13.2 A Brief History of AI 381
Figure 13.2 The relationship between AI and machine learning (ML). Left: the relationship as
commonly depicted in the literature. Right: A more realistic depiction, showing that machine
learning can be used for purposes other than AI, such as analyzing data. GOFAI stands for
‘Good old-fashioned AI’, which does not employ machine learning.
spoon-fed by humans. This produced a huge bottleneck as often the amount of knowledge
needed to be manually set up was too overwhelming. Therefore, the GOFAI approach to
increasing the intelligence of machines became unsustainable. As a result, many projects
failed to reach the level of usefulness, not moving much further from the initial proof-of-
concept; what worked well on a small scale did not materialize on a larger, more useful
scale.
Today, in the second stage, we overwhelmingly rely on machine learning algorithms to feed
knowledge into machines, transforming it from large datasets into matrices of model param-
eters. These algorithms provide a great relief from manual work. All that humans need to do
is provide examples of intelligent behavior. The machine is then able to extract the rules by
which this behavior is being made.
Obviously, this way, we have achieved a great advancement in our ability to increase
machine’s intelligence. However, it is incorrect to assume that the world has moved away
from symbolic AI and that GOFAI algorithms are history. Not at all. The symbolic approach
is still alive and well. Every complex AI solution created today is a mash of machine learning
and GOFAI components. Symbolic AI is no less important a part. It is only that GOFAI com-
ponents are not being advertised, which has more to do with the current hype and market-
ing strategies than with the facts on how the machines work under the hood. Symbolic AI is
all over the place. Often it is GOFAI who decides which deep learning algorithm to run next.
Other times, GOFAI receives outputs from machine learning models to make the next deci-
sion. In other approaches, machine learning assists GOFAI in finding an optimal solution.
And so on. Often, the two components are nested: a symbolic algorithm calls machine learn-
ing model which in turn calls another GOFAI component for help, going back to machine
learning and so on. The possibilities are limitless. Watson could not win a game of Jeopardy
without GOFAI components. Without using a GOFAI, in 2016 alphaGo could not have won
the game of go against the world champion, Lee Sedol (the score was four to one for the
machine). An autonomous vehicle cannot drive without old-fashioned AI components.
Alexa, Siri and co. cannot engage in a conversation with you without symbolic parts of their
overall intelligence architectures. And so on.
What does it all mean for a data scientist today who is tasked with developing an AI prod-
uct? Very likely, your solution will need to involve a lot more things than just a machine
learning model. There will be a lot of engineering needed outside of machine learning. It
will be difficult to avoid symbolic components. This means you will have to make wise archi-
tectural decisions about the entire solution and these decisions will include a lot more than
just machine learning. Moreover, to create an effective product, you may even need compo-
nents that lie outside of engineering. A good design of the interface for your AI may be as
critical for its success as will the performance of the underlying model. Much like one needs
to add an ergonomic handle to a blade to make a good knife, or needs to provide comfortable
seats to make a great car, your AI will need to evolve in many different dimensions in order
to present a great product. Machine learning models will be just a part of the entire result
and thus, only a part of the entire customer experience.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
13.3 Five Recommendations for Designing an AI Solution 383
■ 13.3 F
ive Recommendations for Designing
an AI Solution
On the way to creating an AI solution, a data scientist will need to make a number of deci-
sions. You, as a data scientist, will necessarily have to create an architecture combining
components of different types, interacting and jointly bringing the intelligence to your
machine. Perhaps you will draw this architecture with multiple boxes and arrows, like the
drawing of the Watson architecture in Figure 13.1. The question is then: Which strategies
can you use and what should you look for to avoid certain common mistakes?
In the previous chapters of this book, you have seen various recipes on how to solve data
science problems. This is all presented to you as individual pieces; for example, as individ-
ual machine learning algorithms. Also, the pieces are shown in an idealized world, indepen-
dent from real life. When you design a real AI – a complete product – you will need to think
about how to pick algorithms for an imperfect world. You will need to think about how to
combine them. Also, you will need to find and use algorithms not described in this book. It
is important not to stick with one set of algorithms just because they worked for you in the
past, or just because this is what you know. Expand your knowledge, as you need it. Pick the
algorithms based on their suitability for a given problem, not based on convenience. Keep
in mind that your new problem will always be slightly different from anything else that you
have seen in the past. Be eclectic in selecting the tool to solve the tasks. Choose from the
widest selection that you possibly can. Do not limit yourself.
Also, stay pragmatic. Your first concern should be achieving the goal. You do not always
need to use the latest algorithms, the hottest and most-hyped tool. Rather, take whatever
works best for the problem at hand. I have seen data scientists falling “in love” with certain
types of models and then playing favorites. But success in data science does not come when
you play favorites. I have seen people trying to solve every problem with the same approach.
There are individuals who expect that everything must be solved with deep learning. I have
also seen die-hard fans of Bayesian approaches. Sure, both Bayesian and deep learning
methods are charming and have some attractive features, giving them unique “super-pow-
ers”. However, both also have disadvantages. In fact, any approach you pick will have some
advantages over others, and necessarily also some disadvantages. Your job is to consider
both sides and weigh the pros and cons in order to make a good choice.
It is paramount to be aware of both advantages and disadvantages of any given method or
algorithm. Disadvantages may be harder to learn about because authors who publish papers
about their new methods tend to focus on the positive aspect. The rosy pictures are what
motivates them to perform the research and write papers in the first place. So, we should
have some understanding. Nevertheless, one still needs to acquire a skill for “reading
between lines” and detecting possible limitations and pitfalls. An experienced data scientist
will be able to smell possible disadvantages of a new method, even if they are not as clearly
spelled out as are the advantages. Develop such a skill, as it will give you a lot of powers for
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
384 13 Building Great Artificial Intelligence
making good design decisions for your AI architectures. The goal is to acquire knowledge
about a lot of algorithms, models, and optimization techniques.
The pool of tools to pick from is huge. A single person can probably never have a full over-
view of the data science field. Acquiring comprehensive knowledge on machine learning
methods and AI algorithms requires life-long learning. And you are never finished. More-
over, the pace with which new algorithms are being proposed is increasing rapidly as more
and more people work on the topic, universities open new AI and data science departments,
and governments funnel more money towards research in the AI. Keeping up with every-
thing that is going on is a challenge. You should never stop learning but also never expect
to know it all.
What helps in navigating this ever-growing forest of new works is a thorough understand-
ing of algorithms. You will be more efficient in understanding a new algorithm if you already
have a deep understanding of a related, existing one. Superficial understanding of methods
is not nearly as powerful. Proper understanding of several different algorithms, each
belonging to a different category, is probably the best strategy one can undertake towards
mastering the field of data science. New algorithms are often related to the existing ones.
Rarely, researchers come up with an entirely novel approach to solve a machine learning
problem (although occasionally they do exactly that). If you understand deeply one algo-
rithm, then it becomes easy for you to quickly grasp the essence of its cousins – they be-
come a variation on the theme. In contrast, if you only superficially understand an algorithm
in the first place, a variation of this algorithm may be a mystery for you, and you may have
difficulties deciding whether this new variation will be helpful for your new problem or not.
One can always try the algorithm on the data and see what happens. There are also tools for
trying multiple algorithms automatically and picking the best one (referred to as autoML).
But this cannot get you far. You cannot develop an autonomous vehicle by randomly trying
different architectures. By building AI, you will have to do good old human thinking – and a
lot of it. In this case you want to minimize decision making by trying the algorithms on your
data. Sure, you will have to do that at some point, there’s no doubt about this. However, what
makes a difference between an experienced AI developer and an inexperienced one is that
the former can achieve the task with more thinking and less trying. Experienced people can
sift through possibilities in their heads, without having to train the algorithm on the data.
The extended knowledge allows them to detect that something is not going to work well even
before they try to make it work. This saves a lot of time.
What else can help you make good decisions? A good idea is to draw your future architecture
before you start coding. Specify the details and try doing mental simulations of the flow of
data throughout the system. At each step ask yourself a question: Do I see a reason this step
would fail or have difficulties? If you do see possible problems, address these problems
immediately. Pragmatic is to address the weakest points first. Do not hope that a miracle
will happen after you spend time working on the easy part.
There is a common belief that with enough computational power and a sufficient amount of
data, anything is possible: that anything can be learned by a machine. Although there is
some truth to this statement, there is also quite a bit of falsehood there, too. Some of these
issues I will address later within this chapter. The bottom line is that blindly following a
strategy of more-data-with-more-computation-power is almost guaranteed to bring you prob-
lems. It is much better to thoroughly clean up your algorithms by using your understanding
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
13.3 Five Recommendations for Designing an AI Solution 385
of statistics, machine learning and AI in general. Keep the faith in big data and computa-
tional power as your last resource.
Certainly, you will need to try out different designs. And you will need to use the results of
these trials as feedback. They will guide you on how to improve. It is vital to realise that your
iterations will be much quicker and much more effective if you understand more deeply
what you are doing.
Thinking is comparatively hard. Coding and running models is comparatively easy. Still, not
shying away from doing the hard part will likely give you the competitive advantage that
you will need to create a product that the market needs and enjoys.
Finally, do not forget that one person does not know everything. Build a team of people with
different topics of expertise. Have everyone contribute; everyone should have a say. Make
sure you get everyone’s talent used towards your final product.
13.3.2 R
ecommendation No. 2: Make it Easier for Machines to Learn –
Create Inductive Biases
There is one simple truth about machine learning algorithms: Some learn faster and better
than others. In some cases, it takes just a few examples to reach high performance. In other
cases, millions of examples are needed. While there are many reasons for these differences,
there is one reason which you have the power to control: One factor that determines the
learning efficiency of an algorithm are its inductive biases. Inductive bias is like a piece of
knowledge added into an algorithm, enabling it to skip some learning steps and walk
quicker and more confidently towards the end. Literally, inductive biases enable algorithms
to jump to conclusions. And if you have inserted the right inductive biases, your algorithm
will jump to the right conclusions, too.
So, what is an inductive bias? It is a predisposition towards finding (i.e., inferring, inducing)
a certain relationship in the data. Inductive biases help the algorithm find a certain rela-
tionship even if the evidence is very weak and would otherwise require going through
millions of data points. Inductive bias is a sort of prejudice to detect a given type of pattern
in data.1 For example, if your mathematical model is made from sine and cosine functions
and you fit mostly the parameters of such functions (e.g., amplitude and phase of a sine),
then your model will likely be able to fit such functions in the data, even with small amounts
of data. In other words, the model will have a bias towards finding a sine wave.
What tricks people into ignoring the importance of inductive biases is that one can in t heory
use the same type of sine-based model to approximate functions other than sine waves. You
could combine millions of sine waves to accurately approximate a power-law function. But
this is much harder. You will need a bigger model – that is, one with a larger number of
elementary sine waves and therefore, a larger number of parameters – and you will need
more data for training.2 This relationship holds for any model and for any data. You can
1
Inductive biases have nothing to do with biases in the data, which is an entirely different problem.
2
Fourier Transform is a tool to assess how complex a sine-wave-based model is needed for a time series. Time
series that are periodical and resemble the shapes of sine-waves can be approximated by simple models. Others
need complex models and many parameters.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
386 13 Building Great Artificial Intelligence
a pproximate almost anything with large enough deep learning algorithms. And you can
achieve similar feats with large enough decision trees (see Section 12.3.3 for decision trees).
There is even a mathematical theorem, the Universal Approximation Theorem3, which
proves that an artificial neural network with only one hidden layer can approximate any
mathematical function provided enough neurons are available in the hidden layer [4]. So,
what is the problem then, if we can approximate anything? Why would we worry about
adding inductive biases if models can approximate any function without them? I have
already hinted at the most obvious problem: If the inductive biases of the model do not
match well with the data, you need a lot of data and a big model and a lot of computation.
This also means more CO2 released into the atmosphere during the training and production
of the model. None of that is good news.
On the other hand, if you add the correct inductive biases, you can reduce the model size.
You can then train it with fewer data points as this leaner model does not fall easily into the
local minima of overfitting4. The advantages of inductive biases are the reason that we have
so many different models. Every problem is a little bit different from any other problem and
can be thus more optimally tackled with a more specialized set of equations. Every problem
has, in theory, a most optimal possible model specialized for just that problem. Hence, we
will never run out of space for inventing new models. The list of all possible models is
infinite; we will never reach the end of this list.
I learned about the power of inductive biases in practice on one occasion where my team
and I wanted to induce overfitting in deep learning neural networks. Our end goal was to
test an algorithm that reduces overfitting in a situation of one-shot learning, and our
approach was as follows: generate an unlimited amount of data for training the one-shot
learning algorithm (see Chapter 17)5, induce overfitting on this dataset, and then ‘save’ the
network from overfitting, using our new algorithm. My idea was to create our ‘unlimited
data’ using one deep learning network with a random set of weights, and then train another
naive deep learning network to learn the same random mappings. We were confident that
we could create overfitting this way, but were proven decisively wrong: We kept reducing
the size of the training data set, but the new network did not want to overfit. The perfor-
mance on the test data remained good, sometimes with as little as 10 or 20 data points. At
first, my colleagues and I were puzzled. How was that possible? These were supposed to be
very hard data to learn, with complex random relationships in a multi-dimensional space.
How could the network learn these relationships with only a small number of examples?
This learning was efficient even when we changed the architecture of the network, the num-
ber of layers and the sizes of each. The ability to efficiently learn the data was robust.
It took a few days for us to realize that the model which we hoped would overfit was ‘doomed’
not to, as it had perfect inductive biases for the data. We used the same ReLu and sigmoid
transfer functions for generating data and for the model that was learning the data, which
basically made the learning model’s job very easy. This illustrated to me how powerful in-
ductive biases can be: the same network may need a million examples to learn something
counterintuitive for its inductive biases, such as recognizing a flower on a photograph, and
only ten examples to learn something that is highly complex for any other model but is
3
https://en.wikipedia.org/wiki/Universal_approximation_theorem
4
https://en.wikipedia.org/wiki/Overfitting
5
One can learn here about one-shot learning: https://en.wikipedia.org/wiki/One-shot_learning
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
13.3 Five Recommendations for Designing an AI Solution 387
perfectly intuitive for this particular network. This is because the network has exactly the
right inductive biases.6
Inductive biases give us a lot of possibilities to play with when developing models. The game
is two-dimensional. One dimension relates to the type of inductive biases: Should we use
ReLu or sigmoid transfer functions, or should we use tangent or even sine waves? This way
we are changing which assumptions the model makes about the world. We can replace one
assumption for another, and by doing so, we change the inductive biases. A linear model
makes a specific assumption about a linear relationship between data. A decision tree
makes yet another assumption. And so on.
The other dimension along which we can play with inductive biases is, how tight are the
assumptions we want to make? We can make a more relaxed set of assumptions, which
basically means having a model with more parameters. We can also make a stricter model,
with fewer parameters. By adding more units (neurons) to a neural network, we are relaxing
its assumptions. Models that are well suited for a given problem, i.e., have exactly the right
set of inductive bases, can often do great work with only a handful of parameters. The big-
gest models today have billions of parameters. These models are quite relaxed: There are a
whole lot of different things that they can possibly learn.
As we mentioned, this has direct implications on the amount of data needed to learn. A
strict model will be able to learn from only a few data points of course, provided that the
inductive bases are correct. If the inductive biases are incorrect, then a small model will
never fit well, no matter how many data points you give it for training. Your only two options
for improvement are either increasing the size of the model (with a corresponding increase
in the data set size), or getting your inductive biases right. Therefore, even with bad induc-
tive biases you can fit data well; all you need is enough parameters and enough data. Deep
learning falls into this latter class of models, not specialized, having relaxed assumptions,
and requiring a lot of data. See Figure 13.3 for the relationship between the number of data
required (expressed as ‘Training effort’) and the strictness of the model (expressed as
‘Specialization’), across different types of models. The strictest models are the laws of
physics. For example, E = mc2 has only one parameter to fit, namely c. Then one can use the
‘model’ to predict E from m.
6
Later I learned that someone made the same mistake as we did and published a whole paper without realizing the
inductive bias issue that we discovered, thereby making the incorrect conclusion that neural networks are not
susceptible to overfitting [5].
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
388 13 Building Great Artificial Intelligence
Figure 13.3 Different models have different abilities to learn. Some require a lot of data and a high
training effort. Others can learn quickly with only a few examples. A model is used optimally if it lies
somewhere on the diagonal: In this case, the right model has been chosen for the task. If the
amount of data you need and the training effort are too large for the given level of specialization,
then you are doing something wrong even if your model is performing well (the white triangle). It is
impossible to have a well-performing model that is both generic and requires a small amount of data
to learn. This can only happen in fantasy and sometimes, naively, data scientists hope to find such a
model.
So, how can you take advantage of this knowledge about inductive biases? You can introduce
such biases into your models to help those models learn better and quicker. This will allow
you to make models smaller, faster and more reliable. You just need to find the right induc-
tive biases. Sometimes, you will need to do the opposite, increase the size of the model and
thus, relax its assumptions. You have to find out what the right approach is for your prob-
lem. In fact, if you have ever performed hyperparameter tuning7, then you have already had
your first experience in adjusting models’ inductive biases. If you have well-structured
validation and train data sets, then you have the chance to find a more appropriate architec-
ture of a model for your data, and thereby improve the inductive biases of your model.
But you can do even more. Nothing is stopping you from creating original, new inductive
biases that nobody ever has created before and that are exactly suited for your problem. You
may be thinking, “but this is not what people do”. You would be wrong. New inductive
biases are created every time a new machine learning architecture is introduced. Practically
every research paper on deep learning that proposes or tests some novel architecture is in
fact proposing or testing a new set of inductive biases. Why would they need to do this? We
have already mentioned the Universal Approximation Theorem, which states that a network
with a single hidden layer can approximate anything. So, why wouldn’t we use such vanilla
deep learning for everything? The problem is that it would require exorbitantly many data
points. To reduce these demands, researchers come up with more specialized models
designed for specific purposes. This is how the ‘deep’ part of deep learning was born, which
7
https://en.wikipedia.org/wiki/Hyperparameter_optimization
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
13.3 Five Recommendations for Designing an AI Solution 389
means stacking more layers to a neural network: Researchers looked for ways to introduce
good inductive biases. Adding more layers did the job. For example, convolutional layers
create great inductive biases for the processing of images. Images tend to have a lot of local
correlations and weak correlations across long distances (see Section 12.6.4), and this is
what convolution takes advantage of. Deep Learning with long short-term memory (LSTM)
cells makes assumptions about temporal dependencies within the input data (LSTMs are
described in Section 12.6.7). And so on. Here too, the rule holds: these inductive biases will
make your model more specialized, and it will be applicable to a smaller range of problems.
For example, if your data have long-range correlations, convolutional models will not be able
to pick up this information. This is often exactly the case with natural languages, where
meaning can be determined by words which are quite distant from one another. Therefore,
for problems such as language modelling, we cannot use convolutional networks but in-
stead, certain other architectures that have inductive biases more suitable for the structure
of language (Natural Language Processing is covered in Chapter 17). Every good inductive
bias that you can come up with will be a blessing for your model. Just keep in mind that you
are making your model more specialized for a specific class of problems.
Some of the most elegant inductive biases exist in the equations used in physics, which we
call then ‘the laws of physics’. Chemistry is also not doing too badly. Scientists throughout
history have made enormous efforts to find perfectly fitting, simple models, as I explained
above with regards to Einstein’s famous formula. In data science, we cannot hope to achieve
that in real life outside of basic science. Still, we can try to approximate it as much as possi-
ble. The rules of logic, which are a part of GOFAI, can help us move towards this goal. There-
fore, combining GOFAI with deep learning can be a lot more productive than either of the
two alone. This is also why we need to build complex architectures for AI. A well-designed
architecture introduces inductive biases into the overall AI solution.
There is one more common activity in data science that is intimately related to inductive
biases. This is feature engineering. When you perform feature engineering, you usually
think of it as transforming the data to work better with a machine learning model that you
are feeding the data into. However, in effect, you are introducing inductive biases. To under-
stand that, think about a combined model, one that is formed by merging your routines for
data transformation with the model to which you feed the data. The two together form a
super-model. This super-model has more stringent and more suitable inductive biases than
the original deep learning model alone. In other words, you may use an off-the-shelf model
that is general and then you add a data transformation algorithm to it to make a new model
that is more specialized. This is why feature engineering is so useful.
Many other tricks to improve the performance of your model are in fact, introducing induc-
tive biases. Various regularization techniques such as ridge, lasso or dropout8 all introduce
specializations into models, each through a slightly different inductive bias. But this nature
of regularization is also why these tools do not always work, and why their advantages tend
to be visible only with a small number of data points. They may have negative effects on
performance with larger amounts of data. Sometimes, it is better to leave the model not
regularized. This is because the inductive biases made by regularization are not the perfect
ones; they are only roughly approximating some data properties. An unrestricted model can
8
https://en.wikipedia.org/wiki/Regularization_(mathematics)
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
390 13 Building Great Artificial Intelligence
learn more fine relations than what regularization can offer, but again, the price is the
amount of data required.
Sometimes, data scientists are not quite aware of these dependencies between inductive
biases, model performance, and amounts of data required. Sometimes, they naively hope
that one could have it both ways: a general algorithm that can learn nearly anything, and do
that with small amounts of data. This is mathematically impossible. Such a super-algorithm
is something like a perpetuum mobile of data science; intuitively it seems possible, but only
once we look into the laws of machine learning we realize that it is, in fact, impossible. Don’t
make this mistake because you will waste a lot of time trying to achieve the unachievable.
Understanding your problem well is the best route towards effectively building useful induc-
tive biases. The effort needed to invest in understanding the problem will always be a trade-
off between the amount of data you have (and the amount of computational power you
require), and the degree to which you understand the problem. The more data (and compu-
tation power) you have, the less you will need to understand the problem and still be able to
create a well performing model. The more your data shrinks, the more you must think.
One cannot understate the importance of analytics. The worst thing you can do after getting
hold of your data is directly jump into feeding it into a model. You need first to understand
your data. You will need to interrogate the variables and the relationships between them. I
don’t have in mind only the basics, such as descriptive statistics, histograms and correla-
tions. I am referring to analytics that will help you decide which model is best for the prob-
lem at hand. Analytics is the best way to come up with great inductive biases for your data.
I want to give another example from my own experience. At one point in my career, I was
tasked with building a model that monitors the market of crypto-currencies and proposes
an optimal portfolio – i.e., an ideal combination of crypto currencies – for reducing risks as
much as possible. In such models the goal is to minimize the ratio between the gain won and
the risk taken. A naive approach to this problem would be to attempt to train a deep learning
model or some other off-the-shelf machine learning model. One reason that this would not
work in my case was that the amount of available data was way too small. The entire history
of the crypto market was only a few years old and there was simply not much data in exis-
tence. Moreover, much like every other market, the properties of the crypto market change
over time. So, whatever regularities you extract during one period, tend no longer to hold
any more in the next. We needed the right set of inductive biases. What made much more
sense is to do something similar to what is being done in classical markets: develop a good
theory of how the crypto-market behaves and describe this in a few simple equations. These
equations would serve as our inductive biases. Luckily, we already had a good starting
point: The classical stock market concepts already existed, such as the Sharpe ratio, effi-
cient frontier, capital allocation line, and others [6].
We performed an extensive analysis of the existing data from the crypto market, with the
goal to investigate the degree to which the rules of the classical financial markets still
applied there. We literally went step-by-step through an introductory book in financial
mathematics [6] and tested for each testable property of the stock market, whether it also
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
13.3 Five Recommendations for Designing an AI Solution 391
held true for the crypto-markets. We found that, while many things were similar across
different markets, the crypto market had its own unique properties. For example, in crypto
markets, volatility of volatility turned out to be important. Based on these insights, we cre-
ated an AI solution that was partly based on novel algorithms that we invented. We created
a whole new type of inductive biases and hence, a whole new type of a model. And the effort
paid off. Our first version of the model already performed well, behaving stably throughout
all the turbulences of the crypto market. The AI occasionally made unexpected choices.
Each time something like this happened, we resorted to analytics to check how and why the
decision was made. Each time it turned out that the AI made a correct decision. The market
was ‘crazy’ and the AI needed to act accordingly. It kept doing everything right. Over time,
we were able to show that our AI performed better than a major index of crypto-currencies.
For me, this was a great example of building a model based on analytics and the consequent
introduction of inductive biases. The blind trial and error we see in, for example, the afore-
mentioned AutoML, cannot beat analytics and careful thinking about the problem. This is
especially true when the amount of data is small.
However, analytics is also useful when you have a lot of data. Analytics can help you decide
which type of model would be the best and can help you engineer good features for it. Here
is an example. In one case my team needed to build a model that would predict when a
coffee machine is likely to fail and need to be serviced. As inputs, we had the logs on a day-
by-day basis containing information from various sensors placed into the machines. Before
we trained any models, we performed analytics about the correlations between various vari-
ables (we used advanced forms of correlation such as ‘scaled correlation’9), which led us to
discover that cumulative variables had a great predictive power. Thus, we created cumula-
tive time series simply by summing up all the values until that time point in the original
time series. This also made intuitive sense: The cumulative values represented the amount
of wear and tear in the machines. And sure enough, after we made the appropriate feature
engineering steps and fed the cumulative variables into a neural network, we got a well
performing model. Had we not done the analytics first, we would likely have missed the
opportunity of transforming the data in such a way.
It is important to note that exploring your data can require even more advanced methods
than financial analytics or correlation structures. As I mentioned earlier, machine learning
needs not be used only for creating high-performing models, but also as an analytics tool. I
described our own use of machine learning to investigate how the brain stored information.
Similarly, you can train simple machine learning models on subsets of data to get a feel for
how the data respond, and to assess which inductive biases they like. In fact, every analysis
performed on data implicitly fits a model to those data. For example, if you compute a
correlation, such as Pearson’s coefficient of correlation, you are not only measuring the
degree of association between two variables, but also fitting a linear model to those data.
The resulting degree of association indicates how well a linear model describes the cor-
related variables. The same holds for other types of analyses. Statisticians long ago recog-
nized the fact that, with virtually everything you do, you are fitting a model. In fact, this is
why they created a Generalised Linear Model (GLM) on the foundation of which countless
9
This is an invention of mine about which one can read on Wikipedia here: https://en.wikipedia.org/wiki/
Scaled_correlation
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
392 13 Building Great Artificial Intelligence
other procedures have been developed (like ANOVA, regression, and factor analysis, just to
name a few).
Use this insight to your advantage. There are numerous ways in which you can fit small
models to subsets of data and get useful insights. For example, you may test different as-
sumptions (different inductive biases). Or you may want to gradually reduce or increase the
number of variables to observe how your predictive power changes. You may gradually in-
crease the amount of data you feed your model, to observe how the performance is affected.
This can give you an idea of how much more data you may need or whether you are close to
reaching a saturation point with the given model. You can gradually add to your data noise
that resembles real life, in order to observe the problems that the model may encounter out
in the real world. And so on. Make a note in the research papers that you read of how the
authors themselves tested the model performance. Copy them. Sometimes, you will need to
be creative. There is an infinite world of interesting tests that one can do with a data set.
The insights you get from such analyses will sometimes be invaluable. They may give you a
critical idea on how to improve your AI architecture. They may tell you under which condi-
tions your autonomous vehicle will work well and under which it will struggle. This may
help you prevent disasters before they happen. It may even assist you in ways I myself have
not yet conceived of. As I said, the importance of analytics cannot be understated.
There are a few traps lurking behind the mathematical complexities of machine learning
problems. Data scientists may fall into those traps and confidently build expectations about
how their model, or more often, their AI solution, will work in the future. You have been
caught in a trap in the past if your project looked all good and promising during the proof of
concept, and if, on the basis of this good performance and by applying a basic logic about
how to create a full product, you (or your company) went on and invested significant effort
and resources only to discover that the full product did not work nearly as well as it should
have, based on the initial plans. In the worst cases, the project needed to be abandoned. In
less bad cases, the approach to the problem needed to be changed. Nevertheless, the
expected time of completion needed to be significantly extended.
Such traps, consisting of a deceivingly rosy picture in the beginning, happen in all types
of engineering projects. Yet AI seems to have more than its own fair share. Consider, for
example, how often the Tesla corporation has been delayed in their development of fully
autonomous cars. The first expectation was 2017, then the deadline kept moving. At the
time of writing this book (2021), the performance is still not satisfactory.10 It always seemed
like one more big step was needed, but then the result did not work to all stakeholder’s
satisfaction, and a next step was needed, which again proved unsatisfactory, and so on.
Another example was IBM’s ambitious attempt to create AI for health care, or an AI-physi-
cian, if you will. This had many troubles and again, what engineers and data scientists
expected to work, did not, or was realized only after significant additional resources had
10
https://en.wikipedia.org/wiki/Tesla_Autopilot
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
13.3 Five Recommendations for Designing an AI Solution 393
been invested. There is still no well-performing “Watson for Oncology”.11 The majority of
such ‘failed’ projects are likely not being advertised, and hence we will never know the true
extent to which they occur in the world of AI.
The trap of seeing too rosy a picture of the future is not a recent phenomenon. The history
of AI is packed with failed predictions made by the most prominent AI minds at the time.
Even Alan Turing, arguably the smartest person ever to work in AI, predicted in 1950 that
“… around the year 2000, computers will be able to answer questions in ways identical to
those of human beings”. Now, 20 years after his proposed turning point, we are still far from
achieving that. In 1958, Herbert Simon, a Nobel Prize winner and a leader in AI and cogni-
tion, together with his colleague Allen Newell, predicted that “Within ten years a digital
computer will be the world’s chess champion”. In 1965, Herbert Simon also stated that
“machines will be capable, within twenty years, of doing any work a man can do.” In reality,
a ten-year prediction about chess turned into 40 years of work, as it was only in 1997 that
IBM’s Deep Blue beat Garry Kasparov. And, as mentioned, we are still not at the point of
machines fully replacing human work. It seems that this human-level intelligence is some-
thing that particularly inspires AI researchers and traps them into making overly ambitious
predictions. For example, Marvin Minsky, widely regarded as one of the fathers of AI, said
in 1970, “In from three to eight years we will have a machine with the general intelligence
of an average human being.” Historically, the next step in AI always turned out to be harder
than what it seemed to the biggest and best minds at the time.
I believe that we can identify today two types of thinking traps which lure people into mak-
ing ambitious promises, and that we can educate ourselves about them. If we understand
these two traps, we will be less likely to make AI promises that we cannot keep, and our
projects will be less likely to fail.
The first trap is what can be called a ‘scaling trap’. This occurs when everything works well
on a small scale, in terms of the amount of data and the size of the model, and we then have
to repeat exactly the same thing on a larger scale, with more data and bigger models. The
intuition is often that, if we made it work on a small scale so well, it should not be too hard
to scale the same approach up to a much higher intelligence. If it took just one month to
teach a robot to avoid two types of obstacles, it should not take much longer to teach it to
avoid hundreds of obstacles. We should expect some sort of acceleration effect over time
whereby, after having already learned so many different obstacles, some generalization
should take place and learning should go quicker for new obstacles. This is what intuitively
makes sense. Similarly, if it took 100,000 example images to train my deep learning model
to accurately distinguish ten categories of objects in photographs, imagine what we can do
with tens of million images. Perhaps, the model should be able to reliably distinguish every-
thing we practically need. Again, some sort of benefit of learning at scale and some effect of
generalization makes intuitive sense: Later, the new objects and items should be added
quickly into the model. There should be a point at some scale of the model at which every-
thing tremendously accelerates.
This is a question of how well the intelligence of machines scales. Increasing the accuracy
or keeping the accuracy but increasing the number of categories distinguished both result
in increasing the intelligence of our AI. So, the question of scaling intelligence is a question
11
https://spectrum.ieee.org/how-ibm-watson-overpromised-and-underdelivered-on-ai-health-care
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
394 13 Building Great Artificial Intelligence
of how many resources we need to add in order to increase the intelligence of the machine
to the next level. To double the intelligence, do we also need to double the resources? That
is, do we need to roughly double the amount of data and the size of the model?
Previously, researchers did not have answers to those questions. Today, we can derive clear
answers from experiments that have been performed with models and data sets of different
sizes [9]. Unfortunately, the answers are not good. While intuitively, we expect the effort to
go down as we increase the proportions of the models and the amounts of data, the opposite
is in fact true. The amount of effort needed to bring a model to the next level increases with
the magnitudes of the models. It takes a lot more effort to increase accuracy of an already
accurate model than of an inaccurate one. Similarly, it takes a lot more effort to add one
more category (and not lose classification accuracy) to a machine learning model that
already knows many categories than to a model that knows only a few categories. Increas-
ing the intelligence of machine learning models is a staircase in which every new step is
higher than the previous one.
This is a hell of a trap for those who do not keep this fact in mind every time they try to
improve the performance of a model. Our intuition will always tell us: there must be some
sort of an easy trick, I just need to find it.
Unfortunately, your mind is misleading you. The intelligence of machine learning models
does not tend to scale that way. Rather, demands on resources for creating intelligent
machines tend to explode. These demands will often grow as per the power law: If you need
to double the intelligence of your deep learning model, it will not be enough just to double
the resources. You will need much more. You may need to raise your needed level of intelli-
gence to some exponent. The truth is that the demands on resources grow with the demands
on intelligence. And this exponent is larger than one, a lot larger than one.
So, here is the scaling trap in a nutshell: i) our small-scale model works well. ii) our minds
compel us to assume approximately linear scaling properties of the problem, iii) we make
rule-of-the-thumb predictions about the resources required to achieve a full product, iv) we
get to work according to these estimates, v) after significant time and effort, we realize that,
although the model can do a lot more than before, its overall value and quality is miles
below what was expected. We have fallen into the scaling trap.
Let us look at some evidence. MNIST12 is a popular data set for training and testing models
to distinguish ten categories of hand-written digits. To achieve 88 % accuracy on this data-
set, it is enough to use a linear model with about 100,000 parameters [7]. However, it is also
possible to achieve about 99.9 % accuracy on MNIST, which is just over a 10 % increase in
accuracy. How? If your suggestion would be to increase the number of parameters in the
model also by about 10 %, you would be mistaken. In fact, the number of parameters needed
is much larger and also, a great deal of additional inductive biases are required. The best
performing model at the time of writing is 99.87 % accurate but requires about 1.5 million
parameters, a whopping increase of 1500 %, and this is a result of 20 years of research into
inductive biases for these types of problems.13 Obtaining the last percentages in accuracy is
increasingly hard. A very small improvement requires a huge effort, not only in the model
12
https://en.wikipedia.org/wiki/MNIST_database
13
https://paperswithcode.com/sota/image-classification-on-mnist
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
13.3 Five Recommendations for Designing an AI Solution 395
size and the training time, but also in data enhancing and in human thinking on how to
improve the overall architecture.
We have made great progress in AI in recent years, but we have also had to increase the
demands of our machines. The sizes of our models grew tremendously over the years. In
fact, they exploded. In Figure 13.4, we see how the computation needs of various models
have grown since the year 1960.14 Note that the scale on the y-axis is logarithmic. This
means that a linear curve in the plot indicates exponential growth in the computational
demand. Moreover, the graph shows two exponential trends: the one before about 2012
roughly follows Moore’s law, which tells us how fast the computational power of our hard-
ware grows. In particular, Moore’s law states that the computational resources double
roughly every two years.15 The other slope, after 2012, has a much steeper growth, with
resources increasing tenfold every year. In other words, the exponent of the curve increased
in recent years. We are clearly building ever more gigantic models – models that do not
follow Moore’s law. The demands that we request from hardware are much larger than the
speed with which the “bare metal” of the electronics improves. As a consequence, in recent
decades the execution of model training had to shift first from CPUs to GPUs of single com-
puters, then from GPUs to farms of GPUs in computational centers. The latest trend is to
build computational centers dedicated just to deep learning models.
Figure 13.4 An analysis made by OpenAI about the amount of computation needed to train a
model and the year in which the model appeared (reproduced from: https://openai.com/
blog/ai-and-compute/)
14
https://openai.com/blog/ai-and-compute/
15
https://en.wikipedia.org/wiki/Moore%27s_law
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
396 13 Building Great Artificial Intelligence
While we are building these gigantic models, the question is whether their intelligence
equally gigantically increases. Do they become tenfold more intelligent every time we
increase their size by a factor of ten times? According to a study by Bianco et al [8], this does
not seem to be the case. In Figure 13.5, we see the relationship these authors obtained
between the accuracy of a model on the ImageNet dataset16 and the model’s demands on
resources, expressed as the number of floating-point operations required to compute the
model. The relationship is similar to that which we saw for the MNIST dataset: We have to
undergo large increases in model sizes only to increase their performance relatively little.
This is not encouraging.
16
https://www.image-net.org/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
13.3 Five Recommendations for Designing an AI Solution 397
This problem of the increase in intelligence with the increase in resources has been most
thoroughly investigated by OpenAI [9]. A core idea of their research has been to extend the
sizes of the data and models, to scales never seen before, and then plot the relationships.
Using loss on the test data set as their measure of intelligence (lower loss indicates higher
intelligence and is covered in Section 9.2), openAI researchers asked how the loss reduces
as a function of resources. They explored different forms of resources: computation in float-
ing point operations, the size of the training data set, and the number of parameters in the
model. And for the first time they went to huge scales. They investigated what happens at
over one petaFLOP, over one billion parameters, and at billions of data points in the training
data set. What they found was always the same relationship – a power law17. The demands
on resources kept growing at the same pace that the demands on the models’ intelligence
increased.
In the original paper, the authors focused on the positive side: that intelligence keeps in-
creasing. The models never stop becoming better. However, another side to this finding, of
which we also must be aware, is that there is never a point at which the power law-driven
demand in resources stops. There is no point at which the model becomes so smart that it
no longer requires ever more resources for every additional step of increasing its intelli-
gence. Therefore, the situation is as follows: Yes, we can grow the intelligence of deep learn-
ing models indefinitely. However, this comes at a price: we need a power law growth in
resources. In practice, then, it seems that we in fact cannot grow intelligence indefinitely.
The indefinite growth in intelligence is only ‘in theory’, only if we had unlimited resources.
But our resources are limited. Hence, we will have to stop at some point. These findings from
openAI tell us that it will likely be impossible to grow the intelligence as far as we would like
to.
We already see clearly how our dependency on ever-increasing sums of resources is depriv-
ing us from using the AI solutions in practice. While state-of-the-art models from only a few
years ago could be implemented and trained on individual computers and even at home, the
best models of today can be played and experimented with only by an exclusive club of
organizations, those that can afford 10s or even 100s of millions of dollars in hardware and
data. This problem is also visible in the technology for self-driving cars. Automotive compa-
nies are able to deploy much more accurate AI solutions for self-driving cars on gigantic
super-computers than what they can do with the relatively miniscule computational units
which will fit under the hoods of individual cars. A part of the challenge of creating truly
autonomously driving vehicles is the power law discovered by openAI: intelligence requires
a disproportionate increase in resources. This is true for training data, computer memory,
and computational power.
Therefore, although the intelligence of deep learning can be scaled, it does not scale well.
And if you are not aware of this problem, you may fall into a trap. The next question is, then,
17
The power law is slightly less explosive than exponential functions and yet, still poses ever-accelerating growth in
demands on resources. An example of an exponential function is bx, where b is the fixed base and x is the variable,
such as the time passed since we started building computers. So, for example, if we double the computational
power every year, b = 2. A power law function would be xn, where n is the fixed exponent. For example, if we solve
relationship problems for x objects in a 3-dimensional space, then the computation time will scale proportionally
with x3. They both “explode” because their derivatives explode: the derivative of an exponential function is another
exponential function; the derivative of a power function is another power function.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
398 13 Building Great Artificial Intelligence
what can we do in practice to prevent falling into that trap? Also, are there remedies to this
problem of scaling?
First and foremost, do not ever promise that you will develop a large-scale functioning prod-
uct based on a small-scale demo. Before you make such a promise, you have to have ad-
dressed the scaling problem. You must ask the question: can I scale it well enough? Is the
power law going to destroy me, or will I have enough resources (data, memory, computa-
tional power)? Be wise about scaling the intelligence.
You may need to do some thinking and analytics before you will know whether your prob-
lem will be able to scale. You can:
1. Assess the exponent of your power law,
2. Reduce your ambitions,
3. Reduce the exponent.
To assess the exponent of your power law, you can do an analysis on a small scale to see how
well your model and data work together. Then you can extrapolate the obtained results to a
larger scale. The extrapolation will give you an assessment on whether you will have enough
resources to achieve the levels of intelligence that you seek. An analysis on a small scale
will include stepwise creation of models of different sizes (or different data sets sizes, or
different accuracy levels) much like openAI did in their analysis. You can then plot these
relationships. You will also need to define intelligence. openAI has used loss as a proxy for
intelligence, which was a convenient measure. Depending on your problem, you may need
another measure. If your goal is not to reduce loss but to increase the total number of cate-
gories that your model is able to distinguish while keeping the loss unchanged, then the
intelligence may need to be defined as the number of categories successfully distinguished
(where ‘successfully’ means at a pre-defined loss). If these estimates tell you that you will
have enough resources to boost your model to the level you desire, then go straight for it.
Collect your data, get your computational power and start training. But if the estimates tell
you that the total amount of resources exceeds what you can afford, then perhaps there is
no point even in trying. You have just computed that this is doomed to a failure. We do not
want to do ‘mindless’ boosting of our resources without first doing such research. Instead,
we want to be smart like openAI was: They performed such research to help develop their
famous GPT-3 model18.
Reducing your ambitions is another sensible step. For example, you may decide to build a
level 2 autonomous driving, not level 4 or 5.19 This will be then what you will promise to
your stakeholders and to your customers. That way, you will ensure that your AI project
does not fail.
Finally, you can also reduce the exponent of your machine learning algorithm. This is prob-
ably the best thing you can do. How does one reduce an exponent? The answer is always the
same: We insert appropriate inductive biases. As mentioned already, research on different
versions of machine learning models is nothing else but attempts to find inductive biases for
a given class of problems. Effectively, what successful attempts do is reduce the exponents
of the power law for a given problem. Therefore, research is what can help you tremen-
18
https://en.wikipedia.org/wiki/GPT-3 or see Chapter 17, NLP.
19
See here for different levels of autonomous driving: https://www.aptiv.com/en/insights/article/what-are-the-
levels-of-automated-driving
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
13.3 Five Recommendations for Designing an AI Solution 399
dously reduce the explosion rate with which the demands on your resources grow. Impor-
tantly, sometimes it will not be enough to read the literature and pick from there whatever
the state-of-the-art is. You may need to do your own research. The reason for this is that your
problem is likely unique and nobody else has really encountered it. Therefore, the inductive
biases that will help you the most are probably the ones that you need to discover on your
own.
To understand that it is sometimes better to invest effort into your own research of inductive
biases than in brute-force training of models, consider the graph in Figure 13.6. The two
curves depict the total resources needed for two different models. One model already exists
and hence does not require any initial research; we can start scaling it immediately. How-
ever, unfortunately, this model does not scale well – the exponent is too high – and this
hurts us only later, after we have already spent much money and time on building a bigger
model. Only later do we realize that we will not be able to achieve our goals with this
approach. Another model requires high initial costs into research of inductive biases, and
hence, it does not seem attractive at the start. However, the research effort reduces the
exponent of the power law, which then has a cumulative effect over time. Furthermore, only
with this approach are we able to build a model that reaches the needed levels of intelli-
gence with acceptable costs. Therefore, although this approach initially seems too expen-
sive and a waste of time and resources, in the long run, this is the only way that works. A
great thing is that there are ways to decide based on quantitative analysis which of the two
strategies we should choose: If an extrapolation of our initial analysis of exponents indicates
that the demands on resources are too high, we’d better choose the research route. Other-
wise, we chose the route of scaling the existing model.
Figure 13.6 Two models, A and B, for which the demands on resources grow with the power law.
Model A has low initial costs but high costs later. Model B has high initial costs due to the research
needed to find the proper inductive biases, but these in turn reduce the exponent of its power law.
Model B thus gives us a better return on investment.
In our (re)search for inductive biases, it is allowed to go beyond deep learning. Deep learn-
ing has inherent limitations that will always follow the aforementioned power law, no mat-
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
400 13 Building Great Artificial Intelligence
ter how wisely we create the architecture of the network or how effective is the learning
algorithm we build. New deep learning approaches will reduce the rate of demand growth
but, nevertheless, this growth will still have an exponent a lot larger than one. If we want to
get rid of power law entirely, we have to get rid of deep learning. This is often possible in
some components of the overall AI solution but in many cases, deep learning is still the best
we can do.
To understand why and how we could possibly get rid of the unfortunate power law in ma-
chine learning algorithms, it is necessary to understand where this power law comes from
in the first place. And for that we have to resort to Lego bricks. Everyone who has played at
some point with Lego will intuitively understand why it is impossible to make a perfect
sphere out of Lego bricks. A brick is a cuboid that has six sides and 12 edges, and its shape
is not suited for making the smooth surface of a sphere. What one can do is combine a num-
ber of bricks to make an approximate sphere. The result will be an approximation with some
error. As Figure 13.7 shows, however, the sphere will still be rough under our fingers.
The roughness of the sphere will depend on the number of bricks used. With fewer bricks,
the sphere will be quite rough; with many bricks, the roughness will reduce. We can make
quite an accurate parallel to deep learning models by taking Lego bricks as a model of a
sphere, the number of bricks as representing the number of parameters in the model and
the shape that we make out of the bricks as representing a trained model. Finally, the rough-
ness of the surface is our loss.
So, the question is: How does the needed number of bricks (the resources) increase with the
reduction in loss (the intelligence)? If you plot this dependence, you will get the same result
as openAI got for deep learning models: the relationship is a power law. Moreover, the expo-
nent of this power law is exactly 3, because we are building a 3D object.
Figure 13.7 A sphere approximated by Lego bricks illustrates the problems machine learning
models face when approximating real-world phenomena. More bricks will make the approximation
smoother but never perfect; as we increase the smoothness, the number of bricks grows as power
law.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
13.3 Five Recommendations for Designing an AI Solution 401
This parallel illustrates quite accurately where deep learning models struggle. Elementary
computation functions such as ReLu and sigmoid functions play the same role as the cuboid
bricks in Lego models. The real world, be it written or spoken language, or images taken by
a camera of an autonomous vehicle, is not made out of ReLu and sigmoid functions. These
functions only approximate the real world, and this poor fit between the two sides is the
ground for a power law relationship. Whenever you have a general elementary component
that can approximate almost anything, be it a Lego brick or deep learning ReLu function,
you will necessarily observe a power law.
That is, unless your elementary component just happens to fit perfectly into the real-world
problem. If you need to approximate another cuboid with Lego cuboids, then the approxima-
tion smoothness will not suffer from the mentioned limitations. Quickly we will come to the
point where the approximation is perfect and smooth – no improvement needed. In the case
of deep learning, we have seen something similar in the case when the ‘real world’ is cre-
ated by another deep learning network; in that situation there is the needed perfect fit. So,
all we need to do is create a model whose elementary units fit the real world better. This is
sometimes the case with for example, decision trees. There are data that can be nearly per-
fectly fitted by a decision tree while at the same time are difficult to approximate with deep
learning.
The most powerful approach is to create your own model, one that is suited exactly for the
data you are dealing with. I have given the example of a model specialized for crypto cur-
rency markets. We can build one for anything. All we need to do is a) understand the nature
of the problem, and b) describe this nature in a form of mathematics or logic. This is all.
The parallel to playing with Lego pieces is using pieces of brick other than cuboids. Lego
started with cuboids but soon after started producing pieces of other shapes. For example,
there are Lego wheels. With cuboids, it is difficult to approximate wheels. So, to enable
children to build cars, they manufactured specific pieces in the shape of wheels. This is
what you need to do with your model. Find out which ‘shape’ is needed and then ‘manufac-
ture’ it. The process of finding out what is needed is analytics, the process of manufacturing
is the formation of inductive biases.
If you do that process well, you can escape the power law. You can create a model that is
perfect for your data. I have personally been lucky enough to have been trained during my
PhD in building such models for data obtained in neuroscience and behavioral science. The
fields of computational neuroscience and mathematical psychology are in desperate need
for building such models.
13.3.5 R
ecommendation No. 5: Beware of the Generality Trap
(there is no such a thing as free lunch)
As you will frequently need to come up with new ideas for inductive biases in your models,
you will likely be tempted to reuse something that already exists out there in the literature.
Someone has already gone through a similar process, tried out an idea and reported that it
was working for them. Why not reuse it for your project?
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
402 13 Building Great Artificial Intelligence
There is no problem in reusing, except that you need to be wary of one more trap that is
lurking for you. This is the trap of apparent generalization. When you observe how seem-
ingly effortlessly this new method works on all the various examples the authors of the
paper show, you will find it hard to resist the idea that it must work for you, too. After all,
the logic of the idea seems solid, and the results show excellent performance on a variety of
tasks. Moreover, it seems that also the intelligence scales well on the data used by the
authors. What could go wrong?
At first it may even appear that the new method finally presents a solution you have been
looking for all your life, a panacea for all your machine learning challenges, something that
works well always and for all problems. Not only will this magical solution help you now, but
also from now on, for all your problems. Certainly, given the way some research papers are
written, there is nothing to suggest otherwise. Often, there is no mention of cases in which
the said method does not work, and nothing is written to indicate that you may be looking
at a method with narrow applicability.
An unfortunate truth of machine learning is that it is mathematically impossible to create
an ultimate algorithm that works for all possible problems. As we have seen above, an algo-
rithm cannot generalize to a wide variety of problems and be effective at the same time. Now
we must expand this wisdom: For every machine learning algorithm exists some data that
the algorithm cannot even possibly learn. I am not talking here about poor scaling. Rather,
I am talking about a total inability to achieve any performance above the chance level.
It may be obvious that a sine wave cannot describe a square shape, or that a linear equation
cannot describe a circle. However, it is not clear that there is something that a deep learning
algorithm cannot learn. The limits of complete AI solutions that rely on deep learning can
be obvious. For example, alphaGo can only play the game of go. Watson can only play Jeop-
ardy. Neither of them can drive a car. A car-driving AI cannot play Jeopardy or go. However,
aren’t we told that an artificial neural network with many ReLu functions can approximate
any mathematical function? It is enough to have just one hidden layer and a machine can
approximate anything given enough data and enough units in the hidden layer (see footnote
3). How can I then say that it is not possible to create a panacea algorithm? Deep learning
in a way seems to be just that. Unfortunately, this is not true.
To understand why no algorithm can do it all and why this applies to deep learning too, let
us go back to our Lego pieces. Are there things in the world that cannot be approximated by
Lego bricks? Sure, you can approximate any solid object, a house, bridge, tree, shape of a
person and so on. But there are a few things that you cannot approximate on any satisfac-
tory level. We already mentioned wheels that kids love to have on their toy cars. A rolling
wheel simply does not work with bricks. Another example is a squishy sponge. In general,
Lego does not work with states of matter other than the solid state. It cannot do fluids, gases,
or plasma. Also, generic Lego bricks are not enough to implement electric circuits as they
isolate but do not conduct electricity, not to mention impossibility to have light bulbs or
electric motors. Therefore, although Lego bricks are able to approximate some aspects of
reality, sometimes with poor scaling capabilities but recogniseable nonetheless, there are
parts of reality they cannot address.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
13.3 Five Recommendations for Designing an AI Solution 403
Similar problems hold for deep learning. Consider for example a pseudo random number
generator such as Mersenne Twister20. You cannot implement such a generator using deep
learning in any satisfactory way. Mersenne Twister has a simple algorithm; the number of
operations is ridiculously small in comparison to any deep learning algorithm and yet, it
generates random numbers practically indistinguishable from true randomness. So, can we
just use Mersenne Twister to train a deep learning network with one hidden layer and
achieve about the same? Unfortunately, no. The Mersenne Twister algorithm is so much
different from ReLu and sigmoid functions that the problem is about as difficult as creating
a wheel or fluid out of Lego bricks. You could not even overfit Mersenne Twister and have
deep learning memorize its sequence. A computer of the size of our Milky Way galaxy
would probably not be enough.
There will always exist problems for machine learning algorithms that they cannot tackle.
Even a most general form of deep learning – a single hidden layer – is not general enough.
The limitations are then even larger for more specialized models that have more than one
hidden layer and have other inductive biases to make them perform better on specific data.
There is even a mathematical theorem proving that this will always be the case: You cannot
create a machine learning algorithm that can work for all possible problems. This proof is
popularly called a “no free lunch theorem”.21 The implication of the theorem is basically that
there is no free lunch in machine learning: If you change the inductive biases of your model
to gain something, you will necessarily have to pay the price; you will at the same time lose
something. What you will lose is exactly the opposite of your inductive bias. If you assume
ReLu transfer function, you will make it more difficult to learn everything non-ReLu; if you
assume sine waves, you will have difficulties with linear functions. And so on. Also, if you
create a more elaborate model that combines sine waves and ReLu, you will still lose some-
thing. You will reduce the capability to learn and hence, you will need more data due to the
larger number of parameters. And so on. This is a game in which you can never win. There
will never exist such a thing as a truly ‘general’ learner algorithm.
This property of machine learning forms a trap for the reason that it is counterintuitive. Our
intuition tells us that, if we think hard enough, we should be able to come up with an algo-
rithm that can do the magic: learn fast and under all conditions. Unfortunately, both the
math and practical experience tell us that is not true. Such a magic algorithm would again
be a form of a machine learning perpetuum mobile – and if we try to build one, we will just
be wasting time.
So, what can we do to protect ourselves from this trap? How do we suppress our seductive
intuition? First, we must stay aware, regularly reminding ourselves that there are no free
lunches in machine learning. A good way to do this is not to ask questions like “Which pow-
erful idea can I come up with that will solve all the problems?”. Instead, we should ask
something along the lines of “Which trade-off can I make; what am I ready to give away and
what can I hope to obtain in return?” The latter type of questions may guide you to make
more realistic design decisions for your AI architectures.
20
https://de.wikipedia.org/wiki/Mersenne-Twister
21
https://en.wikipedia.org/wiki/No_free_lunch_theorem, https://machinelearningmastery.com/no-free-lunch-theo-
rem-for-machine-learning/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
404 13 Building Great Artificial Intelligence
A common trap is in the published work. When we read papers on fresh new algorithms, the
papers often present only one side of the picture. They tell you how well the algorithm
works on their data. But what they typically fail to do is to explain which price they had to
pay for their lunch. Usually, this part is entirely skipped, and one is left with the impression
that the new algorithm only brings benefits and there are no disadvantages to it. As a min-
imum, we have to be wary of that fact. We have to read between the lines and detect by
ourselves the price for the lunch being offered. We have to then know whether we are will-
ing to pay this price – or whether we are at all in the position to make the payment. It is
better to ask this question immediately, while reading the paper, than discovering the same
answer the hard way – after several months of effort with trial and error in unsuccessful
attempts to make the method work for you (although, sometimes there is no other option but
trying things out the hard way).
One thing that helps a lot to encourage such thinking is getting good training in traditional
statistics. Traditional statistics is based on assumptions such as the Gaussian distribution,
linear relationships, independence of sampling, homoscedasticity and so on. These assump-
tions are the inductive biases of statistical methods. For example, GLM is biased to find
linear relationships. By stating assumptions, as is a tradition in statistics, one is basically
stating the conditions under which the method works. The assumption is the price to pay:
your data need to meet the assumptions. In other words, the authors of the paper in statis-
tics say: If your data do not meet these assumptions, nobody knows whether and how well
the method will work for you. In statistics, there is even a whole sub-field of investigating
how robust statistical methods are on any violation of assumptions. Statisticians are quite
disciplined about the prices being paid for lunches. Learning statistics means being trained
in thinking this way.
But we don’t have this type of discipline in deep learning. The published papers usually get
away without pointing out limitations. And this is somewhat understandable. Machine
learning problems are mathematically so much more complex in the background than the
mathematically elegant models of statisticians. It is much more difficult to track what is
going on behind the scenes. But this is also unfortunate and can create an ugly trap for us.22
The history of machine learning algorithms is packed with examples of methods that every-
one got overly excited about at the beginning, only to later realize that these methods are not
a cure-all and that they are, instead, just one of the many tools that you may try to use. For
example, people are now a lot less excited about long short-term memory (LTSM) networks
(see Section 12.6.7) than when they were first proposed. Back then, it was almost expected
that these networks would solve all our problems with time series and temporal dependen-
cies within data. Today, they are only one of many tools available.
Another example is the Adam optimizer23 – a learning algorithm – which at the beginning
seemed like a solution to all our gradient descent-based problems [10]. Over time, it became
clear that one is paying a price, the lunch the optimizer provides is not free. Today, Adam
optimizer is just one of the many tools that we can use to train our deep learning models,
and it keeps improving [11]. What can help you judge whether an algorithm is suitable for
22
It would be nice if all the machine learning articles were required to have a section somewhere towards the end
entitled something like “The price paid for the lunch.” I bet many people would read this section first, in a way
people sometimes read the menus in restaurants starting from the right side.
23
https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
13.3 Five Recommendations for Designing an AI Solution 405
your data is understanding the inner workings of the algorithm. They describe its inductive
biases. If you know these inner workings, you can mentally simulate them and try to think
whether they seem likely to work on your data. You can do that with multiple learning algo-
rithms and then pick the one that seems the best. However, sometimes, we simply must try
out various options to discover which is the most appropriate.
Learning algorithms are generally an interesting topic when it comes to the lack of free
lunches in machine learning. One must keep in mind that deep learning does not constitute
only the transfer functions like ReLu and the network architectures. A critical component is
the learning algorithm that sets up the weights of the network. As a rule, these algorithms
are quite limited. We usually rely on some form of gradient descent (see Section 9.2), but
gradient descent requires that data have certain properties, and has its own inherent prob-
lems, the most notorious being the fact that it tends to get stuck in local minima and usually
does not find the global minimum. This means that our learning algorithms almost never
find the best solution possible. An interesting fact is that we have an algorithm that can find
the best solution (the global minimum) and this algorithm is even simple to code into a
computer program. The algorithm relies on a brute-force search of all possible states of the
parameters. The only problem is the lunch price that needs to be paid: the computational
time would be way too long. For real, practical problems, we are talking here about age-of-
the-universe long. Therefore, we have to undergo trade-offs and get thinking, using our
human intelligence to come up with some smart way of learning with an affordable price.
It would be naïve to think that one could train a deep learning network to act as a learning
algorithm. We have to do something else, something outside of deep learning. To under-
stand that, let us go back to our Lego world. In Lego, the equivalent of learning is the
process of arranging Lego pieces into the form of the final object that we want to build.
Normally, it is a human child or adult who would do the arranging process (adults also love
to play with Lego even if they don’t want to admit it). But consider a situation in which you
need to create a machine to arrange Lego pieces. For example, you may give the machine a
photograph of an object and ask the machine to build such an object. Now comes the key
question: can you stay loyal to Lego and use Lego bricks to create the assembly machine?
There is a problem right there. For one, Lego bricks don’t even remotely have the capability
of doing something like robotic manipulation of pieces, as would be required.24 Similarly, in
deep learning, you have to jump out of the ReLu and sigmoid functions to train ReLu and
sigmoid functions. The learning algorithm must know how to do exactly the things that the
ReLu does not know how to do.25
And this brings us to our final and perhaps ultimate thought about how to defend ourselves
against the no-free-lunch-theorem. We have to use our minds to do great engineering. We
must add pieces of algorithms that jump out of the world of the other algorithms that we are
already using. In the end we will have, as I already stated multiple times, an elaborate
machine containing multiple pieces. Some of them will rely on machine learning. Others
will be best described as GOFAI. You need both and you will have both. Finding a good solu-
24
The reason my comparisons with Lego bricks work so well in this text is because I am not merely making a
metaphor here. Lego is as much a model of the world as is any other mathematical model. I am hence comparing
two equals: Lego base model to deep learning based model.
25
For those who want to go deeper into the issues of assembly machines and what would it take for machines to
assemble themselves without human help, I suggest reading about the theory of autopoiesis (meaning something
like ‘self-assembly’) and the theory of practopoiesis (meaning ‘assembling by acting’).
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
406 13 Building Great Artificial Intelligence
tion that works for your customers will partly require you to use your knowledge on existing
algorithms and partly it will be a work of art. And, as always, we need a pinch of luck as we
cannot control everything. I wish you good luck.
Similarly, we are still yet to find an AGI paper that acknowledges the existence of the
no-free-lunch-theorem and designs its research efforts around this. The human brain must
be paying some price for the intelligence it gains. So must any AI approach that mimics this
intelligence.
Cognitive science knows very well a few (expensive) prices that we, the owners of the
human brain, get invoiced for on a daily basis. For example, our short-term memory (a.k.a.,
working memory) is notoriously incapable of memorizing more than a handful of pieces of
information. Try to memorize a random sequence of 12 digits in a single trial. If you haven’t
been specifically trained for such tasks, you can’t do it. For machines, of course this is triv-
ial. This inability to memorize random information is the price we pay for the intelligence
we have. The reasons for this lie in the fact that our minds rely heavily on concepts. Another
example of the price we pay is that we easily get distracted. Machines can stay focused
within their for-loops until the tasks are finished. Our minds cannot do that.26 Therefore, if
you see a proposed AGI approach that does not have a similar price list as the human mind,
perhaps this is not really going in the right direction.
I haven’t yet seen an approach towards achieving human-level intelligence that would
satisfy these criteria – one that would be able to deal with the power-law scaling problem
and would consider the no-free-lunch-theorem.27 This leads me to conclude that the word
‘general’ within the term AGI is an unfortunate choice. From what we have seen before,
there could exist no such thing as ‘general’ intelligence. If you want to get somewhat gen-
eral, you have to pay the price for having to train the algorithm with vast amounts of data.
And it is impossible to be absolutely general due to the no-free-lunch theorem. An example
of the price paid for some generality is one of the most advanced deep learning models for
natural language processing, called GPT-3 (see Chapter 17). The amount of data on which
this model was trained is probably a million times bigger than the amount of language ex-
posure a human person gets in their life28. And yet, humans can be more creative in using
language than can GPT-3. But we also pay a price for it: we cannot store into our memory the
vast amounts of information that GPT-3 can. We also cannot generate texts even close to the
same number and speed as the machine can. We are slower. We have to think. Having to
take the time to think is also one of the prices that our brains pay for its intelligence.
GPT-3 is somewhat general too – in some ways a lot more general than the human mind.
Still, this deep learning monster cannot do basic things that the human mind can. There-
fore, using the term ‘general’ for an AI approach that has the ambition to mimic the human
brain is misleading: It gives the researchers the idea that they should be looking for some
sort of algorithm that is general and can learn anything. But this is
1. mathematically not possible due to the no-free-lunch theorem,
2. not sustainable due to the power law increase in resources, and
3. clearly not how the human brain works.
26
More information on the differences between human and artificial intelligence can be found in this talk of mine:
https://bit.ly/3tJyScg. Also, a talk on my two cents on what we need to pay attention to in the brain:
https://bit.ly/3hzA9h4
27
One more issue that needs to be resolved when achieving human-level intelligence is the problem of the generated
variety of responses, which I did not discuss in this chapter but one can read about in [12].
28
GPT 3 was exposed to some 260 billion tokens. In contrast, there are only about 20 million seconds in a person’s
lifetime.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
408 13 Building Great Artificial Intelligence
I think that, considering the power law and the no-free-lunch theorem, it is clear that who-
ever is looking for a general algorithm is looking for the equivalent of a perpetuum mobile
in machine learning. General intelligence seems intuitively possible, but unfortunately can-
not exist. Don’t get lured into this trap during your own efforts when creating an AI product.
You are guaranteed to end up at a dead end. There is no such thing as general intelligence
and likely will never be. This also means that we humans do not possess a general intelli-
gence: We are good at solving certain types of problems and horrible at solving many other
types of problems.
■ 13.5 In a Nutshell
Figure 13.8 This devil’s cycle is what makes machine learning difficult and each
successful AI product a piece of art
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
References 409
References
[1] Nikolić, Danko, et al. “Distributed fading memory for stimulus properties in the primary visual
cortex.” PLoS biology 7.12 (2009): e1000260
[2] Nikolic, Danko, et al. “Temporal dynamics of information content carried by neurons in the prima-
ry visual cortex.” NIPS. (2006)
[3] Lazar, Andreea, et al. “Visual exposure enhances stimulus encoding and persistence in primary
cortex”, PNAS, (2021)
[4] Hecht-Nielsen, Robert. “Kolmogorov’s mapping neural network existence theorem.” Proceedings
of the international conference on Neural Networks. Vol. 3. IEEE Press New York, 1987
[5] Lawrence, Steve, C. Lee Giles, and Ah Chung Tsoi. “Lessons in neural network training: Overfit-
ting may be harder than expected.” AAAI/IAAI. 1997
[6] Bodie, Zvi, and Alex Kane. “Investments.” (2020)
[7] LeCun, Yann, et al. “Gradient-based learning applied to document recognition.” Proceedings of the
IEEE 86.11 (1998): 2278-2324
[8] Bianco, Simone, et al. “Benchmark analysis of representative deep neural network architec-
tures.” IEEE Access 6 (2018): 64270-64277
[9] Kaplan, Jared, et al. “Scaling laws for neural language models.” arXiv preprint arX-
iv:2001.08361 (2020)
[10] Bae, Kiwook, Heechang Ryu, and Hayong Shin. “Does Adam optimizer keep close to the optimal
point?.” arXiv preprint arXiv:1911.00289 (2019)
[11] Kingma, Diederik P., and Jimmy Ba. “Adam: A method for stochastic optimization.” arXiv preprint
arXiv:1412.6980 (2014)
[12] Nikolić, Danko. “Why deep neural nets cannot ever match biological intelligence and what to do
about it?.” International Journal of Automation and Computing 14.5 (2017): 532-541
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
14
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
Signal Processing
Jana Eder
■ 14.1 Introduction
A signal is a representation of data over time or space, such as an audio signal capturing
the variations in sound waves over time, or an image signal representing the distribution of
light intensity and color across a two-dimensional space (see Figure 14.1). In mathematical
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
412 14 Signal Processing
terms, a signal x(t) can be represented as a function of one or more independent variables,
where:
For an audio signal, s(t) represents the variation of pressure (sound waves) over time t.
For an image signal, I (x, y) represents the intensity or color of the image at a spatial posi-
tion (x, y).
Signal processing plays an important role in data science by enabling the transformation
and feature extraction of raw data to allow valuable insights for decision making or prepar-
ing data for machine learning models. It includes techniques for enhancing data quality by
filtering out noise and extracting relevant features, thereby facilitating clearer insights and
more robust decision-making.
In the context of machine learning, algorithms learn patterns from data to make predictions
or decisions. However, the presence of noise or irrelevant information can significantly im-
pair the learning process, leading to inaccurate outcomes. Signal processing addresses
these challenges by pre-processing the data—removing noise, highlighting important fea-
tures, and normalizing data formats.
For instance, in applications such as speech recognition or image analysis, signal process-
ing ensures that only the most relevant and clean data is presented to the machine learning
model. This step is essential because it directly impacts the algorithm’s ability to learn ef-
fectively and make accurate predictions. Thus, signal processing serves as a bridge between
raw data and machine learning, ensuring that data scientists can unlock the full potential
of their models.
Aliasing: If the Nyquist-Shannon sampling theorem is not fulfilled, aliasing occurs. In more
detail, the signal is undersampled below the Nyquist rate, causing higher frequency compo-
nents to be indistinguishably mirrored into lower frequencies. This mirroring effect leads to
distortion in the signal’s reconstruction, making higher frequencies appear as their ‘mirror
frequencies’ in the lower spectrum. Figure 14.2 c) illustrates the distortion resulting from
insufficient sampling frequency fs, a phenomenon called undersampling. Such distortion
leads to complete different sampled signal, making it unusable for machine learning or
signal processing algorithms. No existing algorithm or machine learning approach can re-
store an undersampled signal to its original form as shown in Figure 14.2 c).
1
Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal. 27 (1), 379–423,
623–656.
2
Shannon, C. E. (1998). Communication In the Presence of Noise. Proceedings of the IEEE. 86 (2), 447–457.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
414 14 Signal Processing
Quantization introduces a quantization error due to the approximation of the signal ampli-
tudes to the nearest quantization level. By combining sampling and quantization, a contin-
uous-time, continuous- amplitude signal is converted into a discrete-time, discrete-ampli-
tude signal, which can then be processed, stored, and transmitted by digital systems (see
Figure 14.2). The digital representation of the original signal can be expressed as a se-
quence of quantized samples:
where n is the sample index, is the sampling period (the time between samples),
and s[n] represents the digital signal resulting from the sampling and quantization pro-
cesses.
In quantization, the number of bits determines the resolution. A higher bit depth allowing
for finer gradations between the minimum and maximum values that can be digitally rep-
resented. For example, with 3 bits, the number of quantization steps is 23 = 8 providing
eight discrete levels for representing the signal (as shown in Figure 14.2).
Quantization Example: To illustrate quantization, we consider an electrical signal with a
5-volt (signal range of −5V to +5V a total range of 10V) amplitude. The example in Ta-
ble 14.1 shows how the signal is quantized with different bit depths: 4-bit, 8-bit, and 16-bit,
and what the quantization step size Δ is for each. Δ is calculated as ,
where the number of levels is 2bits.
Quantization Error: Quantization error is the difference between an analog signal and its
digital approximation due to finite resolution. As the resolution increases, which is deter-
mined by the number of bits, the quantization error decreases.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
14.2 Sampling and Quantization 415
Figure 14.2 Digitalization of an analog signal using a microphone: a) sampling of the analog signal
with sufficient sampling frequency fs (green) and insufficient fs (red). The insufficient sampling fre-
quency leads to aliasing. In b) quantization is shown. The resolution of the quantization is deter-
mined by the number of bits. In c) the final discrete (digital) signal is shown after sampling and quan-
tization for the sufficient (green) and insufficient (red) fs which is distorted due to aliasing.
The frequency domain enables us to compresence information that is complex in the time
domain in a simpler way in the frequency domain. The basic idea is that every signal, no
matter how complex, is created from a sum of sine and cosine signals. The frequency with
the corresponding amplitude is then mapped in the frequency domain (see Figure 14.3).
Complex mathematical functions such as convolution (the most fundamental operation in
convolutional neural networks) in the time domain is simple point wise multiplication in
the frequency domain3. We can use this to significantly speed up our (machine learning)
algorithms.
Further, the frequency domain helps to uncover repetitive patterns and underlying frequen-
cies in data, enabling feature extraction and noise reduction in machine learning models.
For example, in analyzing stock market trends, it can isolate cyclical behaviors from the
noise, aiding in the prediction of future movements.
The Fourier Transform (FT) is a mathematical transform that decomposes a function of time
(a signal) into its constituent frequencies. The transform is defined by the formula:
3
Oppenheim, A. V., & Schafer, R. W. (1989). Discrete-Time Signal Processing. Prentice Hall.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
14.3 Frequency Domain Analysis 417
where f(t) is the original time-domain signal, F(w) is the signal in the frequency domain, ω
is the angular frequency with w = 2pf, and i is the imaginary unit. While the FT shows the
transform of an analog signal, the Discrete Fourier Transform (DFT) is a discrete equivalent
(digital signal) of the continuous Fourier Transform used on digitized which means sampled
and quantized signals. The fast Fourier transform (FFT) is an algorithm for efficiently calcu-
lating the discrete Fourier transform (DFT).
Figure 14.4 Complexity Comparison between Discrete Fourier Transform (DFT) and Fast Fourier
Transform (FFT).
Big O notation indicates how the completion time of an algorithms increases with data size.
O(n) is efficient, growing linearly with data; O(n2) less so, growing quadratically. O(log n) is
highly efficient, with minimal increase even as data size strongly increases5. With notable
increases in computational performance, the FFT has been applied extensively to tasks like
matrix multiplication and vector convolution. Whether you’re talking on the phone, watch-
ing TV, streaming music, using GPS navigation, or browsing the internet on a wireless de-
vice, the FFT plays a crucial role in processing the signals that make these technologies.
4
Cooley, J. W., and Tukey, J. W. (1965). An algorithm for the machine calculation of complex Fourier series
“Mathematics of Computation”, vol. 19, no. 90, pp. 297–301.
5
Knuth, Donald E. (1976). Big Omicron and big Omega and big Theta in “ACM SIGACT News”, 8 (2), 18–24.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
418 14 Signal Processing
While this operation is straight forward, it can take up a lot of computational time. Espe-
cially if n or k are quite large.
Simplification in the Fourier Domain: Applying the FFT to both s[n] and h[n], transform-
ing them into S[k] and H[k] respectively, the convolution of s[n] and h[n] in the time domain
translates to the multiplication of S[k] and H[k] in the frequency domain:
This transformation means that to perform convolution between two signals, we can:
1. Take the FFT of both signals to move them into the frequency domain.
2. Multiply the FFTs.
3. Take the inverse FFT of the product to convert it back into the time domain.
Example 2: Convolution in Spatial Domain - Images
In image processing, convolution is the process of applying a filter (or kernel) to an image.
This kernel is a small matrix that transforms the image, modifying the central pixel’s value
based on the weighted sum of its neighbors. The purpose of a kernel can be for example
sharpening, edge detection or noise reduction. While in traditional image processing those
filters where selected by experts, in convolutional neural networks those filters are learned
by the networks itself.
For an image I and a kernel K, the convolution operation at a pixel (x, y) is defined as:
This process is computationally intensive, especially with large images or kernels, as it in-
volves multiple arithmetic operations for each pixel in the image.
Applying the Fourier Transform to both the image and the kernel transforms them into the
frequency domain, represented as F(I) and F(K) respectively. In this domain, the convolu-
tion operation simplifies to a point-wise multiplication of the two transformed matrices:
This property significantly reduces computational complexity. The steps to apply a kernel to
an image via convolution in the frequency domain become:
1. Transform the image II and the kernel K to the frequency domain using the Fourier
Transform, resulting in F(I) and F(K).
2. Multiply F(I) and F(K) elementwise.
3. Apply the inverse Fourier Transform to the product to convert it back into the spatial
domain.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
14.3 Frequency Domain Analysis 419
14.3.1.3 Windowing
The FFT is a powerful tool for analyzing the frequency content of signals. However, applying
FFT directly to a finite segment of a continuous signal can introduce artifacts, such as spec-
tral leakage, which can distort the analysis. In theory, signals are infinite, but if we are
truncating a signal and apply an FFT to it, that truncation is equivalent to multiplying it by
a rectangular window of the clip size. Which corresponds to a convolution with the window
function in the frequency domain having poor spectral properties (see Figure 14.5). Win-
dowing is a technique used to mitigate these effects by decreasing the signal at its boundar-
ies before performing the FFT7.
Spectral Leakage: Without windowing, the FFT assumes the signal is periodic and repeats
indefinitely. This assumption rarely holds for real-world signals, leading to leakage. In Fig-
ure 14.5 a) visualization of the most common window functions can be found. They are
further described below:
Rectangular Window: Offers no windowing effect, equivalent to truncating the signal with-
out any smoothing. It provides the best resolution but the worst leakage.
Hanning and Hamming Windows: These windows taper the signal to zero at the end-
points, significantly reducing leakage at the expense of slightly reduced resolution.
Blackman Window: Offers even better leakage reduction by using a more complex tapering
formula, further smoothing the signal’s edges.
Figure 14.5 Visual overview of most window functions in signal processing in a) time domain and
b) frequency domain.
6
Heistracher C. et al. (2022). Minimal-Configuration Anomaly Detection for IIoT Sensors in “Data Science – Analyt-
ics and Applications”, Springer, 4 86–89.
7
Harris, F. J. (1978). On the use of windows for harmonic analysis with the discrete Fourier transform. Proceedings
of the IEEE. 66 (1), 51–83.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
420 14 Signal Processing
# Perform FFT
fft_original = fft(signal)
fft_windowed = fft(windowed_signal)
Figure 14.6 Example of the application of a window function: a) original signal overlaid with a
Hanning window, b) windowed signal resulting from a multiplication of original signal and window,
c) FFT of original signal – with spectral leakage and d) FFT of windowed signal with reduced spectral
leakage.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
14.3 Frequency Domain Analysis 421
Interpretation: The FFT of the windowed signal provides a cleaner spectrum, with reduced
leakage effects, making it easier to identify the frequency components (see Figure 14.6).
This comparison illustrates the significance of windowing in signal processing, especially
before performing FFT, to obtain a more accurate frequency domain representation.
Windowing mitigates spectral leakage because it gently reduces the signal to zero at its
boundaries, minimizing the abrupt discontinuities that cause the leakage. This makes the
signal appear more periodic to the Fourier Transform, reducing the spread of energy across
unrelated frequencies.
In Machine Learning: Employing windowing techniques when preparing data for a ma-
chine learning algorithm that utilizes the FFT can significantly enhance the analysis by
minimizing leakage-induced distortions. Without windowing, the algorithm risks prioritiz-
ing these distortions, leading to inaccuracies in signal interpretation.
Low-pass filters are typically used to remove high-frequency noise from signals and to pre-
vent aliasing during the digitization process. An application example is shown in Fig-
ure 14.7.
High-pass: High-pass filters operate inversely to low-pass filters, attenuating frequencies
below a certain cutoff frequency fC , and allowing those above fC to pass. The transfer func-
tion of an ideal high-pass filter is given by:
These filters are useful for eliminating low-frequency noise or drifts in signals and for em-
phasizing high-frequency components.
Band-pass: Permit only a specific range of frequencies to pass through, attenuating signals
outside this range. They can be viewed as a combination of low-pass and high-pass filters.
The transfer function of an ideal band-pass filter is:
8
Oppenheim, A. V., & Schafer, R. W. (1999). “Discrete-Time Signal Processing.” Prentice Hall.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
14.4 Noise Reduction and Filtering Techniques 423
where fl , and fh , denote the lower and upper cutoff frequencies, respectively. Band-pass fil-
ters are indispensable in isolating specific frequency bands, particularly in communica-
tions and audio processing.
Figure 14.7 Visualization of the three main filter types in the frequency domain: low-pass, high-
pass and band-pass.
In Figure 14.7 frequency domain filtering are shown based on sigmoid functions. Practical
filters have a transition, or ‘transfer,’ band where the signal attenuation transitions gradu-
ally, not abruptly, leading to some signal distortion.
A low-pass filter is used for noise reduction by allowing only frequencies below a certain
threshold to pass through while attenuating (reducing) the higher frequencies that consti-
tute noise. In many signals, especially in audio or data signals, the useful information is
contained within the lower frequency range, whereas high-frequency components often
represent noise or unwanted sharp variations.
For example, in audio processing, a low-pass filter (kernel) can smooth (see Figure 14.8) out
high frequency hiss or static without significantly affecting the clarity of speech or music,
which mostly resides in the lower frequency range. Similarly, in image processing, applying
a low-pass filter helps in blurring and reducing high frequency noise, such as random
speckles, leading to a smoother appearance of the image.
Figure 14.8 Example of a gaussian low-pass filter (Kernel) for signals (left) and Images (right).
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
424 14 Signal Processing
Figure 14.9 Denoising a Signal using a Gaussian Low-pass Filter (Kernel). Visualization shown in
the time domain.
Figure 14.10 Denoising an Image using a Gaussian Low-pass Filter (Kernel). Visualization shown in
the spatial domain.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
14.5 Time Domain Analysis 425
Signal normalization and standardization are two fundamental preprocessing steps in sig-
nal processing and machine learning. They are essential for modifying the features of a
signal to have a specific scale or distribution, which can significantly improve the perfor-
mance of algorithms.
Normalization: Normalization transforms signal values to a defined range, often between 0
and 1 or -1 and 1, standardizing the scale without distorting differences in value ranges. A
commonly used formula to scale values between 0 and 1 is:
where x is an original signal value, xmin and xmax are the minimum and maximum values in
the signal, respectively, and xnormalized is the normalized value.
Standardization: The signal is changed to have a mean of 0 and a standard deviation of 1,
which is commonly accomplished with Z-score normalization. This is especially useful for
algorithms that presume a normal distribution of the data. The standardization formula is:
where x is an original signal value, m is the mean of the signal, s is the standard deviation
of the signal, and xstandardized is the standardized value. Practical Application: Normalization
and standardization are essential in areas sensitive to the scale of data, such as optimization
algorithms based on gradient descent and nearest neighbors’ algorithms. These preprocess-
ing steps not only facilitate the learning process but also enhance the accuracy of predictive
models by ensuring uniformity in scale across features.
For instance, in image processing, normalization might adjust pixel intensity to fit within a
specific scale, while standardization could normalize audio signals for speech recognition
models, ensuring no bias arises from variance in loudness.
Descriptive statistics are helpful for summarizing and understanding the underlying charac-
teristics of signals in various domains, including engineering, finance, and biomedical re-
search. Signals, which can represent anything from electrical impulses to stock market trends,
encapsulate information that is often analyzed using statistics to extract meaningful insights.
Central Tendency Measures:
1. Mean (m): The arithmetic average of the signal values, calculated as , where
N is the number of observations and xi represents each value in the signal.
2. Median: The middle value when the signal values are arranged in ascending order. For
an even number of observations, it is the average of the two middle numbers.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
426 14 Signal Processing
Variability Measures:
3. Range: The difference between the maximum and minimum values in the signal, calcu-
lated as .
4. Variance (s2): A measure of the dispersion of the signal values around the mean, cal
culated as
5. Standard Deviation (s): The square root of the variance, giving a measure of the spread
of the signal values,
6. The Interquartile Range (IQR): is calculated as the difference between the 75th per
centile (Q3) and the 25th percentile (Q1) of a dataset. The formula for the IQR is:
Here, logb denotes the logarithm to the base b. If the base b = 2, the unit of entropy is bits,
indicating the average number of bits needed to encode the information content of the sig-
nal s.
Visualization: Signal analysis often requires the visualization of these statistical measures
to better understand the dynamics and behavior of the signal over time. Histograms, box
plots, and time-series plots are commonly used graphical techniques that offer a visual
summary of the signal’s descriptive statistics. These plots can highlight anomalies, trends,
and patterns that might not be immediately apparent from numerical measures alone.
1. Box plots: are useful for comparing several data sets at once. They show the center and
spread of a unimodal data set in a simple, standard format. The plots describe the center
and spread of a data set along one numeric variable using five summary values: the min-
imum, first quartile, median, third quartile, and maximum.
2. Histograms: Show the median, interquartile range, and outliers, providing insights into
the spread and central tendency of the data.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
14.5 Time Domain Analysis 427
3. Violin plots: similar to box plots, except that they additionally display the data’s proba-
bility density at various values, typically after being smoothed by a kernel density esti-
mator. All the information found in a box plot will be included in a violin plot, along with
additional information such as all sample points, provided the number of samples is not
too great, and a box or marker denoting the interquartile range and the median of the
data.
Quiet Room (Class 1): Mostly low-level background noise, with a narrow distribution of
amplitudes centered around a lower value.
People Talking (Class 2): A broader distribution of amplitudes due to the varying loudness
of speech, with more energy in the speech frequency range (around 300 to 3400 Hz).
Machinery Noise (Class 3): High amplitude noise with a uniform distribution across a
wide range of frequencies.
Figure 14.11 Histogram, Boxplot and Violine Plot of three classes of audio environments: Quiet
Room, People Talking, and Machinery Noise.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
428 14 Signal Processing
Time-series decomposition is a technique used to break down a time series into several
components, each representing underlying patterns such as trend, seasonality, and random-
ness. Understanding these components helps in analyzing, modeling, and forecasting time
series data. There are two basic approaches to time-series decomposition: additive and mul-
tiplicative models9.
9
Hyndman, R. J., & Athanasopoulos, G. (2018). “Forecasting: Principles and Practice”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
14.5 Time Domain Analysis 429
An example for the respective components Trend Tt , Seasonality St and randomness Rt can
be found in the example be found in Figure 14.12 created using the python code below:
Python code:
Additional Considerations:
Transformation: Transforming the data (e. g., using a logarithmic scale) can
turn a multiplicative relationship into an additive one, making it easier to
model and interpret.
Domain Knowledge: Understanding the nature of the data and the underlying
processes can guide the choice. For example, economic data often fit multi
plicative models due to inflation and growth patterns.
Visualization: Plotting the data may help identify the presence of trends and
seasonality, as well as their relationship.
Autocorrelation is a concept that measures how similar a signal is to a delayed version of it-
self, across different time intervals. As a practical example, imagine you have a melody in
your head that repeats over and over again. The similarity of the melody to itself when it
starts again after a few seconds is an example of autocorrelation. In technical terms, autocor-
relation helps us understand whether and how a signal is correlated with itself over time10.
Autocorrelation is used in many areas of signal processing to identify periodic signals (for
example, in music or radio waves), reduce noise, and recognize patterns within time series
data, such as those found in economics or meteorology. It is also crucial in time series fore-
casting because it provides insight into which past values of a signal have the most influ-
ence on future values.
Calculating the autocorrelation of a signal requires comparing the signal to itself over dif-
ferent time delays. Specifically, you multiply the values of the signal at a certain point in
time by the values at a later point in time, sum these products over all points in time and
normalize the result. In mathematical terms, for a signal s(t), is the autocorrelation function
R(t) at a time delay t defined as the integral (or in the discrete form, the sum) of the product
of s(t) und s(t + t), normalized by the number of observations. This calculation is done for
different values of t to see how the signal correlates with itself over time. This results in the
definition for continuous (analog) signals as follows:
For discrete (digital) signals, as they often occur in digital signal processing, data science or
machine learning, the formula looks slightly different:
10
Box, G. E. P., Jenkins, G. M., & Reinsel, G. C. (2008). “Time Series Analysis: Forecasting and Control.” John Wiley &
Sons.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
432 14 Signal Processing
where k is the delay in discrete time steps (for example, clocks or measurement points) and
N is the total number of observations (or measurement points). This summation over all n
also measures the similarity of the signal to itself after a delay of k discrete steps.
Practical Application: A practical application of autocorrelation in data science can be
found in time series forecasting, specifically in financial market analysis. Autocorrelation
allows analysts to identify patterns and trends in historical price data to predict future price
movements.
Autocorrelation is a way of finding repeating patterns in data. The autocorrelation plot re-
veals how the returns are related to themselves over different time shifts. The vertical line
at a 10-day lag, where we see a pronounced peak, illustrates the strong autocorrelation at
this interval. When autocorrelation is positive at this 10-day lag, it suggests that if the stock
was performing well 10 days ago, it’s likely to show similar performance now.
Possible Trading Strategy:
Buy Signal: Leveraging the 10-day positive autocorrelation, a strategy could involve buying
stocks right before this pattern is anticipated to recur, aiming to capitalize on the expected
rise.
Sell Signal: Following the rise, selling the stocks after the pattern has manifested allows
investors to potentially secure gains, acting on the cyclical trend identified by the autocor-
relation analysis.
Additional Consideration for Risk Management: It’s important to set a stop-loss level to
manage the risk. If the expected increase doesn’t occur and the stock’s value falls, the stop-
loss will help us to minimize losses.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
14.5 Time Domain Analysis 433
The Short-Time Fourier Transform (STFT) analyses signals in both time and frequency do-
mains by applying the Fourier Transform to short, overlapping segments11. This is mathe-
matically represented as:
Where s[n] is the signal, w[n – m] is the window function isolating segments around time
m, and w denotes angular frequency. The spectrogram, a visual representation, is the mag-
nitude squared of the STFT:
This approach captures how the signal’s frequency changes, essential for analysing non-sta-
tionary signals like audio, sensor vibration or seismic data. The spectrogram’s color-coded
intensity plot facilitates the identification of frequency patterns and changes over time.
The Discrete Wavelet Transform (DWT) decomposes a signal into a set of wavelets or wavelet
coefficients, which can be thought of as a representation of the signal at different frequen-
cies and positions. For a discrete signal s[n], the DWT is defined as:
where:
Wj, k are the wavelet coefficients,
is the wavelet function at scale j and position k,
n indexes the discrete time samples.
11
Griffin, D. W., & Lim, J. S. (1984). Signal Estimation from Modified Short-Time Fourier Transform. IEEE Transactions
on Acoustics, Speech, and Signal Processing, ASSP-32(2), 236–243.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
14.6 Time-Frequency Domain Analysis 435
More details on topic of mother wavelets and their role in signal processing can be found in
the work Daubechies12.
The Gramian Angular Field (GAF) is a transformation technique that encodes time series
data into images by mapping the data to polar coordinates, enabling the application of im-
age processing algorithms to time series analysis13. The transformation process starts by
normalizing the time series signal s[n] to be between –1 and 1, usually using the Min-Max
normalization. Then, each normalized value is mapped to a polar coordinate system, with
the value determining the angle q and the radius r set to a constant since only the angle is
relevant:
, for normalized s[n]
The Gramian Angular Summation Field (GASF) and the Gramian Angular Difference Field
(GADF) are then defined as:
12
Daubechies, I. (1992). Ten Lectures on Wavelets. Society for Industrial and Applied Mathematics. 66 (1), 1–357.
13
Wang, Z., & Oates, T. (2015). Encoding Time Series as Images for Visual Inspection and Classification Using Tiled
Convolutional Neural Networks. Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence. 29 (1),
1–7.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
436 14 Signal Processing
The distinct patterns observed with the spectrogram, wavelet GADF before the failure and
during motor failure and bearing failure can be used for failure type classification:
Feature Extraction: Extract features from the wavelet coefficients that characterize the
signal’s behavior in both the normal and failed states (see Figure 14.14). Features could
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
14.7 The Relationship of Signal Processing and Machine Learning 437
include the energy distribution across scales or specific frequency bands indicating normal
operation or failure.
Training a Classifier: Use these features to train a machine learning classifier (e. g., SVM,
Random Forest, Neural Networks) to distinguish between normal operation and various
failure types based on the extracted wavelet features (see Figure 14.14).
Real-time Monitoring and Prediction: Apply the trained classifier to the wavelet features
of real-time vibration data for continuous monitoring. The classifier can predict the motor’s
state and identify potential failures before they lead to breakdowns (see Figure 14.15)14,15.
Figure 14.15 Example of discrete wavelet transform for real time monitoring of a motor.
By leveraging the Wavelet Transform’s ability to capture both time and frequency informa-
tion, engineers can develop predictive maintenance systems that accurately classify and
predict motor failures, minimizing downtime and maintenance costs.
14
Kemnitz, J. et al. (2023). An Edge Deployment Framework to Scale AI in Industrial Applications. Proceedings of the
IEEE 7th International Conference on Fog and Edge Computing (ICFEC). Bangalore, India, 24–32.
15
Sepin, P. et al. (2024). Comparison of Clustering Algorithms for Statistical Features of Vibration Data Sets. Data
Science—Analytics and Applications.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
438 14 Signal Processing
Signal Processing and Machine Learning: Feature engineering is a critical step in the
machine learning pipeline, especially for signal classification. It involves extracting mean-
ingful attributes or characteristics from raw signals that can effectively represent the un-
derlying patterns or distinctions among different categories of signals. Good features cap-
ture essential aspects of the data that are relevant for distinguishing between classes while
being invariant to irrelevant variations.
Understanding Signals in Machine Learning: Signals, in their raw form, often contain a
wealth of information spread across different frequencies, amplitudes, and time intervals.
However, not all this information is useful for classification purposes. The aim of feature
engineering is to distill this information into a form that a machine learning model can use
to learn differences between classes of signals effectively.
Time-Domain Features: These include statistical measures such as mean, median, stan-
dard deviation, skewness, and kurtosis of the signal amplitude over time. Such features can
capture the signal’s central tendency, dispersion, and shape of the distribution.
Frequency-Domain Features: Through the application of Fourier Transform, signals can
be analyzed in the frequency domain, enabling the extraction of features related to the sig-
nal’s frequency content, such as power spectral density, band power, or specific frequency
components that are significant for classification.
Time-Frequency Features: Techniques like Wavelet Transform provide a combined
time-frequency analysis, useful for signals whose frequency content changes over time.
Features extracted can include wavelet coefficients, which capture both frequency and loca-
tion in time.
Statistical and Entropy Features: These features measure the complexity and unpredict-
ability of the signal, such as entropy, which can be particularly useful for classifying signals
that are chaotic or have complex patterns.
Dimensionality Reduction: Once features are extracted, techniques like Principal Compo-
nent Analysis (PCA) or Autoencoders can be employed to reduce the dimensionality of the
feature space, focusing on the most informative aspects while reducing computation and
the risk of overfitting.
Normalization and Standardization: It’s crucial to scale features appropriately to ensure
that no single feature dominates the ML model due to its scale, leading to more stable and
faster convergence during learning.
Feature Selection: Feature selection is an important step in the development of AI models,
helping to improve model performance, reduce overfitting, and decrease training time by
removing irrelevant or redundant features from the data. There are several methods for
feature selection, each suitable for different scenarios and types of data.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
14.8 Practical Applications 439
Filter Feature Selection Methods: evaluate the relevance of features based on their intrin-
sic properties, independent of any machine learning algorithm. Metrics such as correlation
with the output variable or mutual information are commonly used.
This preparatory work in feature engineering and feature selection is essential for building
effective machine learning models. It bridges the gap between raw, complex signals and the
structured, simplified input that machine learning algorithms can interpret and learn from.
With a solid foundation in feature engineering, Chapter 12 will dive deeper into machine
learning techniques and algorithms that leverage these features for signal classification,
exploring both traditional methods and the latest advancements in deep learning.
■ 14.9 In a Nutshell
Signal Processing
Signal processing refers to analyzing, modifying and synthesizing signals. It is
an essential technique for data analyzing, extracting, and interpreting infor
mation from data signals. It enables the transformation of raw data into a form
suitable for further analysis or machine learning applications.
Sampling and Quantization
Sampling and Quantization are the foundation of digital signal processing. It
converts continuous signals into a digital format through sampling and defines
signal amplitude levels via quantization, setting the stage for computational
analysis.
Frequency Domain Analysis
A mathematical tool decomposes signals into constituent frequencies, facili
tating the understanding of signal properties in the frequency domain.
The computation of mathematical operations as convolutions are way more
efficient in the frequency domain.
Convolution
Convolution is a mathematical operation often used in signal processing to
filter signals or images. In the context of Convolutional Neural Networks,
convolution acts as a filter that captures spatial features such as edges in
images, laying the foundation for these networks’ ability to perform image
recognition and classification tasks.
Noise Reduction and Filtering Techniques
Low-pass filters can be used to suppress high-frequency noise in signals,
enhancing signal quality by focusing on the essential frequency components.
Normalization and Standardization
Techniques to adjust signal scales, making different signals comparable or
more suitable for analysis.
Signal Transformation and Feature Extraction
Processes to derive meaningful features from signals, crucial for machine
learning models.
Signal Processing and Machine Learning
Signal processing plays an important role in machine learning by enabling the
transformation and feature extraction of raw data to allow valuable insights for
decision making or preparing data for machine learning models. It includes
techniques for enhancing data quality by filtering out noise and extracting rele-
vant features, thereby facilitating clearer insights and more robust deci-
sion-making.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
15
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
Foundation Models
Danko Nikolić
Figure 15.1 The Library of Alexandria is an example of an ancient method for organizing knowl-
edge. A foundation model can be considered a continuation of this effort (Source: https://en.wiki
pedia.org/wiki/File:Ancientlibraryalex.jpg)
The need to concentrate and organize knowledge is not limited to libraries. Much knowledge
is stored in the form of human memory, wherein the effort to organize it involves a transfer
of knowledge from teacher to student, from master to disciple. Human knowledge has also
been organized in the form of universities. The idea of a university is a place to gather many
masters, each an expert in their own specific fields. Thus, although a single person cannot
cover all the knowledge in the world, a university would, as a collection of individual ex-
perts, cover as much of that knowledge as possible. A university would then be combined
with a library for maximum effect: If a student then needed some specific piece of informa-
tion, for example, they could go to a university professor, who would either have the knowl-
edge ready or be the most effective at digging it out of the library. And, unlike with passive
text written on a piece of paper, a university professor could actively apply the information
to the problem or the question the student had posed them.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
15.1 The Idea of a Foundation Model 445
Technology as an accelerator
With the invention of the computer, this trend of accumulating and organizing knowledge
has continued. Databases, for example, became useful for quickly retrieving information by
simply entering keywords. Wikipedia, the online encyclopedia, is another great a chievement
in the collection of knowledge. Wikipedia is also an example of a further step in our effort to
organize knowledge, which was only possible with the advent of the Internet.
The trend is obvious: As technology advances, it is used to organize knowledge for easier
access and use. The question then arises: can we extend this trend to Artificial Intelligence
(AI)? To answer this question, let us first consider two types of AI. The first is the stuff of
science fiction; it is not bound by real-life limitations of existing technology. The second is
about what is possible using today’s state-of-the-art machine learning techniques.
Science fiction often describes an omniscient machine, an all-knowing entity that can ac-
cess practically any piece of knowledge and intelligently answer any question relating to it.
Incidentally, Wikipedia also organizes knowledge on such science fiction machines by
maintaining a “List of fictional computers”. Many of these computers have omniscient capa-
bilities, with HAL 9000 from The Space Odyssey perhaps being the most famous among
them. In reality, we are still not able to fulfill the checklist of features we wish such a com-
puter to have. AI today cannot make an all-knowing supercomputer. Nevertheless, recently,
we started coming quite close, thanks to the creation of foundation models.
necessarily stopping to think about why we’re learning those topics and how they fit into
the rest of the curriculum. This is called “learning by rote,” and although educators do not
necessarily recommend this form of learning, many of us are guilty of it. The hope, perhaps,
is that if we try to memorize something, we cannot help but understand it along the way. In
reality, although some degree of understanding follows learning by rote, it is low compared
to proper learning by understanding.
Bringing this back to foundation models, what this analogy highlights is that there are no
attempts to make models understand the materials during learning. In fact, nobody knows
how this could be done in machines because nobody knows how biological brains achieve
understanding. The machines are simply asked to produce some outputs in response to
certain inputs. There is, however, a pressure on machines to generalize. The machine’s in-
ternal representations are made such that it cannot just store any one piece of information
independently of all the rest. Instead, it is forced to share representations. This gives rise to
some minimal ability to generalize, which in turn is some very rudimentary form of “under-
standing”. Granted, this “understanding” has nothing to do with how we humans under-
stand the world. Nevertheless, in good hands, a machine that results from this work—a
foundation model—can become quite useful as shown in Figure 15.2.
To understand how one can induce a machine to generalize—at least a little—from large
amounts of data and a generic input-output task, we have to explain what an autoencoder
is. Autoencoders are discussed in Chapter 12 in more detail, but I will again reproduce a
high-level explanation here. As the neurons in a neural network detect patterns and then
pass those across network layers to other neurons to detect patterns, there comes a point in
the architecture where it only a small number of neurons are available to detect and pass on
patterns, or to receive and further process them. For example, we can pass the inputs to one
million neurons, which pass them on to another million neurons, and so on. But at some
point we will have a smaller layer in the network, wherein the outputs from those million
neurons will converge to, say, only 10 neurons (this often occurs in the final layer, if not
earlier). Now, the model has an interesting problem to solve. It has to find a means of
pattern detection such that the outputs of these 10 neurons represent—in some useful
way—all of the inputs that the model receives during the training. Basically, the inputs must
be reduced to a 10-dimensional representation. This is a representational bottleneck. The
usefulness of this representation is evaluated by passing this information to other neurons.
Once again, we send these outputs to a large number of neurons (a million, for example) and
we do it multiple times (that is, across multiple layers) until the output is generated. If the
model finds a way of accurately reproducing the inputs in its outputs while passing the
outputs through a bottleneck, then it will somehow manage to abstract the representation
of the inputs. In other words, it will have achieved some degree of generalization, like a
student who has memorized her exam materials.
A simple autoencoder is depicted in Figure 15.2. A foundation model can form an entire
autoencoder—that is, it may consist of both encoder and decoder components—or it can
consist of only one part, for example, only the decoder. Large Language Models, the most
popular foundation models, often consist of the decoder only. Famous examples include the
GPT family of models, as well as Google’s PaLM and Facebook’s Llama.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
448 15 Foundation Models
Figure 15.2 The architecture of a simple autoencoder with the representational bottleneck the
output of which can be understood as “code” sent from encoder to decoder akin to the code used
to efficiently transfer messages in information theory, e. g. “Morse code” (Source: https://en.wiki
pedia.org/wiki/Autoencoder#/media/File:Autoencoder_schema.png)
In general, foundation models are built by combining rote learning, the logic of autoencod-
ers, and efficient representation at the bottlenecks. In practice, the specifics of how they are
created depends on many factors. The choice of the architecture will depend on the modality
of the problem. A foundation model for language will have a different architecture than that
for vision, which will differ from one for Ribonucleic (RNA) acid data. What is common
among foundation models is that they involve large amounts of data, which—combined with
the generalizations emerging from the bottlenecks of autoencoding—gives them broad appli-
cability. We most commonly discuss large language models as instances of foundation mod-
els, since they are the most common type. But remember, the idea of a foundation model is
much broader than just being used for language. Much of what is being said here for lan-
guage is likely to apply to foundation models in other modalities.
To generally understand the relation between the intelligence achievable by a foundation
model and the intelligence of an average human, imagine a person who has memorized all
of Wikipedia (perhaps an impossible task, but there certainly are competitive memorizers
who attempt similar feats). This person would be able to reproduce whichever page of Wiki-
pedia we need, and would also be able to paraphrase and answer some simple questions
about the contents (even for complex topics, about which they may not have a thorough
understanding). Most importantly, we could play the following game: We give this person
some arbitrary text to read and ask them to pretend that this text is the beginning of a Wiki-
pedia page. We then ask them to give us their best guess on how this new Wikipedia page
continues; we simply ask them to spit out the first words and sentences that come to their
mind. This may be a useful way of thinking about what foundation models can do for us.
The model has no full understanding—or at least, it has a different type of understanding
than us—of the text. Nevertheless, there is a lot of useful content in that memorized infor-
mation.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
15.3 How Do we Use Foundation Models? 449
These two different methods use, in a way, different aspects of knowledge already stored in
the model. Both aim to bring the model to a new use—providing it with a new set of knowl-
edge specific to the task at hand. However, the existing knowledge is used somewhat differ-
ently. Full fine-tuning methods affect all of the model’s parameters, but only slightly slightly.
With this kind of fine-tuning, one should not train too extensively, as one can obtain more
damage than benefits. In contrast, as partial fine-tuning changes only parts of the model,
these changes can be substantial. Even extensive training will not damage the frozen parts.
Only later layers need to be retrained, enabling the model to go from detecting cars in traffic
footage, to fruits in agricultural images, for example.
Deep learning models have, of course, many layers, even up to a hundred or more. What
these models have in common is that more general aspects of data are learned in early lay-
ers, and more specific aspects in later layers. This results in the processing stages gradually
progressing from more general to more specific. This fact is used in transfer learning. If we
do not train the general parts of a model, but instead, only the specific parts, we can simplify
our learning process and provide new functionality to a model quickly and with small
amounts of data. Sometimes, it is enough only to train the last layer and keep everything
else frozen. This property is often used to assign multiple functionalities to the same model
simply by passing the outputs to different final layers, which are then often referred to as
“heads”. One can thus apply different heads to perform different tasks on the same inputs.
For example, a single vision model in autonomous driving can process an input image and
output it to different heads to receive different outputs: One head could detect pedestrians
on the road, for instance, while another could be used for maintaining a safe distance from
other cars. Yet another head could be used to detect traffic signals, and so on.
edge the model will need to accomplish its task. For instance, one can directly include ex-
amples of what the model should do in the input instruction (the prompt). The model has so
much knowledge—that is, it is such a great library—that the effect is similar to learning a
new skill.
In many cases, not even an example is needed. A simple instruction would do. We could ask
the model to describe cars but replace the word “car” with “asparagus,” for example, and it
could do so. In fact, here is a real-life outcome of such an attempt:
The prompt: “Can you write a short text describing cars but instead of using the word ‘car’,
use the word ‘asparagus’ instead?”
The response:
“Asparagus are one of the most common modes of transportation in the world.
They come in different shapes, sizes, colors, and brands. Some asparagus are powered by
gasoline, diesel, or electricity, while others are hybrid or solar. Asparagus have four wheels,
a steering wheel, a dashboard, and seats for the driver and passengers. Asparagus can
travel at various speeds, depending on the road conditions and traffic laws. Asparagus are
useful for commuting, traveling, shopping, and leisure. Asparagus also have environmental
and social impacts, such as pollution, congestion, accidents, and status symbols.”
Of course, some of the techniques mentioned earlier could achieve the same effect. We could
take a trained language model and change its parameters, using transfer learning to train it
to automatically convert “car” to “asparagus” in any text it receives. Alternatively, we could
use fine-tuning to create a model that treats asparagus and cars as the same thing. In this
case, we would move the model parameters to a nearby location. And although an asparagus
is far from a car, semantically, in the whole space of a large foundation model of language,
only a few parameters would need to be changed for such a move. Therefore, the distance
between a language model treating asparagus and cars as semantically different entities, as
opposed to semantically interchangeable entities, is small. Finally, we could use Guided
Transfer Learning to push this even further. After first learning the exact parameters needed
to make word replacements in general, one could then, in theory, re-train the model with
only a handful of examples to make replacements for any arbitrary choices of words.
But all this is not necessary if instructions already work with zero-shot learning. The ques-
tion is, then, can zero-shot learning completely replace other forms of learning? To answer
that, we need first to understand how zero-shot learning works. What is going on under the
hood?
but has quite a bit of detail of original facts and examples. This is what enables a model to
retrieve specific facts, for example. If the model is small, these individual examples get av-
eraged out, and no traces of individual facts are left. However, if the model is large enough,
there is enough room in the parameter space for even a single data point to leave a unique
trace. Thus, a training example is “stored” at multiple locations in the parameter space. The
ability to use the example at different levels of granularity is partly achieved through atten-
tion, which is a property of Transformers and is explained in Chapter 17.
Attention determines whether a model draws on more abstract knowledge—likely stored
somewhere in its lower network layers—or some more concrete piece of information will be
used—possibly residing in its upper hierarchy of layers. The two types of information can
even cooperate. For example, a concrete example can be “activated” by a different phrasing.
In the original training data set, word replacement may have been formulated differently,
not by saying “instead of using x use y”, as in our example prompt above, but rather, “replace
x with y.” The more abstract part of the model knows that “instead of using . . . use . . . ” is the
same as “replace . . . with . . .”. The more concrete part of the model knows how to execute the
replacement. And naturally, x and y need not only be limited to cars and asparagus.
A smaller model would not be able to do that task successfully. It would likely know that
“instead of” and “replace” are the same thing. However, it would not have enough room in
its parameter space to store information on replacing words in a text.
The inverse is also true: a large model trained on too small a dataset might understand that
“instead of” and “replace” mean similar things, but without seeing an example of such a
word replacement, it would lack the knowledge to do this task. Therefore, such an example
of zero-shot learning can be done only with a large enough model trained on a large enough
dataset. In other words, it requires a foundation model.
savings are, however, limited. The higher the complexity of prompts, the more “prompt
engineering” is needed, and the deeper one must dig into the model for what one needs.
This may result in weaker performance: even small deviations from a carefully crafted
prompt are more likely to produce errors. Again, the reason is the competition with other
contents stored in the model. Therefore, a more intense need for prompt engineering makes
a stronger case for training with transfer learning or fine-tuning.
Although training a foundational model takes significant effort, it also brings multiple ad-
vantages. Training limits the amount of competing information, so that the model can better
use all of its knowledge to support the task, rather than compete with it. During training, the
competition is primarily suppressed. Also, training allows for achieving the same perfor-
mance with smaller models. This is because the interaction between various granularities
of abstraction is less demanding; the model has to make fewer decisions about discarding
alternatives, allowing for a smaller amount of knowledge to work. It is like navigating a
smaller library specialized for a certain topic. Smaller models carry many advantages: the
costs of running such models are lower, and they have a smaller carbon footprint. Finally,
there are some use cases where only smaller models are appropriate; applications in mobile
devices, such as inside autonomous vehicles, is one example.
A great example of additional training is OpenAI’s transition from GPT 3.0 to ChatGPT. De-
spite being smaller than its predecessor, ChatGPT made headlines like no other foundation
model. It charmed the world and set a historical record for the speed at which it acquired
users. So, what was so great about ChatGPT, despite it being a smaller model? What made it
a success? Was it “chattiness”?
The model was trained to be suitable for conversation-like interactions with humans. Con-
versations are typically only a small fraction of the training data set used for training foun-
dation models for language. Nevertheless, OpenAI recognized the need for a foundation
model to be able to interact with people. This interaction was mostly related to taking in-
structions—that is, the model was trained to better understand the prompts that people are
likely to use. Without the “chatty” training, a foundation model of language is not great in
responding to instructions. There was a need for this specific improvement, which OpenAI
solved by, among other techniques, Reinforcement Learning with Human Feedback. Pre-
cisely this training process is explained in more detail in the following chapter.
This raises the question of whether a similar effect could be achieved just by engineering a
proper prompt for the larger GPT 3.0 model without having to go through the hard work of
training by reinforcement learning. In such a scenario, GPT 3.0 would use the existing con-
versations from its vast knowledge as models for interacting with users. The answer to this
question is that probably, to some degree, this could be achieved, but the result would not
be nearly as good. In addition, the costs would be much higher in the long term because a
much larger model, with a much higher energy consumption, would need to be used. The
transition from GPT 3.0 to ChatGPT illustrates the power of additional training well. Rein-
forcement learning added a powerful conversational “interface” to the model. Similarly,
many other uses can benefit from the additional training of foundation models. Not every-
thing can or should be achieved through prompts. For example, one of the great tricks that
ChatGPT was trained to do is to take on “roles”, such as acting as a research assistant, a
coach, or any other persona a user prompts it to “become”. It can then use its vast knowl-
edge to perform tasks in this role.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
15.4 A Breakthrough: There is no End to Learning 455
To conclude, foundation models contain a lot of knowledge because of their sheer size. This
knowledge is stored at different levels of abstraction granularity and can be used for various
purposes. There are several different ways of using this knowledge, each with its pros and
cons. Some involve additional training, requiring intensive resources upfront but fewer re-
sources later. Others involve clever prompting, which is cheaper in the short run but may
not be the most efficient way of using foundation models long term.
■ 15.4 A
Breakthrough: There is no End to
Learning
Earlier, I mentioned OpenAI’s secret project. What exactly was happening during the proj-
ect and why the secrecy, we don’t know exactly. However, possibly this was the big question
on their mind: Is there a limit to learning for foundation language models? At that time,
language models already existed but were not nearly as large as today’s. And this is when
OpenAI asked: What will happen if we keep adding parameters and more data? In other
words, are there any significant benefits from training monstrously large models using
monstrously large datasets?
I think it is fair to say that what they found was quite surprising for most people in the AI
world. It seemed that, with increasing model size, models showed diminishing returns. To
bring an already considerably sized model to an even higher performance, one needed to
add a lot of resources—many more than what was already invested into building the first
model, which already had a considerable size. This fact likely discouraged many people
from even trying. Not only was it considered too big of an effort for whatever may come out
as a result, but it was not even known whether models would continue to learn. There was
always the possibility that, by systematically adding more parameters and more data to the
same problem, one would at some point cease to observe any benefits; the performance
would simply improve no more. There were plenty of examples in computer vision and other
fields where models seemed to reach a kind of saturation point for a given task: These mod-
els seemed to have reached about the maximum performance that could be reached for the
given model training strategy. Much emphasis was placed on creating better models with
more suitable mathematical properties, with better inductive biases (see Chapter 13), and
enhancing the training data.
OpenAI dismissed this attitude, and did exactly what everyone else thought did not make
much sense: they went for brute force. If a small amount of brute force did not work, they
said, let us just use even more—make even bigger models trained on even more data—and
see what happens.
They were rewarded with two remarkable discoveries. First, they found that the models kept
on learning. No matter how large and smart a language model already was, it could always
learn more and become even smarter. There was no end to learning. Thanks to OpenAI’s
willingness to invest exuberant amounts of money into what most saw as waste, we now
know that deep learning models may never be done increasing their intelligence. Even
when it looks like an AI model based on deep learning reached its peak, there may still be
a lot of room for improvement.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
456 15 Foundation Models
OpenAI’s second discovery, which was contingent on the first, was probably even more im-
portant. It relates to the issue of models being useful. If we put hundreds or even a thousand
times more resources into building a model, will it be correspondingly more useful? Will it
have hundreds more practical use cases? If we were to put immense resources into the brute
force training of a foundation model and then not have many uses for it, and only manage to
increase the accuracy when applying it to the same uses as the smaller model, perhaps the
immense effort wouldn’t be worth it. Certainly, it would be a disaster from a business per-
spective: who wants to spend masses of time, effort, and money building a gigantic library
of knowledge that people are unwilling or unable to use? This is exactly where OpenAI dis-
proved any nay-sayers. Today, the AI industry is fueled by the variety of practical possibili-
ties that foundation models offer, not only those of language but also others, such as a
foundation model for vision [3], a large action model (LAM) that knows everything about
interacting with various software interfaces, a foundation model for time series [4] or a
combination of multiple foundation models for planning [5]. There is a gold rush for build-
ing various foundation models as they have become immensely useful.
To conclude, foundation models come at a great cost. In Chapter 13, we discussed the power
law that underlies the resources necessary to keep increasing the performance of deep
learning models. This negative aspect is, however, counteracted by one enormous positive:
Apparently, there is no limit to learning. One can, in principle, keep adding knowledge in-
definitely. More importantly, this learning effort is not wasted, but is useful in commercially
viable ways.
■ 15.5 In a Nutshell
References
[1] Nikolić, D., Andrić, D., & Nikolić, V. (2023). Guided Transfer Learning. arXiv preprint arXiv:
2303.16154.
[2] Li, K., Nikolić, D., Nikolić, V., Andrić, D., Sanders, L. M., & Costes, S. V. (2023). Using Guided Trans-
fer Learning to Predispose AI Agent to Learn Efficiently from Small RNA-sequencing Datasets.
arXiv preprint arXiv:2311.12045.
[3] Yuan, L., et al. (2021). Florence: A new foundation model for computer vision. arXiv preprint arX-
iv:2111.11432.
[4] Yeh, C. C. M., et al. (2023, October). Toward a foundation model for time series data. In: Proceed-
ings of the 32nd ACM International Conference on Information and Knowledge Management (pp.
4400–4404).
[5] Ajay, A., et al. (2024). Compositional foundation models for hierarchical planning. Advances in
Neural Information Processing Systems, 36.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
16
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
Generative models, by contrast, are designed to generate entirely new data. They can cre-
ate rich outputs in various modalities, including text, images, and video. So, instead of ask-
ing a predictive model to classify an image as containing a cat or a dog, for example, we can
ask a generative model to create an image depicting a cat, or a dog, or just about anything
else we can imagine.
Generally, creating such complex outputs is a much more difficult task, due to the great
number of intricate dependencies between even the minutest components of each output. To
generate an image, for example, a model cannot simply spit out each of the millions of
required pixels independently. Instead, the content generated in one part of the image will
depend on what has been generated everywhere else. For example, in real-life photographs
of building interiors, paintings tend to hang on the walls; they do not lay on floors, nor are
they attached to ceilings. Such “rules” are obvious to us, of course, because we’ve learned
them through a lifetime of observing different interiors for ourselves. A Generative AI
model needs to learn such rules in a similar way: by being exposed to thousands of images
depicting different types of buildings and rooms.
Learning such a huge number of complex dependencies requires a huge amount of input
examples. Hence, a generative model usually requires a lot more training data than a pre-
dictive one. The “rules,” or dependencies, can be understood as multi-dimensional distribu-
tions that need to be approximated. These distributions indicate the likelihood that a certain
combination of features occurs in the data. For example, pixels representing a photo frame
will occur more often in combination with pictures representing a wallpapered wall than
they do with pixels representing a carpeted floor. Once an image generation model has
learned such dependencies between all sorts of possible combinations of all sorts of possi-
ble pixel features, it can generate images which obey them. And using this principle, we can
create different modalities of generative models by exposing the right machine learning al-
gorithms to different kinds of inputs: texts, programming code, music, biological sequences,
and more.
With enough data and enough computational power, it is possible to achieve impressive re-
sults with Generative AI (hereafter referred to interchangeably with “Gen AI”). The learned
approximations of feature distributions are usually not perfect, of course, but neither is
predictive AI. Much like predictive AI may mistake a chihuahua for a muffin in an image
classification task, for example, a generative model may create images that would be impos-
sible in real life. When generating images of humans, for instance, Gen AI has a hard time
sticking to five fingers on each hand, often drawing four or six fingers instead. But despite
such glitches, Gen AI has turned out to be immensely useful, and has vastly widened the
general applicability of AI in everyday life.
structures and protein sequences. Time series and graphs are yet two more interesting
possibilities, and so the list goes on. There is no limit to the type of modality to which Gen
AI algorithms can be applied.
When we talk of “modality”, we don’t just mean the type of output that an AI model gener-
ates. The inputs that trigger the generation are equally relevant, and can be equally varied.
The simplest models, known as uni-modal models, use the same mode of data for both
inputs and outputs. For example, in Large Language Models, which we’ll discuss in detail
later, text input is used to generate text outputs.
We can also combine modalities to create multi-modal models. Textual inputs can be used
to generate images, or, conversely, we can generate text outputs (descriptions) based on
image inputs. Information from a piece of text can be converted into a graph, and graphs can
be described in natural language. Even the inputs and outputs themselves can be multi-
modal. For example, a model may be fed an image and some instructions describing what
needs to be done with that image, and it will output an image with the required changes.
It could even produce audio or visual content to match the input: it all just depends on how
the model was trained.
Much like generative models need to be trained separately for different modalities, it is
often useful to separately train them for specific domains within one specific modality. For
example, computer code is nothing but text. Nevertheless, we’re likely to get a better code
generation model if we train a model specifically for that task, and feed it only inputs that
include code (perhaps along with explanations and documentations of code). Similarly, it is
a good idea to separately create generative models for videos and images, even though a
video is technically nothing more than a series of images.
The landscape of common tools and publicly available models is changing quickly. Below we
list a few popular examples (at the time of writing) for the various modalities we’ve just
discussed:
Text: ChatGPT, Bing Copilot, Gemini, LLaMA, Claude
Images: Imagen, Stable Diffusion, Midjourney, DALL-E
Music: MusicLM, Soundraw.io, Amper Music, Humtap, Stable Audio
Video: D-ID, Gen-2 from Runway, Pictory, Synthesia, Fliki, Sora
Code: GitHub Copilot, Codey, Tabnine, Polycoder, DeepCode
One may wonder how combining all these diverse modalities into single models is possible.
The secret lies in the encoder-decoder architecture of multi-modal foundation models (dis-
cussed in the previous chapter). The encoder takes the input and transforms it into an inter-
mediate representation, known as an “embedding”, which is then used as input to the de-
coder. A nice property of this intermediate embedding representation is that it is independent
of the modality. The embedding is said to describe the semantics of the input: that is, its
meaning. For example, you could provide a multi-modal model with either the text, “a horse
is passing near an oak tree,” or with an image depicting just that. In either case, the algo-
rithm may generate very similar embeddings, under the hood. This is how AI abstracts the
inputs from different modalities. It uses two different encoder models, one for text and an-
other for images, with both encoders being trained to work with the same embedding space.
Similarly, the same AI may pass these intermediate representations to different decoder
models, each trained for a different modality. One may generate sounds, the other images,
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
462 16 Generative AI and Large Language Models
and yet another one, texts, but all are working in the same embedding space. Therefore, the
trick of multi-modal generative AI lies in the encoder-decoder structure.
While we hope to have made the general concept of training generative models clear, we
cannot possibly squeeze in a discussion of all types of architectures and algorithms within
this chapter: such a task would fill a book in itself! Those who wish to understand more
about machine learning foundations can visit Chapter 12, which even deep dives into two
types of models commonly used for image generation: Generative Adversarial Networks
(12.6.10) and Autoencoders (12.6.8). Large Language Models, including how to train and
fine-tune them, will be discussed next. So, as you read on, remember that virtually anything
which can be treated like a text sequence can potentially be modelled via LLMs: this in-
cludes text, code, protein sequences, and much more.
Since the launch of OpenAI’s ChatGPT at the end of 2022, the landscape of Artificial Intel-
ligence text generation has undergone a remarkable transformation. This groundbreaking
development marked the rise of accessible AI-powered text generation, captivating the pub-
lic’s imagination and sparking widespread interest. Built upon the foundation of Large Lan-
guage Models (LLMs), ChatGPT represented a significant leap in the capabilities of AI-
generated text. While earlier models (such as ELMo and BERT, which will be discussed in
Chapter 17) had certainly exhibited much promise, they were primarily of interest to aca-
demics and specialists; ChatGPT, on the other hand, rapidly became a tool for anyone to use
and explore.
Since its beginning, the field of LLMs has evolved rapidly, with companies racing to develop
ever more sophisticated and powerful models. This fast pace of innovation has driven the
technology forward at an unprecedented rate, pushing the boundaries of what was once
thought possible in the domain of AI-driven text generation.
Conceptually, a language model is an Artificial Intelligence system designed to predict the
next word in a sequence based on the preceding words or context. In their early implemen-
tations, these models made their predictions based on only a few preceding words; Now,
thanks to advancements in machine learning and, in particular, neural network algorithms,
language models can consider vast sequences of words, leading to more accurate predic-
tions.
Mathematically, the prediction of the next word is framed as calculating conditional proba-
bilities. This involves assessing the likelihood of a specific word occurring given the context
provided by the preceding words. The word with the highest probability within the model’s
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
16.3 Large Language Models 463
vocabulary is then chosen, in a process known as sampling. The chosen word is then ap-
pended to the existing text, and the process repeats iteratively until a predetermined stop-
ping condition is met. This autoregressive mechanism, also known as causal generation,
ensures that the model only considers previously generated words and not those predicted
afterward, as would be the case in bidirectional approaches.
Over time, language model methodologies have undergone significant changes. Initial ap-
proaches relied on simple n-gram models before transitioning to neural network architec-
tures such as feedforward networks [1], convolutional [2] and recurrent neural networks
[3], including variants like long-short term memory networks [4]. However, the most sig
nificant advancement came with the introduction of Transformer models [5], which revo-
lutionized the field with their self-attention mechanism. Their architecture allows models to
consider a broad context within a sentence, significantly enhancing their predictive capa
bilities. Chapter 17, Natural Language Processing (NLP), presents all these algorithms in
detail, showing how each new development built upon previous successes to bring us to the
revolutionary moment NLP is enjoying today.
More recently, newer architectures like Eagle [6] and Mamba [7] have demonstrated com-
petitive performance without relying on attention mechanisms or employing state-based
models. These models promise comparable performance to traditional Transformer-based
LLMs of similar size, while being more computationally efficient and enabling faster infer-
ence.
The ability of language models to predict the next words and generate coherent text stems
from their training on vast amounts of data sourced from the internet. Through learning the
statistical probabilities inherent in language patterns, these models become adept at gener-
ating meaningful responses to queries and crafting creative pieces of text, such as emails or
stories. With recent advancements, they’ve also gained limited capability to reason through
complex tasks (see Section 16.3.3.3).
Large Language Models vary in several key aspects, which contribute to differences in their
capabilities:
Performance on benchmark tasks: LLMs are often evaluated based on their performance
on benchmark tasks, which serve as standardized tests to measure their effectiveness. Ex-
amples of benchmark tasks include language understanding tasks like question answering
and text classification. LLMs may excel in certain tasks while performing less optimally in
others, depending on their design and training.
Open source vs. closed source: LLMs can be further classified as either open source or
closed source. Open-source models provide access to their architecture and parameters, al-
lowing researchers and developers to modify and fine-tune them for specific applications.
Closed-source models, on the other hand, restrict access to their internal teams and are
typically only provided to the public as pretrained models through APIs or licensed soft-
ware.
Number of parameters: LLMs come in different sizes, usually quantified by the number of
parameters they possess. Small LLMs (“SLMs”) might contain a few billion parameters,
whereas larger models can encompass hundreds of billions. Generally, the parameter count
correlates with the model’s complexity and capacity to capture nuances in language (though
much research is attempting to test the limits of this relationship).
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
464 16 Generative AI and Large Language Models
Algorithms: LLMs leverage different algorithms for text generation, with the Transformer
architecture being the most prevalent. Newly developed variations of the original attention
mechanism (such as [8]) can enhance a model’s ability to capture long-range dependencies
and contextual information. Additionally, some LLMs may utilize alternative architectures,
such as recurrent neural networks or state space models.
Training data: The size and quality of the training data significantly impact the perfor-
mance of LLMs. Models trained on larger and more diverse datasets tend to exhibit superior
performance due to their exposure to a broader range of linguistic patterns and contexts.
Even models with the same parameter count can demonstrate substantial differences in
performance based on the quality and quantity of their training data. Fine-tuning existing
models on specialized datasets (see Section 16.3.4.2) can further enhance their perfor-
mance for specific tasks, boosting it beyond that of larger, general-purpose models.
Pre-training
Off-the-shelf Large Language Models undergo a comprehensive training process before they
are made available to users through APIs or downloadable via platforms like HuggingFace.
This process starts with an initial pre-training phase, where the goal is to train the model
to predict words in a sequence. Often this is done by exposing the model to an extensive
amount of text data scraped from the internet, randomly masking out words, and having the
model fill the gaps. Initially, the model’s predictions will be quite random. But as training
progresses, guided by an objective to minimize errors in the model’s next-word predictions,
the model’s parameters are iteratively adjusted until it has learned to capture the intricate
syntactic and semantic relationships in the text. In other words, through exposure to natu-
ral language data, the model gradually gains a statistical “understanding” of which words
make sense in different contexts, and in combination with which other words.
Such an approach is known as self-supervised learning [9] because it enables the model
to learn from data without requiring external labels, relying instead on the inherent struc-
ture of the text itself. Unlike in supervised learning, no explicit labels are provided during
this process, as the missing words are already known from the text. The quantity and qual-
ity of this data are paramount, however, as they directly impact the model’s ability to learn
and generalize from the information provided.
Instruction Fine-tuning
Pretraining LLMs is a fundamental step in their development, providing them with solid
foundational knowledge (hence the term “Foundation Models”, the title of the previous
chapter). However, simply predicting missing words in a text doesn’t necessarily serve
much purpose in the real world, which is why we next conduct a supervised learning step,
known as instruction fine-tuning [10]. The purpose of this phase is to train the model
to follow human instructions more precisely. This is achieved by providing the model with
labeled data consisting of examples that illustrate how it should respond to specific ques-
tions or instructions. Through exposure to explicit examples provided by humans, the model
learns to better understand and execute tasks according to human expectations.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
16.3 Large Language Models 465
For many people, their first interaction with LLMs was directly through a chat interface,
such as ChatGPT. Even development teams may start an LLM initiative by testing their idea
directly with a publicly available chatbot, which, given the right instructions, can already
achieve an impressive number and variety of tasks. When more is required of the LLM,
however, two key design patterns often come into play: augmenting the LLM with additional
information to help broaden its knowledge base, and providing it with access to tools, with
which to execute more complex tasks. The following three subsections explore at each of
these options in turn.
as if the user and the “bot” were engaging in ordinary human conversation. This signifi-
cantly improves accessibility, reaching a broad audience beyond just experts.
Pretraining the model serves not only to help it comprehend natural language instructions,
but also to tackle problems out-of-the-box, without requiring the user to worry about model
training. For instance, the model could be prompted to categorize a message as positive,
negative, or neutral, without the need for explicit training examples. This capability, called
zero-shot inference, essentially relies on the model’s pre-training knowledge alone to ful-
fill the assigned task.
When zero-shot inference is not enough, such as with smaller language models and/or more
intricate tasks, users can provide the model with a few examples illustrating how to tackle
a task. This method has multiple names, including few-shot learning, few-shot inference
[12], and in-context learning. To instruct the model to classify customer enquiries by ser-
vice line, for example, one might present sentences such as, “I would like to extend my
payment deadline”, “I need to reset my password”, and “I want to make a purchase”, along
with their corresponding labels: “billing”, “assistance” and “sales”. Through these exam-
ples, the model acquires knowledge, enabling it to generalize and address similar problems.
Note that the term “learning” here is somewhat ambiguous since the model’s internal pa-
rameters — its weights — remain unchanged, and once the examples are removed from the
prompt, the model will forget them.
While zero- and few-shot-inference can be very powerful, examples alone are not guaran-
teed to get the best out of an LLM. The way the examples and any other instructions and
context are provided to the model are also vitally important, and bring us to the art of
“prompt engineering”.
redundant or duplicated information. Rethink and rephrase, if necessary. This will force
you to get specific about the problem you need solved. It will also help the LLM pay atten-
tion to the core issue, and not get sidetracked by unnecessary details. And note that a
concise prompt need not be short. It should be as thorough and detailed as is needed to
clearly convey your requirements.
Another best practice is to simplify the problem. Try turning open-ended requests for
information into closed classification questions: Instead of “How should I set up a machine
learning monitoring tech stack?”, ask, “Which of the following tools are most appropriate for
machine learning monitoring, given that I currently use Kubernetes and Google Cloud Plat-
form: Tool X, or Tool Y?” You can also have the LLM choose from a provided list of possible
answers, instead of coming up with its own interpretations: Instead of “Which topics can
you identify in this customer inquiry email?”, try, “Which of the following products and ser-
vices are mentioned in the provided customer inquiry email? Products & services: {}. Cus-
tomer Email: [].” Of course, here you would insert your products or services within the {}
placeholder, and the customer email within the [].
Provide constraints, to ensure an LLM’s output is actually useful to you. This is particu-
larly relevant when using it as an ideation tool or brainstorming partner. Say you want
help drafting product descriptions for a new advertising campaign: specify the desired
text length, provide certain banned keywords (such as those associated with your compet-
itors), and tell the model to only use product features you’ll provide within the prompt.
Another tip is to specify the context and target audience. One way to do this is implic-
itly, simply by changing the style of your own prompt, and letting the model adapt its style
to match. Thus, a question like “What are the benefits of taking cold showers?” could result
in an academic, yet impersonal response, whereas asking “Why and how should I incorpo-
rate cold showers into my daily routine?” will likely generate a much more casual, personal
output. Instead of changing your own tone, you can also simply state your desired context
or target audience directly. This is particularly convenient for documents with well-de-
fined and well-known formats, as you don’t need to spend a lot of time specifying those
formatting rules: Ask an LLM to create an Instagram Post, LinkedIn newsletter or Google
Ads text, for example, and it will adapt its language, output length, and use of hashtags
and emojis accordingly. Adding the target audience will also help ensure that the result-
ing text is appropriate and appealing for the desired final reader.
This leads us to the next best practice, which is to describe the input format so that the
LLM knows how to handle it, such as being able to differentiate between instructions and
additional context information. For example, try something like, “You will be provided with
a company document, denoted in hashtags, and an employee question, denoted by angular
brackets. Use information from the document to answer the question. The document: ##. The
question: <>.” Again, here you would insert the document between the ## symbols, and
the question between the <>. As this example shows, it can also help to use special char-
acters to structure the prompt more clearly for the model.
Similarly, describe the output format exactly as you need it, keeping in mind the down-
stream task you need the LLM’s output for. Turn unstructured documents into structured
data, for example, by having an LLM extract certain expected entities into a JSON string
featuring {Entity A: Value, Entity B: Value}. Such an output can easily be fed to code tools
or programs, or saved as a table.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
468 16 Generative AI and Large Language Models
Figure 16.1 Flow of retrieval-augmented generation. User queries and text chunks from relevant
documents are embedded using an embedding model. Relevant chunks are retrieved from a vector
database based on a similarity measure between the query and chunks. These relevant chunks form
the context, which is then added to the prompt, together with the query. The prompt is subsequently
fed to the LLM, resulting in the generation of a response.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
470 16 Generative AI and Large Language Models
Theoretically, we could tackle both issues by giving the LLM a lot of information directly in
the prompt, such as all of our company’s internal documentation. This would provide it with
the knowledge it needed, and make its responses more transparent. However, in practice,
an LLM’s small context window might not fit all the needed documents. Even if it could,
we have the token-based processing costs to think about. Finally, this may also lead to po
sitional bias: When you load the prompt with lots of information, the LLM tends to focus
more on the beginning and end of the window, possibly missing important details in-be-
tween [14].
Retrieval-Augmented Generation (RAG) [15] aims to solve these challenges by only add-
ing additional task-related information from an external database into the prompt, rather
than trying to fit all the needed knowledge into the limited context window (Figure 16.1). A
RAG system works like a search engine, finding and retrieving relevant documents to help
the LLM with the task at hand. A notable advantage of RAG is that it can easily adapt to
changes in the external database. If the database is updated, the LLM can use the new infor-
mation without needing to retrain its weights. This flexibility allows the LLM to quickly
adapt to new data. Also, by focusing on the information in the context window, RAG can
reduce problems like hallucinations. Moreover, users can see what information the LLM
used to make its decisions, which improves transparency and interpretability.
The RAG process starts when a user asks a question (writes a query) and ends when the
LLM gives an answer or solution. This process happens in a few steps: indexing, retrieval,
and generation.
Indexing: During indexing, documents are collected, parsed, and stored in a database. How-
ever, for a specific question, only a few paragraphs of a specific document might be relevant.
Thus, documents are segmented into chunks, usually with some degree of overlap. Next, the
text chunks are transformed into embeddings using an embedding model. These are then
stored in a vector database, also called index, designed specifically to store and efficiently
query embeddings.
Retrieval: The user question is transformed into an embedding with the same model used
to embed the text chunks during the indexing stage. The resulting embedding is then com-
pared with all embeddings in the database using similarity measures such as cosine simi-
larity. This method is also known as “semantic search” because it considers the meaning
and context of words, unlike a basic search using keywords only. The top-k documents with
the highest similarity scores, typically about 3 to 5 text chunks, are selected.
Generation: The chosen top-k retrieved chunks, which we call the “context”, are integrated
into a structured prompt, together with the original query. The LLM uses this context to
generate an answer which is based less on its own internal knowledge than would have
been, and instead, is more grounded in the information in the retrieved texts.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
16.3 Large Language Models 471
All these considerations related to improving the retrieval component of the RAG system.
But let’s not forget the generation step. Here, of course, you should experiment with differ-
ent prompting strategies to enhance results (see Section 16.3.3.1.2). Additionally, the choice
of LLM is important, and not only from an answer quality perspective. Several factors should
be taken into account, such as whether the model is open source (free, with potential host-
ing costs) or closed source (payment based on output tokens in solution texts), and, poten-
tially, its multilingual capacities. Most models are primarily trained in English, so using
them for other languages could result in a mix of English and other language outputs. You
also have to decide whether a large, general purpose, probably slower and more expensive
model is necessary or whether a smaller, more specialized one is sufficient.
BLEU. ROUGE, short for Recall-Oriented Understudy for Gisting Evaluation, is a recall met-
ric based on n-gram matches, while BLEU, short for Bilingual Evaluation Understudy, is a
precision measure based on n-gram overlap.
Another approach replaces the original word representation of a text by embeddings. Em-
beddings between ground truth and generated text are then compared using similarity met-
rics such as cosine similarity. Metrics falling into this category include the BERTscore [16]
and MoverScore [17].
Alternatively, specifically trained language models are employed to compare ground truths
with generated texts. The BLEURT score [18], for instance, involves a specially trained
BERT model (see Chapter 17 for details) that outputs a score between 0 and 1, with values
closer to 1 indicating higher similarity between the generated and reference text.
A challenge with these metrics lies in human alignment, that is, how well the score aligns
with human judgments assessing the quality of the generated text. The recent rise of pow-
erful LLMs has sparked the idea that they could potentially replace humans as judges in
determining whether an answer is good or not. Studies indeed showed that when using
LLMs like GPT-4, there is a substantial alignment with human judgment, reaching an eval-
uation agreement exceeding 80%, comparable to human agreement levels [19].
In such an auto-evaluation framework, models are guided with prompts to assess a given
text using evaluation criteria derived from examples and detailed grading rubrics. Emerg-
ing frameworks, including GPTScore [20] and G-Eval [21], output numeric values that score
the generated text based on ground truth. Another notable technique is SelfCheckGPT,
where text scoring is performed without the need for ground truths [22].
Dedicated frameworks designed for evaluating RAG systems combine traditional metrics
with powerful LLMs, using specifically tailored metrics. This facilitates quick and user-
friendly evaluation of RAG performance. Noteworthy frameworks include TruLens, Ragas,
and DeepEval. One example is the RAG triad, a set of three metrics developed by TruLens,
which assesses the interplay between the query, retrieved chunks (recall that this is known
as the “context”), and generated response. It consists of the following metrics:
Answer Relevance: This metric evaluates the pertinence of the model’s answer to a specific
user question. Using an LLM, several questions are regenerated based on the generated
answer. These questions are then compared with the original question using cosine similar-
ity. The answer’s relevance is computed as the mean of all tested pairs, with higher cosine
similarity indicating greater relevance.
Context Relevance: In this metric, the retrieved context is split up into single sentences,
and each sentence is judged regarding its relevance to answering the user query. Context
relevance is then the fraction of relevant sentences compared to all sentences in the context.
If all sentences are relevant, the context relevance is 1.
Groundedness: This metric checks whether the model correctly uses facts in the context to
generate an answer, distinguishing between correct answers and information made up by
the model (hallucinations). The answer is deconstructed into a set of claims, which are cross
checked with the context to see if they hold. The final groundedness score is the fraction of
supported claims relative to all claims, with a value closer to 1 indicating greater factual
alignment with the context information.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
474 16 Generative AI and Large Language Models
A robust RAG triad with high values of individual metrics indicates minimal hallucinations,
ensuring that the output is based solely on the retrieved information. In other words, the
response is as accurate as the information retrieved.
The evaluation of LLMs remains an evolving field, and researchers are actively exploring
novel metrics [23]. Significant contributions are also being made by large projects like
HELM (Holistic Evaluation of Language Models) [24]. HELM aims to comprehensively eval-
uate a diverse range of LLMs, emphasizing a thorough examination of their performance
across various dimensions.
actions such as querying a search engine or accessing external databases. Once the neces-
sary information has been acquired, the process transitions to the observation phase,
wherein the LLM analyzes and integrates the obtained data into its internal knowledge
base. This observation serves as the foundation for initiating a new thought process, as the
LLM devises a new plan of action based on the information from the external world. The
cycle continues with the execution of a new action and the subsequent observation of its
outcomes. This iterative process persists until a satisfactory solution is obtained.
Now that we’ve laid down the fundamentals of agent systems, let’s dive into some specifics
of how they work. Firstly, the inputs (problem descriptions) are provided to the agents
either by humans via computer interfaces, or by other agents through APIs. Thus, one way
to improve agent systems often involves humans adjusting instructions in the prompt to
modify behavior. An alternative direction involves multi-agent systems, where one agent
creates output, and another LLM evaluates the output, providing feedback and acting as a
supervisor to enhance autonomy.
To interact with the world and gather information beyond their internal pretrained knowl-
edge, agents can use tools via calling external APIs or various functions. Agents flexibly
decide which tools to use, such as calculators for calculations, internet browsing for addi-
tional information, or retrieving data from a database, as used in RAG systems. For conver-
sational agents, personas are often employed to guide the agent, with each persona adopting
a specific tone and conversational style, and accessing specific knowledge associated with
an expert persona. Prompt recipes further simplify the process for specific agents.
Memory is another critical component. Short-term memory tracks current actions and con-
versations, integrating this information into the context window to guide agents in subse-
quent conversation turns or actions. Long-term memory may provide access to content from
previous conversations or actions stored in a database.
Recent frameworks such as LangChain and LlamaIndex simplify the creation of agents for
users. These frameworks offer prebuilt agents with implemented prompting tools, memory
capacity, and a list of available tools, making it easier for users to leverage the power of
agents.
Customizing Large Language Models is essential to tailor their outputs to the specific needs
of users, customers, and tasks they aim to solve. There are several approaches to achieve
this customization. One method involves using prompting, where users provide specific in-
structions or queries to guide the model’s responses (as we saw in Section 16.3.3.1). Addi-
tionally, LLMs have tunable parameters that can be adjusted to influence the statistical
methodology used in text generation. Moreover, the model weights themselves can be fine-
tuned with custom datasets to further adapt the LLM to specific requirements.
We’ll begin our exploration of customization with a focus on parameters that users can di-
rectly modify to influence an LLM’s output, before moving on to exploring fine-tuning tech-
niques and Reinforcement Learning from Human Feedback.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
476 16 Generative AI and Large Language Models
16.3.4.1 Sampling
As mentioned earlier, LLMs predict the next word in a sequence by sampling from learned
probability distributions based on the preceding text. However, the manner in which this
sampling occurs can be adjusted using various parameters, which significantly influence
the characteristics of the generated text.
Practically speaking, modifying these parameters can lead to contrasting outcomes in the
generated text. On one hand, adjusting the parameters to make the model more determinis-
tic ensures that given a specific context, the model consistently predicts the same (or almost
the same) sequence of words. While this approach results in responses that are factual and
coherent, it may lack the creativity and spontaneity that users often prefer.
Conversely, altering the parameters to increase creativity can result in highly imaginative
text. However, this may come at the cost of coherence and accuracy, as the model generates
responses that are more varied and potentially less grounded in the context provided. While
this approach offers users a more diverse and engaging experience, it may also lead to out-
puts that are less reliable or relevant to the given task.
The parameters that can be adjusted in LLMs are related to the sampling mechanism re-
sponsible for selecting the subsequent word in text generation. Usually, sampling from the
complete probability distribution is avoided due to the inclusion of words with non-zero but
low probabilities. While these words might not have a probability of zero, their incorpora-
tion could result in outputs that appear odd or disconnected to readers.
To address this issue, one effective approach is top-k sampling. This method involves re-
taining only the top k most likely words, while discarding the rest of the probability distri-
bution. The remaining probabilities for these k words are then renormalized to form a
proper probability distribution, from which random sampling is conducted. When k is set
to 1, only the most probable word is chosen. With larger values of k, other words besides the
most probable one can also be chosen, ensuring that valid predictions are made while main-
taining good quality text.
Top-k sampling may not always be the most suitable approach, however, particularly when
the value of k should vary across different word predictions. In such cases, an alternative
method known as top-p sampling, or nucleus sampling, can be employed. This technique
involves selecting the number of most probable words in such a way that their cumulative
probabilities exceed a predefined threshold. The required number of words to reach this
threshold can vary depending on the context, providing greater flexibility in word selection.
Another effective strategy for sampling is adjusting the temperature parameter. Drawing
an analogy from dynamical systems, increasing the temperature allows for the exploration
of a broader range of potential states, while decreasing the temperature confines the system
to a small number of states. Applied to text generation, higher temperature increases the
probabilities of less probable words and reduces the probability of more probable words,
resulting in a flatter distribution. This broadens the range of probable words chosen as the
next word. Conversely, lowering the temperature increases the probability of selecting more
probable words, at the expense of less probable ones.
Technically, adjusting the temperature parameter involves dividing the logits of the LLM
prediction for the next words. When the temperature is less than 1, all the logits are in-
creased, placing greater emphasis on larger probabilities and making text generation more
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
16.3 Large Language Models 477
Figure 16.2 LLM Customization Methods. This figure illustrates various approaches for customizing
LLMs. In the embedding approach (left branch), an LLM’s output embeddings are employed to train
the parameters of conventional classifiers like Support Vector Machines (SVM) or Random Forest
models. In transfer learning (right branch), additional neural network layers are stacked on top of
an LLM, and only their weights are fine-tuned. Full fine-tuning and Parameter-Efficient Fine-tuning
(PEFT) methods involve updating the weights of an LLM’s Transformer blocks as well.
eters, while keeping the pretrained parameters frozen. Notably, such strategies only involve
training a relatively small number of parameters, resulting in a fast, cost-effective approach
that still yields significant performance improvements.
Full Fine-tuning
The third set of strategies also change the pretrained weights. One prominent method is full
fine-tuning, where the weights of all LLM layers are updated. This approach often leads to
superior performance compared to other strategies, as it allows the model to adapt compre-
hensively to task-specific requirements. However, the advantages of full fine-tuning come
with notable drawbacks, particularly in terms of computational costs and memory utiliza-
tion. The training process demands considerable time and poses a significant challenge for
GPU memory, often resulting in “Cuda out-of-memory” errors on consumer hardware.
Various factors contribute to the substantial increase in memory requirements during train-
ing. These include the model weights, optimizer states such as momentum and gradient
variance, gradients, and activations during the forward pass of the model. Comparing the
memory usage during training to that during inference alone provides a noticeable contrast.
In certain cases, the memory demand during training can be up to 20 times more than that
required for model weights alone [30].
Another critical consideration in the context of full fine-tuning is a phenomenon known as
“catastrophic forgetting,” which refers to the model’s tendency to overwrite previously
learned information during continual fine-tuning [31]. Catastrophic forgetting underscores
the challenge of balancing adaptation to new data with the preservation of previously ac-
quired knowledge.
Parameter-Efficient Fine-tuning
An alternative to the resource-intensive full fine-tuning approach is a collection of methods
termed “Parameter-Efficient Fine-tuning” (PEFT) [30], where only a limited set of param-
eters within a pretrained model are trained. This strategy strikes a balance between model
adaptation and computational efficiency, and different PEFT methods vary in terms of which
parameters are tuned, the amount of saved memory, training speed, and the ultimate qual-
ity of the model.
Among the PEFT methods, the additive PEFT approach involves introducing new parame-
ters or layers to the language model and selectively training only these newly added compo-
nents. Several popular techniques fall under the additive category:
Adapters: This method involves adding fully connected layers after the attention and feed-
forward layers [32]. It has demonstrated promising results, achieving performance levels
comparable to those obtained through full fine-tuning of BERT.
Soft Prompts: This method plays a crucial role in tackling a significant challenge faced in
few-shot learning scenarios. In situations where the availability of training examples is
limited, the use of raw text, referred to as “hard prompts,” can pose a disadvantage. Soft
prompts, on the other hand, offer a remedy by introducing a numerical representation—
essentially a string of numbers resembling embeddings—into the context window. These
tensors are subsequently updated through gradient descent, providing an effective solution
to enhance the learning process in scenarios with sparse training data.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
480 16 Generative AI and Large Language Models
Soft prompts have different implementations, each presenting unique advantages in model
refinement. In the context of Prompt Tuning [33], input tokens are transformed into em-
beddings that are subsequently appended to the input embeddings of the LLM. This merged
input is then processed through the model, with exclusive updates applied solely to the soft
prompts during training. This approach excels in learning accurate soft prompts tailored to
each specific task or labeled dataset, all without necessitating modifications to the pre-
trained model weights. This technique performs closely to full fine-tuning, proving particu-
larly powerful when applied to larger models boasting at least 10 billion parameters, where
increased model size corresponds to enhanced performance.
Another notable soft prompt strategy is Prefix Tuning [34], an extension of the Prompt
Tuning concept. In this methodology, embeddings are appended not only to the input em-
beddings but also to all layers of the model. To maintain stability, updates to these appended
embeddings occur indirectly through feedforward networks, which are discarded after
training. Prefix Tuning achieves performance levels close to those of full fine-tuning while
utilizing only 0.1% of the parameters. A noteworthy variant of Prefix Tuning is the popular
LLaMA adapter [35].
IA3: The IA3 approach (Infused Adapter by Inhibiting and Amplifying Inner Activations)
[36] adds trainable vectors into three essential components of the LLM architecture: the key,
value, and feedforward networks of Transformers. These vectors are updated and rescale
the value, key, and feedforward network outputs. IA3 achieves improved performance met-
rics, surpassing the outcomes obtained through the Lora methodology (see below).
In the selective PEFT approach, only a small subset of pretrained parameters is updated,
leading to reduced computational costs, faster training speed, and efficient memory utiliza-
tion.
Another set of techniques are reparameterization methods that alter the dimensionality
of weight matrices within a model. One prominent framework that capitalizes on this con-
cept is Lora [37], short for Low Rank Adapters. In this framework, the weight matrix is
decomposed into two low-rank matrices in such a way that their product returns the origi-
nal matrix. In the Transformer architecture, this technique is specifically applied to key and
value weight matrices. Following the training phase, the updated matrix is combined with
the original weight matrix and used for inference. Lora’s approach offers a high degree of
training flexibility, enabling the retention of the original matrix while separately training
new matrices for each distinct use case.
Finally, hybrid approaches combine elements from various PEFT methods, offering a cus-
tomized and adaptable fine-tuning process.
PEFT methods can deliver improved efficiency compared to full fine-tuning, yet they still
demand significant hardware resources, particularly, robust GPU power. To enhance the
hardware efficiency of LLM training, two additional steps have been introduced: quantiza-
tion and zero redundancy optimization.
Quantization is the process of modifying the memory footprint of model weights by adjust-
ing their numerical precision [38]. By default, model weights are typically stored with 32-bit
float precision, which consumes a significant amount of memory. The fundamental idea
behind quantization is to decrease precision, thereby minimizing memory usage, potentially
down to float16 or even 8-bit integers. While this strategy effectively reduces the memory
footprint, it comes with a drawback: a loss of precision in the model. A more recent variant
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
16.3 Large Language Models 481
A binary classifier — often a smaller language model, like BERT — is then trained based on
this data. This classifier determines whether a response is preferred by humans (positive
class, “helpful”) or not (negative class, “not helpful”). This binary model effectively replaces
human feedback in the training of the foundation model.
During the final step, the reward for training the LLM is derived from the output (logits) of
the positive (helpful) class. In other words, the reward model, based on human rankings,
guides the fine-tuning process of the LLM by providing a clear signal for what is deemed
preferable or beneficial according to human evaluators.
To initiate the fine-tuning process, a prompt is fed into the model, and a response is gener-
ated. Both the prompt and its corresponding completion are then presented to the reward
model. The reward model outputs a reward value, equivalent to the logit of the positive
class. It is important to note that a higher logit or reward signifies better alignment with
human preferences. The obtained reward value serves as the key input for updating the
weights of the LLM, employing a reward learning algorithm. Among the various options
available, proximal policy optimization (PPO) stands out as a popular choice. This algo-
rithm systematically adjusts the model’s parameters based on the received rewards, guid-
ing the LLM towards more desirable outputs.
This fine-tuning step is iteratively repeated multiple times, with each iteration leading to an
increase in the reward. The process continues until a predefined threshold is reached, rep-
resenting the desired alignment of the model with human preferences. This iterative refine-
ment ensures that the LLM not only learns from the reward model, but also progressively
enhances its performance to meet the specified criteria, ultimately producing outputs that
are more accurate, helpful, and aligned with human expectations.
We began this chapter by describing different modalities for Generative AI, including text,
images, video, and more. Clearly, this technology can be incredibly useful in a wide range of
domains, for an even wider range of applications. And yet, it is anything but perfect. Vulner-
abilities and limitations include:
Prompt Injection and Jailbreaking Attacks
Bias
Questions of Copyright
Hallucination
These factors are especially important to remember, considering that implementing a Gen-
erative AI project is likely to affect much more of a company’s tech stack than, say, deploy-
ing a new Machine Learning model (assuming the company is using ML already). Thus, we
will now detail each of the above factors in turn. Knowing such limitations in advance will
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
484 16 Generative AI and Large Language Models
help data scientists and data- and ML-engineers determine whether Generative AI is appro-
priate for their problems or proposed applications, and if so, to build robust, effective solu-
tions using this technology.
In this section, we’ll present some current and new types of cyber-attacks which have been
leveled against Generative AI models. We’ll follow this with some best practices for defend-
ing applications against them.
“CVID” instead; this may well bypass any system filters, leaving the model free to inter-
pret the word (which it would likely do correctly, since LLMs are quite robust to typos and
similar noise) and respond to the request.
A Code Injection attack aims to get a system to run a potentially harmful piece of code;
it attempts to do this by having an LLM either evaluate the code directly or pass it to an-
other tool for execution. As LLMs are given more code functionality, and more and more
LLM-based applications (such as agent systems) are built to include the use of code
plugins, the potential consequences of such an attack are sadly increasing. And while
many applications include safety filters to prevent such attacks, they can still be vulner
able to Payload Splitting, wherein the malicious prompt is split into parts and the LLM
is tricked into combining them and acting on the results. For example, the LLM is told
it should act like a computer program which receives three variables, which it should
concatenate and print. It is then given three variables which, when combined, spell out a
forbidden string or piece of code. The instructions and variables may be split across mul-
tiple prompts in a chat conversation, such that no single input prompt triggers any filters
or other safety mechanisms.
Multi-modal attacks target Generative AI models which can handle different input data
modalities. Bad actors can, for example, take an image of a clearly recognizable object and
overlay it with text telling the model to ignore what it sees and return some other re-
sponse: an offensive message, a misleading object label, or even a harmful piece of code.
Such text can even be hidden in plain sight: almost-white text on a pure-white background
is tough for a human to notice, but easy for a multi-modal Gen AI model to read and exe-
cute.
Fine-tuning, as we’ve already seen, is the process of performing additional training with
a foundation model to update its vocabulary or behavior to suit a specific use case. Adver-
sarial fine-tuning, then, involves fine-tuning a model with training samples demonstrat-
ing behaviors which go against what was intended by the original model creators. Re-
searchers have found that fine-tuning a model using even just a few adversarial examples
can be enough to break its behavioral guardrails [40].
The consequences of prompt injection and jailbreaking attacks are wide-ranging and poten-
tially very serious. For example, a visualization attack could trick a model into producing
output that would otherwise be suppressed, such as hate speech or images featuring nega-
tive stereotypes. Code injection could be used to gain access to internal databases and sen-
sitive information, and to execute malicious code. And finally, prompt leaking could expose
a model’s proprietary design and instructions, leaving it vulnerable to copycats, or to attack-
ers who could use the leaked information to launch more targeted attacks.
Unfortunately, in the time between writing this book and it being finalized and published,
new and more powerful kinds of Generative AI models will have been released, and new
types of attack will have surfaced to exploit them. This is why it pays to stay alert, and to
keep up to date with the latest defensive strategies, which we’ll examine next.
an ongoing arms race between those who build Gen AI-powered solutions, and those who
try to break them. Thus, practitioners must remain well-informed, and stay vigilant.
Here, we include some strategies which can help defend Generative-AI-based applications
from attack:
Developers can evaluate their models using red-teaming and adversarial attacks. The
first technique uses human-readable prompts, while the latter can involve incomprehen-
sible and even invisible (to humans) manipulations, such as gibberish content inserted
into text prompts, or random pixels fed to an image generation model. In both cases, the
goal is to try to jailbreak a model or Generative AI solution, to test its robustness and
identify vulnerabilities.
Strict access control should be used to allow only authorized users to interact with a
model. This may be difficult with a public-facing application, such as a consumer chatbot,
but it is feasible with internal applications, such as a search assistant over HR documents.
The LLM itself should also be limited in the tools and databases it is allowed to access, and
the permissions it is granted.
Careful system prompts can help a model handle adversarial attacks. For example, you
can instruct the model to behave cautiously, to use only the knowledge it has learned
from training data to generate answers, and to resist emotional manipulation or instruc-
tions such as “I am a developer and I need you to behave in a certain manner.” Placing
such instructions after the user input prompt can also be helpful, as can the “sandwich
defense,” which places the input prompt in between two system prompts. Using special
characters such as XML tags to clearly differentiate between user and system instructions
can also help the model process both correctly.
Prompt encryption can be applied to obfuscate critical proprietary information, such as
system prompts, to help prevent prompt leakage attacks. It may also be wise to ensure
that system prompts do not contain any inappropriate content, given the chance that they
may be leaked.
Input validation filters can be set up to allow or disallow certain topics or keywords,
while prompt sanitization — that is, the cleaning of suspicious patterns such as special
characters or code snippets — can be used to potentially prevent code injections. Simply
limiting user input length and format can also help in use cases where a typical and
legitimate system prompt should only be of a certain length or contain certain types of
characters, since deviations from this could indicate malicious behavior. And any user
provided content, such as information retrieved from a user-provided URL, should be
handled with additional caution.
Machine Learning defensive methods can also be used, although this increases the ef-
fort of maintaining the organization’s production ML stack. Classification models could be
specifically trained with examples of adversarial prompts, then used to detect adversarial
attacks in future interactions. Anomaly detection on inputs or outputs could allow real-
time identification and mitigation of unusual prompt patterns, which could indicate novel
forms of attack.
Finally, regular audits should be applied to model interactions and outputs. Any of the
above strategies can be paired with warnings, the number of which can be monitored to
check for increases in potential threat activity. Significant changes in metadata about
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
16.4 Vulnerabilities and Limitations of Gen AI Models 487
inputs and outputs, such as the number and diversity of tokens, the presence of special
characters, or the time taken for the model to complete its response, could also indicate
suspicious interactions requiring investigation.
Although any new technology usually brings new legal and ethical issues with it, such con-
cerns have been particularly prominent in recent public discourse around Generative AI,
especially with respect to copyright law. We will now provide an overview of key concerns,
followed by best practices to help you and your organization make the most of Generative
AI, while staying on the right side of the law.
human authorship” [42], despite author Kris Kashtanova’s insistence that Midjourney had
merely been used as a tool of creative expression.
Recent, similar cases and juridical statements indicate a predominant legal attitude that
AI-generated works are not copyrightable. However, this is likely to be challenged more and
more in the coming years, and may eventually shift. This would give rise to the question,
who owns the copyright on AI-generated works?
number of recent, ongoing copyright lawsuits can help us understand just how difficult this
can be.
First, we have the suit by three artists against Stability AI, Midjourney and DeviantArt,
which claims that these companies violated millions of artists’ copyrights by using the art-
ists’ works to train their Generative AI models without permission [44]. Second, we have
multiple lawsuits by various news publications [45] and authors [46] against OpenAI, Mic-
rosoft, and Meta. The allegations in these cases include unfair competition, unjust en
richment, vicarious copyright infringement (that is, to know about and benefit from an in
fringement), and violation of the Digital Millennium Copyright Act by removing copyright
management information. In all of these cases, the defendants (that is, the AI companies
being sued) have leaned heavily on the defense that their research is “transformative”, and
thus, “fair use”. So, we can start to understand the complexity of the fair use doctrine, by
first summarizing arguments against these companies, followed by those in defense of
them.
Beginning with arguments against these Gen AI model providers, let us examine some facts
about Stable Diffusion. It was trained on a dataset of image links and their alt-text descrip-
tions, scraped from the internet without necessarily obtaining consent. The dataset could
possibly be considered protected under fair use, due to its non-profit, research nature (it was
created by German non-profit LAION, short for Large-scale Artificial Intelligence Open Net-
work), and the fact that it does not store the images themselves. However, the plaintiffs (that
is, the accusers in the case) argue that Stability AI created unauthorized reproductions of
copyrighted works, by downloading the images for training [47]. In other words, the argu-
ment against the company relates to its unauthorized use of a possibly-otherwise-permis
sible source. We can see a parallel issue in the lawsuits against OpenAI: although AI re-
searchers have been using large datasets of publicly crawled text data for years, OpenAI are
accused of conducting infringement by removing copyright owner information, such as
authors and titles, from their training data [46].
Another problem for Stability AI is that their model can recreate existing expressions and
styles with high accuracy, which could constitute so-called “unauthorized derivative works.”
This is a huge concern for creators, who fear that such models will be able to out-compete
them in their own game. In the case against OpenAI, the company was accused of unfair
competition on similar grounds [46]. It is also difficult for any of these model providers to
claim their work exists purely for research purposes, given that they allow commercial ap-
plications of their models, including Stability AI’s DreamStudio app, and OpenAI’s ChatGPT.
Turning now to arguments favoring Stability AI and OpenAI: the former has defended the
creation of copies of images for training, saying that this technical requirement is, in prin-
ciple, no different to humans learning and taking inspiration from existing material. They
also argued that their model does not memorize training images, but instead, uses them to
learn general features about objects—such as outlines and shapes— and how they relate to
one another in the real world [48] (Chapters 12 and 17 illustrate this type of learning, albeit
for Convolutional Neural Networks, rather than diffusion models).
Stability AI have also claimed that Stable Diffusion does not create derivative works, given
that a reasonable person usually cannot tell which images, if any, contributed to a specific
output: a condition courts have historically used to determine whether a work is derivative.
In the case involving authors against OpenAI, the plaintiffs argued they shouldn’t have to
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
16.4 Vulnerabilities and Limitations of Gen AI Models 491
prove derivative use of their works, if they could simply prove their works were in a model’s
training data. They argued that if a model is trained on protected data, then all its works are
derivative. The judge, however, dismissed that line of argument. A final point in Stability
AI’s favor is that style itself is not copyrightable — only specific, concrete expressions are.
If judicial bodies accept these arguments, then they may consider the act of using a copy-
righted work in training as sufficiently transformative, since it results in a productive new
model. However, new regulations will likely still be required, giving creators ways to have
their creations removed from AI training datasets.
16.4.5 Bias
without specifying the required gender, it will be more likely to write about imaginary male
figures than female ones. Similarly, if asked to translate text about people from a non-gen-
dered language like English to a gendered one like German, it will be more likely to assign
the masculine article to male-dominated roles.
Unfortunately, the culture which led to these real-world statistical tendencies is not always
fair. Some people are treated differently than others, based on characteristics like their
gender, ethnicity, or even age and hair color, and this can put them at a disadvantage. This
is true even when people don’t intend to treat them differently, or to do any harm by it. Thus,
when a Gen AI model learns such patterns, it may be accurate, but it may also be unfair.
Of course, this issue is not unique to Generative AI. Any kind of Machine Learning model,
if trained on statistically biased data, may learn to reproduce those biases. Yet the speed and
scale of adoption of Generative AI, and especially the use of LLMs as the foundation of
countless algorithms and applications, makes the topic even more important here. There is
a worrying potential for us to multiply the negative impacts that biases can have. For exam-
ple, using biased text-to-image models to generate depictions of criminals from witness
statements can perpetuate real harms, if certain types of people are stereotypically depicted
as criminals (as research suggests they are [49]). Similarly, an LLM-powered resume screen-
ing tool which preselects promising candidates might unfairly prefer Ivy-League educated
applicants over others, not based on merit, but on an existing bias towards such applicants
in the company’s existing staff.
So, what can be done to tackle this issue? To understand this, let’s examine potential types
of Gen AI bias, followed by practical methods to mitigate them.
In both cases, if such data is published and then used as training data for future models,
those models may eventually become flooded with false or misleading information. The
models may learn to reproduce this information with even greater frequency and confi-
dence, which can exacerbate existing social divisions and misunderstandings.
Automation bias, which is the phenomenon in which humans are overly trusting of ma-
chine decisions and outputs, could make this worse. It may prevent people from properly
questioning biased or simply inaccurate outputs, which is particularly problematic for
LLMs, given their tendency towards confabulation, as we discussed in Section 16.4.3.
Another issue which is particularly worrying to researchers is cyclical, self-fulfilling bias.
Generative AI models produce the most statistically likely outputs for a given user request.
This means that as more and more AI-generated content — even that with high quality and
accuracy — is published and picked up by future models, and their generated content is
published and picked up by models even further in the future, AI generated content could
converge towards a kind of average. Diversity and creativity could be lost.
■ 16.5 B
uilding Robust, Effective Gen AI
Applications
16.5.1 Control Strategies Throughout Development and Use
Generative AI models have the power to produce content at an unmatchable scale and speed.
They are also being built into diverse applications through the provision of connected tools:
for example, in agent applications, Large Language Models (LLMs) are configured to be able
to use tools to complete complex tasks (as was covered in 16.3.3.3). This means we are
entrusting them with great power over our systems. And yet their black box, stochastic
nature means we don’t know how they work, nor can we understand or anticipate their
outputs. Why is that a problem?
There are many reasons developers of Generative AI-powered systems should care about
controlling those systems’ outputs. First, there is a moral responsibility to ensure that out-
puts are free from the potential harms that can arise from biased and false information.
Then there are legal responsibilities. For example, if you are building an application based
on an LLM from OpenAI, the company provides a “Moderation Endpoint,” which will return
probabilities that the model output for any given prompt contains undesired topics, such as
violence. It’s not mandatory to use the Moderation endpoint, and developers are free to de-
cide whether and how their system should react if an undesirability threshold is reached.
However, not using this OpenAI endpoint might violate its terms of service. Finally, there
are business reasons to control Generative AI model outputs. False information, or content
which strays from the official tone of the brand, can be damaging to a company’s reputation.
It may also lead to exclusion from certain marketplaces, as was the case for email app
BlueMail: it was blocked from releasing updates in the Apple app store until its creators
chose to either apply content moderation or restrict the app to over 17-year-olds.
So, what output control mechanisms can developers turn to? A number of options are avail-
able, at different stages of the lifecycle of developing a Gen AI-powered solution: during
model creation and fine-tuning; when giving instructions (inputs) to the model; and when
the model generates an output. No single method is perfect, as we will see, and hence, a
combination of techniques is recommended.
The first options apply during the creation of the model itself. It’s important to gather high
quality training data and sanitize it of any low-quality or objectionable content. Due to the
effort required to identify content for removal, this is often done automatically, using, for
example, hate-speech detection models. These are unlikely to be able to detect 100% of toxic
content in a set of potential training data, which means some harmful examples may still
slip into the final training dataset. This isn’t necessarily a disaster, since overzealous data
cleansing could leave downstream models under-exposed to the types of data they are likely
to encounter in production, resulting in poor performance. However, it does mean that addi-
tional control strategies will be required.
The next opportunity to influence Generative AI model outputs, then, is when fine-tuning
a pretrained model for a specific task. This may entail supervised fine-tuning using in-
put-output pairs which demonstrate expected behaviors, or it may include Reinforcement
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
16.5 Building Robust, Effective Gen AI Applications 495
Learning with Human Feedback, in which a model is trained to produce outputs that are
correct and aligned with human preferences. Both these training paradigms (which were
covered in detail in 16.3.4), use human input to steer models towards desired outputs. This
means developers’ own morals, preferences and ideals may be encoded into the model,
which is why it’s important to make sure those preferences are beneficial to the full spec-
trum of society who will be exposed to that model’s outputs: whether directly or indirectly
through some other application, and whether they’re aware of it, or not. Finally, note that
fine-tuning is not infallible, as it will never be able to cover the full range of scenarios a
model will face in production.
We can also exert some control over Generative AI model outputs in the way we deliver in-
structions to them. When we create solutions which feature an open-source or commercial
Generative AI model, we can use prompt design to control its behavior. Typically, we do
this by creating a system prompt telling the model how it should handle the end user’s input
prompt. For example, we can attempt to reduce hallucination in LLMs by telling the model
not to make any statements for which it has no proof. Or, we can instruct an image genera-
tion model to create more culturally and racially diverse outputs.
Another control strategy we can apply during the model input stage is to validate user
input, and filter malicious prompts before they have a chance to be executed. Lists of
allowed or blocked keywords, and classification models which detect harmful prompts, may
both be useful here. However, bad actors will still attempt to figure out how to write mali-
cious prompts which will go undetected by such methods. Thus, the third stage of con-
trolling Generative AI systems is to carefully handle their outputs.
Usually, content moderation on model outputs is done using classification models, which
attempt to detect whether the output depicts (either visually or textually) toxic content, such
as hate speech, self-harm, sexual content, or violence. Thus, even if a user deliberately
prompts a model to generate such content, and it complies, this can be detected and the
model response can be hidden from the user (usually it is replaced with a warning that the
model is not allowed to complete that kind of request). It’s important to remember that no
content classification model will ever be perfect, and may contain biases unknowingly built
in by its creators. Allowing the moderation model to be examined by diverse parties can
reveal these issues, such as over-zealous flagging of vocabulary words which are popular
among marginalized communities. The classifier can even be trained using data which has
been labeled for toxicity by diverse stakeholder groups, who would likely have different
ratings for the same piece of content. When the Generative AI model is later deployed, the
content classifier’s ratings can be weighted to suit the needs of a specific group, if required.
16.5.2 Guardrails
A number of the techniques listed above could be classified under the generic term, “guard-
rails.” Guardrails monitor bidirectional communication between model and user, and apply
control flows to ensure the system behaves as is desired, and cannot be abused. For exam-
ple, we saw that topical guardrails constrain the subjects which users can ask about and
models can discuss or depict. Safety guardrails ensure that output is not offensive or bi-
ased, and does not contain harmful misinformation. However, since offensive topics and
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
496 16 Generative AI and Large Language Models
false information can be difficult to detect, developers sometimes fall back to topic guard-
rails. For example, they may prevent discussions about vaccinations, or append a canned
response to an LLM output, warning users of the potential for inaccuracy. Diversity guard-
rails may be included to force models to be more representative of different demographic
groups, and tackle certain AI biases (as we discussed in section 16.4.3, below). F inally,
security guardrails can be used to prevent models from generating inaccurate, poor qual-
ity, or malicious output (as we saw in 16.4.2).
So far in this chapter, we’ve highlighted the power and diverse possibilities Generative AI
has to offer. We’ve also discussed its limitations, and the need to carefully control model
outputs. Understandably, many organizations are both excited and apprehensive about get-
ting started. Thus, we will conclude this chapter with some positive, practical tips for build-
ing responsible Generative AI-based solutions: solutions which will help keep your organi-
zation’s data and reputation safe, while limiting undesirable consequences, such as AI bias.
Use Generative AI in low-stakes applications: This means considering the potential im-
pacts of your Generative AI application, and deciding whether a nascent, opaque technology
is appropriate. A financial or medical advice chatbot is probably too high risk, for example,
whereas a product recommendation system based on shopper preferences and past user
reviews might be acceptable. Use cases which will require sensitive data, or interfacing
between an LLM and other tools, need to be treated with extra scrutiny.
Design your architecture according to the level of risk: If your use case could have seri-
ous consequences for people, such as a tool for screening and shortlisting job applicants
based on their CVs, then you’ll need maximum explainability and no hallucinations. This
will likely require a RAG architecture, where the model is only allowed to generate answers
based on provided input documents, and must indicate which source passages it used. For
other impactful use cases, fine-tuning on specific data, to ground the model’s behavior and
world knowledge appropriately, may also be required.
Include a human in the loop: As developers of Generative AI-powered systems, we have an
ethical responsibility to care about the correctness and appropriateness of those system’s
outputs. It also makes business sense: you wouldn’t want a fully automated social media
content generator to go completely off the rails and damage your organization’s reputation,
would you? Fortunately, you don’t need to fully outsource a workflow to a Generative AI
model in order to gain value. Use the technology to automate the easy, boring, repetitive
steps, and have a human perform quality and fact checks on the output. Stick to use cases
where you and your organization have the required expertise. You can even treat Generative
AI models as co-pilots, sparring partners, or brainstorming buddies, while controlling and
curating the outputs using your own knowledge or creativity.
Follow good data practices: If you are building or fine-tuning a model, or preparing a doc-
ument index, take a critical look at how the data you’re using was gathered and processed,
and whose perspectives and needs it represents. Look out for potential blindspots and biases
and take action to remedy them, such as sourcing additional data or a more diverse range of
stakeholders to label the examples you have. Be additionally careful about using copyrighted
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
16.6 In a Nutshell 497
material, both from a moral perspective and given current uncertainties about what is per-
mitted by law. If you are interacting with an existing model, be it open-source or enterprise,
be careful about what information you give it, where it’s processed, and whether it will be
stored and potentially used to train new models. Be familiar with the model’s terms of use,
set any privacy controls to the values your organization requires, and understand what
protections — if any—are given to your data.
Build data literacy within the organization: Educate employees on the potential and lim-
itations of Generative AI, and on these principles of responsible use. If you allow or even
encourage employees to use Generative AI tools in their daily work, then provide training
on how to use the tools most effectively, and how to identify genuinely valuable use cases.
Set up systems for sharing effective prompt templates, so that employees can benefit from
each others’ prompt engineering efforts. Finally, create clear guidelines on how employees
can and should use Generative AI, based on the company’s values, and the terms of use of
any tools the company subscribes to.
Behave responsibly: Generative AI models shouldn’t be used to generate content that is
harmful, misleading, or spam. That much you’ve hopefully already decided for yourself. You
might also have spotted the opportunity to use Generative AI to drive positive impacts, such
as creating content which rejects harmful stereotypes, rather than reproduce them. One
final consideration for the responsible use of Generative AI, is that it shouldn’t be used for
the sake of it. Not everything needs to be solved with this technology. For example, if the
text or image you need can be found with a simple Google search or by browsing a free stock
photo website, then do it! It will likely be faster, and use a lot less energy (a large image
generation model, for example, can require as much power as charging your smartphone,
just to generate a single image [50]). Even if you are certain that your use case requires
some kind of AI or machine learning, there’s a good chance that an approach already exists
which is more efficient, less prone to hacking or copyright concerns, and just as effective.
■ 16.6 In a Nutshell
How Large Language Models are built, and how they can be described
and categorized
LLMs are pretrained on vast amounts of language data to be able to predict
missing words in a sequence. This provides them with foundational knowl-
edge of language, which is further refined with instruction fine-tuning and re-
inforcement learning based on human preference judgements. This makes
them a dept at completing human requests in a broad range of tasks and do-
mains.
LLMs can be compared along various axes, including training regime, size (in
number of parameters), performance on benchmark tasks, and whether they
are open- or closed-source.
Prompt engineering best practices
When you want to get the best out of a Generative AI model, remember to ask
the question you actually want answered, and to guide the model through the
task. A number of best practices can help you do this, and it pays to practice
them, and experiment, on a variety of tasks, domains, and different models.
Common design patterns applying LLMs
Many people, including development teams hoping to build Gen-AI powered
applications, start with directly interacting with chatbots using clever prompt-
ing. When the LLM is limited in task specific knowledge, it can be provided
with additional documents containing tasks- or domain-specific information,
in what’s known as a “Retrieval Augmented Generation” architecture. An
alternative or additional option is to create an “LLM agent”, which is able to
break down complex tasks and orchestrate additional actions using tools,
accessed through, for example, code functions or APIs.
Customization techniques to improve LLM performance on your specific
use case
Many LLM providers offer tunable parameters which control how the LLM
samples from word distributions. By experimenting with these, you can make
the model more or less predictable, which can help you balance creativity
with reliability and adherence to guidelines like tone and style. You can gain
even more control by fine-tuning, either adapting all of the model’s weights,
or applying a parameter-efficient approach.
Gen AI models aren’t perfect
Despite their obvious usefulness, Generative models still suffer from various
limitations, including prompt injection and jailbreaking attacks, unintentional
biases, and hallucination: the accidental outputting of false or misleading
information. Questions of copyright also abound, such as whether AI-generated
works are copyrightable, who would own the copyright if so, and whether
copyrighted can data be used for training. We outlined some best practices to
tackle all of these limitations, but nevertheless, it pays to stay up to date in
this rapidly evolving field.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
References 499
References
[1] Bengio, Y., Ducharme, R., Vincent P., Jauvin, C., A neural probabilistic language model. JMLR.
(2003) 3, pp. 1137–1155
[2] Pham, N. Q., Kruszewski, G., Boleda, G., Convolutional Neural Network Language Models. Proceed-
ings of the 2016 conference on empirical methods in natural language processing (2016), pp. 1153–
1162
[3] Mikolov, T., Karafiát, M., Burget, L., Cernocký, J., Khudanpur, S., Recurrent neural network based
language model. Interspeech (2010) 2:3, pp. 1045–1048
[4] Sundermeyer, M., Schlüter, R., Ney, H., (2012, ). LSTM Neural Networks for Language Modeling.
Interspeech (2012), pp. 194–197
[5] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., Polosukhin,
I., Attention is all you need. Advances in neural information processing systems (2017), 30
[6] Li, Y., Wei, F., Zhang, C., Zhang, H., Eagle: Speculative sampling requires rethinking feature un-
certainty. arXiv preprint, 2024, arXiv:2401.15077
[7] Gu, A., Dao, T., Mamba: Linear-time sequence modeling with selective state spaces. arXiv pre-
print, 2023, arXiv:2312.00752
[8] Jiang, A. Q., et al., Mistral 7B. arXiv preprint, 2023, arXiv:2310.06825
[9] Gui, J., Chen, T., Zhang, J., Cao, Q., Sun, Z., Luo, H., Tao, D., A survey on self-supervised learning:
Algorithms, applications, and future trends. arXiv preprint, 2023, arXiv:2301.05712
[10] Wei, J., et al., Fine-tuned language models are zero-shot learners. arXiv preprint, 2021, arXiv:
2109.01652
[11] Ouyang, L., et al., Training language models to follow instructions with human feedback. Advances
in neural information processing systems (2022), 35, 27730–27744
[12] Parnami, A., Lee, M., Learning from few examples: A summary of approaches to few-shot learning.
arXiv preprint, 2022, arXiv:2203.04291
[13] Luo, J., Li, T., Wu, D., Jenkin, M., Liu, S., Dudek, G., Hallucination Detection and Hallucination
Mitigation: An Investigation. arXiv preprint, 2024, arXiv:2401.08358
[14] Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P., Lost in the mid-
dle: How language models use long contexts. Transactions of the Association for Computational
Linguistics (2024), 12, 157–173.
[15] Lewis, P., et al., Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in
Neural Information Processing Systems (2020), 33, 9459–9474
[16] Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., Artzi, Y., Bertscore: Evaluating text generation
with bert. arXiv preprint, 2019, arXiv:1904.09675
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
500 16 Generative AI and Large Language Models
[17] Zhao, W., Peyrard, M., Liu, F., Gao, Y., Meyer, C. M., Eger, S., MoverScore: Text generation eval
uating with contextualized embeddings and earth mover distance. arXiv preprint, 2019, arXiv:
1909.02622
[18] Sellam, T., Das, D., Parikh, A. P., BLEURT: Learning robust metrics for text generation. arXiv pre-
print, 2020, arXiv:2004.04696
[19] Zheng, L., et al., Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural
Information Processing Systems (2024), 36
[20] Fu, J., Ng, S. K., Jiang, Z., Liu, P., Gptscore: Evaluate as you desire. arXiv preprint, 2023, arX-
iv:2302.04166
[21] Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., Zhu, C., Gpteval: Nlg evaluation using gpt-4 with better
human alignment. arXiv preprint, 2023, arXiv:2303.16634
[22] Manakul, P., Liusie, A., Gales, M. J., Selfcheckgpt: Zero-resource black-box hallucination detection
for generative large language models. arXiv preprint, 2023, arXiv:2303.08896
[23] Guo, Z., et al., Evaluating large language models: A comprehensive survey. arXiv preprint, 2023,
arXiv:2310.19736
[24] Liang, P., et al., Holistic evaluation of language models. arXiv preprint, 2022, arXiv:2211.09110
[25] Wang, L., et al., A survey on large language model based autonomous agents. Frontiers of Computer
Science (2024), 18(6), 1–26
[26] Wei, J., et al., Chain-of-thought prompting elicits reasoning in large language models. Advances in
neural information processing systems (2022), 35, 24824–24837
[27] Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y., React: Synergizing reasoning and
acting in language models. arXiv preprint, 2022, arXiv:2210.03629
[28] Wang, X., et al., Self-consistency improves chain of thought reasoning in language models. arXiv
preprint, 2022, arXiv:2203.11171
[29] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., Narasimhan, K., Tree of thoughts: Deliber-
ate problem solving with large language models. Advances in Neural Information Processing Sys-
tems (2024), 36
[30] Lialin, V., Deshpande, V., Rumshisky, A., Scaling down to scale up: A guide to parameter-efficient
fine-tuning. arXiv preprint, 2023, arXiv:2303.15647
[31] Zhai, Y., Tong, S., Li, X., Cai, M., Qu, Q., Lee, Y. J., Ma, Y., Investigating the catastrophic forgetting
in multimodal large language models. arXiv preprint, 2023, arXiv:2309.10313
[32] Houlsby, N., et al., Parameter-efficient transfer learning for NLP. International conference on
machine learning (2019), pp. 2790–2799
[33] Lester, B., Al-Rfou, R., Constant, N., The power of scale for parameter-efficient prompt tuning.
arXiv preprint, 2021, arXiv:2104.08691
[34] Li, X. L., Liang, P., Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint,
2021, arXiv:2101.00190
[35] Zhang, R., et al., Llama-adapter: Efficient fine-tuning of language models with zero-init attention.
arXiv preprint, 2023, arXiv:2303.16199
[36] Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., Raffel, C. A. Few-shot parameter-ef-
ficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information
Processing Systems (2022) 35, pp. 1950–1965
[37] Hu, E. J., et al., Lora: Low-rank adaptation of large language models. arXiv preprint, 2021, arXiv:
2106.09685
[38] Liu, J., Gong, R., Wei, X., Dong, Z., Cai, J., Zhuang, B. , Qllm: Accurate and efficient low-bitwidth
quantization for large language models. arXiv preprint, 2023, arXiv:2310.08041
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
References 501
[39] Rajbhandari, S., Rasley, J., Ruwase, O., He, Y. , Zero: Memory optimizations toward training trillion
parameter models. In SC20: International Conference for High Performance Computing, Networking,
Storage and Analysis (2020), pp. 1–16
[40] Zou, A., Wang, Z., Kolter, J. Z., & Fredrikson, M.: Universal and transferable adversarial attacks on
aligned language models. arXiv preprint, 2023, arXiv:2307.15043
[41] Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A. C., Korbak, T., & Evans, O., The
Reversal Curse: LLMs trained on ”A is B” fail to learn ”B is A”. arXiv preprint, 2023, arX-
iv:2309.12288.
[42] Brittain, B.: AI-created images lose U. S. copyrights in test for new technology. Reuters, 2023,
https://www.reuters.com/legal/ai-created-images-lose-us-copyrights-test-new-technology-2023-02-22/
[43] Artificial Intelligence and Intellectual Property: copyright and patents. Intellectual Property
Office, 2022, https://www.gov.uk/government/consultations/artificial-intelligence-and-ip-copyright-
and-patents/artificial-intelligence-and-intellectual-property-copyright-and-patents
[44 ]Vincent, J.: AI art tools Stable Diffusion and Midjourney targeted with copyright lawsuit. The
Verge, 2023, https://www.theverge.com/2023/1/16/23557098/generative-ai-art-copyright-legal-lawsuit-
stable-diffusion-midjourney-deviantart.
[45] Robertson, K.: 8 Daily Newspapers Sue OpenAI and Microsoft Over A. I. The New York Times, 2024,
https://www.nytimes.com/2024/04/30/business/media/newspapers-sued-microsoft-openai.html.
[46] Cho, W.: Sarah Silverman, Authors See Most Claims Against OpenAI Dismissed by Judge. The
Hollywood Reporter, 2024, https://www.hollywoodreporter.com/business/business-news/sarah-sil
verman-openai-lawsuit-claims-judge-1235823924/
[47] Andersen v. Stability AI Ltd. casetext.com, 2023, https://casetext.com/case/andersen-v-stabili-
ty-ai-ltd
[48] Cho, W.: Artists Lose First Round of Copyright Infringement Case Against AI Art Generators. The
Hollywood Reporter, 2023, https://www.hollywoodreporter.com/business/business-news/artists-
copyright-infringement-case-ai-art-generators-1235632929/.
[49] Nicoletti, L. and Bass, D: Humans are biased. Generative AI is even worse. Bloomberg, 2023,
https://www.bloomberg.com/graphics/2023-generative-ai-bias/
[50] Heikkilä, M.: Making an image with generative AI uses as much energy as charging your phone.
MIT Technology Review, 2023, https://www.technologyreview.com/2023/12/01/1084189/making-an-
image-with-generative-ai-uses-as-much-energy-as-charging-your-phone/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
17
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
Natural Language
Processing (NLP)
Katherine Munro
1
https://app.inferkit.com/demo
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
504 17 Natural Language Processing (NLP)
sumer trends, while a linguist may use it to research fundamental questions about human
nature and how we understand and use this incredible and unique gift, language.
The second goal – developing language-based technologies – is valuable because people
want to communicate. Of course this is true in our interactions with other humans, in which
case NLP technologies such as voice dictation, machine translation, and real-time confer-
ence captioning are all vital. Yet it is even true with our interactions with other tools. Gone
are the days when we adapted ourselves for “the machines”, such as formulating our web
search queries in a way we thought the browser would understand. Today, we talk to our
mobile phones, our smart home devices, and even our cars, and we expect them to commu-
nicate effectively back. This trend was massively accelerated by the introduction of ChatGPT
and other technologies based on Large Language Models (LLMs, which we covered in detail
in Chapters 15 and 16). Now that people have seen what this technology can do, and have
interacted directly with chatbots and related tools themselves, the expectation of “conver
sational everything” has only been heightened.
So, how do we achieve the two key goals of NLP – being able to analyze natural language
data, and developing language technologies – and gain value from them? One half of the
equation is Natural Language Understanding (NLU), which involves bringing structure to
raw text sequences by extracting certain required details. This structured information may
be used in language analyses, or to complete a task: when a user asks their smart home
speaker device to play an album on Spotify, for example, the home-assistant software must
extract the user’s intent — to play some music — and the specific album required, in order to
complete the request. The other half of the challenge is Natural Language Generation
(NLG), which is about producing human-like text from a prompt or a structured information
object. It is often used to request further information from a user, or to inform them that
their task has been completed.
In the past, separate models were required for the specific subtasks that NLU and NLG en-
tail. In the case of the smart home speaker, for example, a typical pipeline could include one
model which detected the user’s intent (such as “play music” versus “turn off the lights”),
and another model which extracted the required attributes (such as a song title or location
of the desired weather forecast) from the user’s utterance. It was a lot of work, and errors
from any component model would bring the accuracy of the whole pipeline down. As the
NLP field progressed, we moved towards neural network architectures which could com-
plete the entire process end-to-end. Eventually, we struck gold with the invention of LLMs.
These are characterized by so-called “emergent abilities,” which means that, without any
specific or additional training, they’re capable of completing a huge array of NLP and NLG
tasks. This is where much of their value lies: where companies previously needed to collect
vast amounts of labeled training data to build, train and deploy multiple separate NLU mod-
els just to provide one type of functionality to their end-users, they can now potentially
instruct a single, publicly available LLM to do the job, without even needing to provide
additional training data. Thus, the range of potential use-cases, and the speed with which
they can be developed, has skyrocketed.
NLP, the discipline which makes all of this possible, is a combination of methodologies from
linguistics, data science, and AI. In this chapter, we will examine some of the most common
tools and techniques used in NLP, following a historical timeline in order to demonstrate the
decades of exciting research which have brought us to where we are today. As you are read-
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
17.2 Why Learn “Traditional” NLP in the “Age of Large Language Models”? 505
ing, look out for the blue boxes explaining how these concepts are applied in various NLP
applications: you may be surprised at how many you interact with on a daily basis.
■ 17.2 W
hy Learn “Traditional” NLP in the
“Age of Large Language Models”?
Chapters 15 and 16 explore the workings of Large Language Models (LLMs) and Generative
AI, a new field of Artificial Intelligence, which includes LLMs and deals with generating new
texts, images, videos, models, and more. Much lauded in the media, there’s no doubt that
LLMs and “Gen AI” are powering many of the breakthroughs we’re seeing almost daily in
the NLP world. But if you want to truly understand and apply NLP, it’s important to know
the fundamentals. There are four key reasons for this.
Firstly, not every problem can or should be solved with LLMs. These models are fantastic
at using their vast world “knowledge” and creativity to generate novel, long-form content,
where multiple correct solutions are possible. But many real-world data use cases seek the
exact opposite of this. They require extracting specific, concrete information from unstruc-
tured data, and usually, there’s only one correct answer. Certainly LLMs can do this too: if I
copy a customer inquiry email into a chatbot and ask it to extract the customer details and
inquiry topic into a JSON string, it’ll do it. But so could an entity extraction model, and such
a model will generally have lower latency, be easier to evaluate, and potentially be more in-
terpretable. Thus, while LLMs could be useful in prototyping a pipeline which performs
entity extraction as one of its stages, the final result may be more practicable with a tradi-
tional supervised learning model. Of course, you’d need a labeled training dataset for this,
but here’s a saving grace for the LLM: you can potentially use it to generate that data for you!
Secondly, not all problems that could benefit from Generative AI are using it (yet).
Outside of the headlines and press releases by the world’s largest research organizations,
who are using LLMs to solve NLP problems end-to-end, many companies aren’t up to that
level. Some of them are figuring out what this technology can do, others are even building
their first LLM-powered solutions, but many are realizing the challenges with bringing such
a product into production. Best practices and established design patterns don’t yet exist for
developers to turn to. Many new tools designed to help build LLM systems are not yet robust
enough to be relied upon. Issues like complexity and latency when making multiple LLM
calls, and security when connecting LLMs to external tools, can massively slow the pace of
development. Finally, difficulties figuring out how to evaluate an LLM’s outputs make it
harder to measure the value of the solution, and thus, harder to justify the continued R&D
effort in the face of such issues. For this reason, if you’re a developer or data scientist want-
ing to dive into using LLMs 100% of your working time, you may be disappointed.
A third and related motivation for learning “traditional” NLP is that plenty of companies
with working NLP systems have no incentive to start over with Gen AI. They do need to
maintain their existing NLP systems in production, however, which means they still need
employees who know how to debug text preprocessing pipelines, evaluate NLP models, and
maybe even extract new features from text data, to continually improve the existing system.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
506 17 Natural Language Processing (NLP)
If these companies do decide to experiment with LLMs, it’ll likely be to tackle brand new
problems first. Thus, it will take quite some time (if it happens at all) before existing solu-
tions using “traditional” NLP techniques become entirely obsolete.
Finally, it’s important to remember that many LLM-powered solutions are built on NLP
pipelines. Training or fine-tuning an LLM requires text data which has been gathered,
cleaned, and formatted consistently. All of this requires NLP skills. If your development
team wants to filter the input data, prompts, or model outputs for toxic content, you’ll need
NLP for that, too, as you’ll be implementing something like keyword filters or content clas-
sification models. The same goes if you want to apply quality control to an LLM’s responses,
pulling a human into the loop in cases where the quality is detected as low: tasks like this
are still sometimes done with traditional NLP techniques and supervised models. Similarly,
say you want to build a Retrieval Augmented Generation system, in which documents are
broken into chunks, embedded, and retrieved as context for an LLM, based on their rele-
vance to the query. Creating such an application requires experimenting with embedding
methods and document segmentation strategies, and figuring out how much chunk overlap
is needed to ensure that all relevant information is preserved for the LLM, without exceed-
ing its input limits (known as its “context window”). NLP skills can help you figure out all
of those issues, and more.
So, with all of that in mind as inspiration, it’s time to learn some fundamental NLP skills.
Across the world, more and more data are being produced, every single day, at an ever-
increasing pace. But it’s not all contained in nicely prepared spreadsheets and databases:
A lot of it is an unstructured mess, and text data is a huge contributor to the chaos. For
example, just think of all the different styles of websites which exist now and are newly
created every day: there’s a wealth of information to be found, but it’s in different formats,
languages, styles, and often contains encoding errors, spelling errors, hashtags instead of
sentences, emojis instead of words, and so on.
In order to get value from text data, we need to first bring structure to it. This section will
introduce you to some of the many techniques which can help in this data preparation step.
Data scientists, business analysts and computational linguists generally pick the techniques
which are appropriate to their end goal, and then string them together into a “pipeline”,
where each technique modifies the text data before passing it onto the next step. The final
result is a new dataset which has been cleaned and structured in a systematic way, making
it suitable for further applications such as analytics, information extraction, linguistic re-
search, and machine learning.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
17.3 NLP Data Preparation Techniques 507
For installation on Windows, the NLTK providers recommend you follow the instructions at
https://www.nltk.org/install.html.
In order to follow along with the practical NLP pipeline steps in this chapter, please down-
load and import the following packages:
import nltk
nltk.download(‘punkt’)
nltk.download(‘wordnet’)
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet, stopwords
Next, we will define a small text which we will use for processing:
2
See also, for example, spaCy (https://spacy.io/) for NLP in Python and Stanford Core NLP if working in Java
(https://stanfordnlp.github.io/CoreNLP/)
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
508 17 Natural Language Processing (NLP)
The tag labels in this sentence (there are many more in the full tagset) correspond to the
following parts of speech: DT = Determiner, NN = noun (singular), VBZ = verb (3rd person,
singular, present), VBG = verb (gerund or present participle), JJ = adjective, IN = preposi-
tion, and PRP$ = possessive pronoun. You can also retrieve information about this tagset in
NLTK using:
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
17.3 NLP Data Preparation Techniques 509
And finally, you can generate a list of tuples of (token, tag) using:
biography_tokens_and_tags = nltk.pos_tag(biography_tokens)
If this is the first time you are reading about POS tagging, you may think the concept (if not
the practice) is rather simple. For English, it may be, but this is certainly not a universal.
Many languages feature much more complicated grammatical roles and syntactic functions
than English. Furthermore, while English tends to use separate words to represent individ-
ual units of meaning, many other languages ‘compose’ words out of individual morphemes,
which makes learning one POS tag per word impossible.
Consider the following example (reproduced from [4]) from the North American Indian
language, Chinook. The single word “ačimluda” actually conveys the meaning of an entire
sentence, “he will give it to you.” This one word contains the elements a- (indicating the
future), -č- (meaning “he”), -i- (meaning “him”), -m- (meaning “thee,” an archaic form of
“you”), -l- (meaning “to”), -ud- (meaning “give”) and -a (again indicating the future). This is
an example of an agglutinating language, as are Turkish, Finnish and Japanese, among
others. For such languages, we often require morphological disambiguation to first iden-
tify the units of meaning within each word (conceptually this is similar to tokenization),
followed by morphological tagging, as opposed to POS tagging. These techniques are be-
yond the scope of this chapter, but we encourage you to perform your own research, if you
would like a fuller understanding of just one of the ways multi-lingual NLP can be both
challenging and fascinating.
Stemming
Stemming uses algorithms based on rules (often derived from statistics) or machine learn-
ing to strip the endings from words, leaving only the word stem. This stem may not always
be a grammatical word. For example, converting plural nouns to singular can be done by
removing the suffix “-s”, but this won’t work for irregular English nouns. Thus we get
“dogs” → “dog”, but “countries” → “countrie”, and “women” → “women”. Similar problems
arise in other languages, too. For example, in German many plural nouns can be converted
to singular by removing the suffix “-en” or “-er”, but irregular nouns pose problems, too.
Thus we get “Frauen” → “Frau” (“Women” → “Woman”), which is correct, but “Bücher” →
“Büch” (“Books” → “Book”, where the latter should actually be spelled, “Buch”).
This may sound like a serious problem, but it doesn’t have to be. At least for statistical ma-
chine learning projects, creating stems which are also correct words is not particularly im-
portant: what is important is that each individual word is stemmed consistently and that at
inference (test) time, the new data is stemmed in the same way.
For analysis projects, particularly linguistic research, the creation of non-words via stem-
ming may be more problematic. In that case, lemmatization offers a solution. Before we
discuss it, let us perform stemming on our biography_tokens:
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
510 17 Natural Language Processing (NLP)
snowball = SnowballStemmer(language=’english’)
biography_stems = [snowball.stem(token) for token in biography_tokens]
print(‘ | “.join(biography_stems))
Lemmatising
Lemmatising, or “lemmatization,” generally relies on dictionary lookup to convert each
word to its standard form. Again an example could be reducing plural nouns to singular, but
with lemmatizing, the result should also be a complete word, recognized in the given lan-
guage as the official standard form of the original input word. This accuracy may come at
the cost of being slightly slower than a stemming algorithm, however, modern machines
and methods largely negate this cost. A more significant problem is the need for language-
specific dictionaries, which require considerable manual work in their creation, and thus
tend only to be available for a few of the world’s thousands of languages.
Another problem for lemmatizers is that they cannot handle novel words, such as “twerk-
ing” or “vaping”, which have only officially entered the dictionary in recent years. In con-
trast, stemming is often able to handle novel words, as they tend to follow the same mor
phological inflection rules as other words of the same grammatical Part-of-Speech (see
Part-of-Speech Tagging, below) in the same language. For example, if you are an English
speaker and encounter a new verb like “vape”, you already know that it can be inflected as
“vaped” or “vaping”, depending on the context. Similarly, German speakers will automati-
cally know that the past tense of “twerk” is “getwerked”, however odd it may sound! This
fascinating property of language is referred to in linguistics as “productivity”, and can be
useful in defining rule-based NLP applications, such as stemmers. This is demonstrated in
the code below:
wordnet_lemmatizer = WordNetLemmatizer()
novel_words = [‘twerking’, “twerks’, “twerked’, “vaping’, “vapes’, “vape’]
print(‘ | “.join([snowball.stem(t) for t in novel_words]))
print(‘ | “.join([wordnet_lemmatizer.lemmatize(t) for t in novel_words]))
While the above code will work, the WordNetLemmatizer actually performs better if given
the parts of speech corresponding to each token. Unfortunately, the tagset used by this lem-
matizer differs to the Penn Treebank tags used by NLTK’s POS Tagger. Thus, in the below
code, we take our list of tuples of (token, tag) and convert the tags, first.
biography_tokens_and_wordnet_tags = []
for tok, pos in biography_tokens_and_tags:
new_pos = upenn_to_wordnet_pos(pos)
biography_tokens_and_wordnet_tags.append((tok, new_pos))
Stemming and lemmatization reduce the size of the vocabulary that needs to
be processed, while ensuring that all words are represented in a consistent
way. For analysis purposes, this can help us identify more general trends
in the data. For example, much of the initial work on stemming and lemmati-
zation was born from research into information retrieval. Meanwhile, for
statistical machine learning tasks, making the vocabulary smaller and the
individual word representations more consistent reduces our feature set
(and therefore model size), while providing more training examples for each
generalized version of the word than there would have been for its original
variations. Both of these factors can improve the model’s performance.
Stopword Removal
Languages contain many words which occur frequently but add little meaning, such as arti-
cles and prepositions. Consider the sentence, “the dog bites the man.” If we remove the
article, “the”, we end up with, “dog . . . bites . . . man”. While no longer grammatical, most of
the meaning is preserved. Thus, “stopwords” don’t contribute much to textual analyses. Nor
do they help statistical machine learning models such as classifiers, since they appear in all
contexts. Hence, they are often ignored in classification and, in particular, information re-
trieval tasks. Domain-specific stopwords may be removed, too. For example, if certain to-
kens appear frequently and in all contexts in your domain, you may wish to exclude them.
NLTK contains stopwords in various languages. The code below shows how to display the
English stopwords, and remove them from a text:
stopwords_en = set(stopwords.words(‘english’))
print(‘ | “.join(stopwords_en))
biography_no_stopwords = [l for l in biography_lemmas_improved if l.lower() not in
stopwords_en]
Note, however, that stopwords can be crucial for NLU. For example, the word “not” (and it’s
equivalent in other languages) is often included in stopword lists, and yet “not X” has a very
different meaning to “X”. Furthermore, each individual employs stopwords with a certain
style, and this information can be used for tasks such as author profiling: identifying the
speaker or author of a text, which can be used for purposes ranging from historical linguis-
tics (see, for example, the controversy over who really wrote some of Shakespeare’s plays)
to forensic linguistics (such as identifying criminals in online chatrooms).
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
512 17 Natural Language Processing (NLP)
Figure 17.2 Demonstration of entities which could be extracted using Named Entity Recognition
Noun Chunking is like a simpler version of NER: we are not interested in what kind of
“thing” an entity is, we simply want to extract it from the text.
The code for extracting noun chunks and Named Entities is virtually identical in NLTK,
differing only in the value of the “binary” argument:
Notice that the output of NLTK’s noun_chunk() function is the entire sequence as an nltk.
tree, a class for representing hierarchical language structures. Every recognized noun
chunk within that tree is, itself, represented as a tree. This is why the code above checks for
trees in the output, and prints them.
POS tags and Named Entities are often used as input features for solving
NLP tasks with statistical machine learning. Note that, interestingly,
POS-Taggers and Named Entity Recognizers are themselves often built via
statistical machine learning.
Other use-cases include annotating data for linguistic research and performing
rule-based information extraction by retrieving the Named Entities or words
with a certain POS-Tag, such as “Noun”.
case you will wish to create features which describe each token. These could include any
combination of attributes such as the first or last n characters, the previous or next word, or
the word shape. What do we mean by “word shape”? This could be represented as an “X” for
a capital letter, an “x” for a lowercase letter, a “0” for a digit and a “.” for a punctuation mark,
for example. A username like DataWhiz123! would then be converted to XxxxXxxx000.
If you are using your text data for a machine learning project, you will likely need to convert
it into a format the training algorithm can understand. This could be a data frame (as shown
in Figure 17.3), a dictionary of words and their features, or some other input structure.
In the following, we will introduce just some of the possible and common techniques. We
will hold off on our discussion of word embeddings, another popular method, until Sec-
tion 17.4.4).
Figure 17.3 Example representation of part of an input sequence after performing custom feature
extraction for a statistical machine learning model
Section 17.4.1), statistical machine learning (Section 17.4.2), neural network approaches
(Section 17.4.3), and approaches which take advantage of the benefits of transfer learning
(Section 17.4.4). Each category is introduced below with a selection of typical tasks and the
techniques which may be used to solve them. Some advantages and challenges for each
category are also presented.
Machine Translation
Rule-based machine translation usually involves translating individual words and phrases
via dictionaries, and then, if the target language requires it, reordering those words using
hand-written word-order rules. For example, in this translation from English to German, the
words are first translated using a dictionary, and then reordered:
Yesterday - I - read - a - wonderful - book
Gestern - ich - habe gelesen - ein - wunderbares - Buch
Gestern habe ich ein wunderbares Buch gelesen
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
516 17 Natural Language Processing (NLP)
Unfortunately, long-distance dependency relations between words, such as the relation be-
tween “habe” and “gelesen” above, make writing phrase-based rules difficult. This is partic-
ularly true in long sentences with nested structures, such as we get when we add the adverb
clause, “during my train trip,” to our above example:
Gestern, während meiner Zugfahrt, habe ich ein wunderbares Buch gelesen.
Lexical and syntactic ambiguities, which occur when a word or phrase could be interpreted
multiple different ways, also cause problems. For this reason, rule-based machine trans
lation has all but been replaced by statistical machine learning and neural network ap-
proaches, which we will see later in this section.
Grammar Parsers
Grammars are sets of rules which attempt to describe all possible sentences in a language.
Parsers are algorithms capable of building a data structure from an input. In the 1950s,
computational linguists attempted to write grammars and create parsers to process them, in
order to automatically identify syntactic structures in texts. The goal of such work was to
add rules to rule-based machine translation systems which incorporated these automati-
cally identified structures (the code example below shows you how to extract and view one
kind of syntactic structure yourself). Such rules could assist with the ambiguities problem
referred to above.
Developing rule-based grammar parsers is a challenge in itself, though, which is why they
are no longer common. However they can still be useful in certain cases. For example, imag-
ine you would like to perform entity extraction for entity types which are not recognized by
commonly available Named Entity recognizers (see “Named Entity Recognition and Noun
Chunking” in Section 17.3.1 for some common entity types). You could build a grammar
parser to perform this extraction. Furthermore, you could even use this parser to create data
to train your own entity recognizer, perhaps automatically extracting a shortlist of training
examples first, then reviewing them and throwing away any false positives, and finally,
training your NER model from there.
The following code snippet defines a grammar for a noun chunk, then uses NLTK’s regular
expressions parser to identify chunks in some data we prepared under “Part-of-Speech Tag-
ging” in Section 17.3.1.
grammar = r”””
POSS: {<DT>?<NN|NNP><POS>}
PP:{<IN><NN|NNP>}
NP: {<POSS|DT>?<JJ>*<NN|NNP>+<PP>*}
“””
parser = nltk.RegexpParser(grammar)
biography_noun_chunks = parser.parse(biography_tokens_and_tags)
print([i for i in biography_noun_chunks if type(i) == nltk.tree.Tree])
The example might take a little explaining. The variable “biography_tokens_and_tags” con-
tains a list of tuples of (token, POS tag). The grammar consists of the following components:
POSS: Zero or one determiner (DT) followed by a noun (NN) or proper noun (NNP) fol-
lowed by a possessive marker, “-’s”
Example: This would recognise “The girl’s” in “The girl’s laptop”.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
17.4 NLP Tasks and Methods 517
Despite the difficulties rule-based systems pose, they are still used in a
number of situations. For example:
when the domain and variability of inputs are restricted: such as in simple,
template-based chatbots
when the input data is highly structured: such as for categorizing IT-support
tickets, or summarizing information into simple texts, like weather and
sports reports
when the available data is small: such as machine translation for low
resource languages
when it can improve more complex systems: for example, rule-based
lemmatization can be used to prepare text data for some statistical
language processing system.
If writing language rules is so hard, how do infant humans acquire their native grammar so
effortlessly? They learn from exposure to language data: hearing other humans talk. This is
also the idea behind statistical machine learning approaches to NLP, which came to promi-
nence in the late 1980s to 1990s. These work by making probabilistic decisions about the
most likely output for a given input, based on frequencies collected from previously ob-
served data.
In this section, we will observe statistical machine learning tasks and techniques together,
in the context of real-world examples, to aid your understanding. This list is not exhaustive:
some of the other algorithms introduced in Chapter Twelve, for example, could also be ap-
plied to text data once it is converted into the appropriate input format (Support Vector
Machines, for example, can be used for tasks such as document classification or information
extraction). However, this list should provide you with a solid foundational knowledge of
common approaches.
A Naive Bayes classifier uses the so-called, “Bayes rule” to define P(c|d), the probability of
class c given a document d. In real words, given a movie review, it tries to calculate how
likely it is that that review is positive or negative, using:
Equation 17.1
Let’s walk through this together. Imagine the document is a very simple review, d=[“Excel-
lent”, “movie”], and the class is c=positive. P(d|c), called the “likelihood”, means, “how
likely is this review, given that we know its class is positive?” P(c), the “prior”, asks, “of all
our training documents, what percentage of them are positive?” We multiply these, then
divide the result by a normalization constant, P(d). This will ensure that all results are
within the range [0,1] and that the final probability distribution sums to one.
The algorithm learns these probabilities from the training data (which will have been vec-
torized in some manner, such as using a bag-of-words encoding (Section 17.3.2)). So to learn
the likelihood, P(d|c), we:
1. Take the first word in d and count how often it appeared in the dataset in a positive doc-
ument: that gives us P(word1|c).
2. Repeat this for all n words in the document.
3. multiply all the P(wordx|c)s together and replace P(d|c) with that value, i. e. P(word1,
word2, . . . wordn|c).
The remaining probabilities, P(c) and P(d), can be directly counted from the data.
Equation 17.1 could theoretically be used to create a “Bayes Classifier”. However, given the
infinite ways words can be combined, it becomes impractical to calculate probabilities for all
possible combinations. Thus we make some n aive assumptions, which is what gives this
classifier its name. First, we assume that word order does not affect meaning (though of
course, it does). This enables us to care only about which words are in a document, and to
treat “Great movie. Loved it!” and “Loved it! Great Movie!” as one combination. Second, we
assume that P(word1|c), P(word2|c), and so on are independent (though of course they are
not, since the likelihood of any word is affected by what comes before it). This “conditional
independence assumption” means the term P(word1, word2, . . . wordn|c) from earlier can be
expressed as P(word1|c) * P(word2|c) * . . . * P(wordn|c), which is much easier to calculate.
Effectively, instead of training a huge number of parameters that would require a lot of data,
we simply calculate them from another smaller set of parameters whose estimates require
smaller amounts of data. Thanks to these assumptions, a Naive Bayes Classifier will have far
fewer parameters than a Bayes Classifier and will learn much faster.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
17.4 NLP Tasks and Methods 521
For optimization problems, it is often desirable to omit the square root and retain the
squared Euclidean distance, as this is a convex function which allows convex analyses to be
used. Fortunately, minimizing the squared Euclidean distance is equivalent to minimizing
the Euclidean distance, but easier to solve. Thus, in the K-Means algorithm, we are con-
cerned with the sum of the squared distances between the centroid and the points assigned
to it:
Identifying trends and insights from free-form (that is, unstructured) text
Search engines and document retrieval: which document cluster best
matches this query?
Summarizing information: what are the key topics?
Product recommendation systems: so-called “content-based filtering”
revolves around understanding characteristics of the product or media
being recommended; this can be extracted from product descriptions
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
522 17 Natural Language Processing (NLP)
Given this training data, and the following test sequence, we would incorrectly tag “run”
as a verb, when in this case it is actually a noun. (As a side tip, note the use of the aster-
isk in this example, which in linguistics generally means “This is incorrect,” or, “There
are no records for such a usage”):
05 *Pronoun Verb Possessive Adjective Noun Verb
I enjoyed my morning run
Clearly, performing POS tagging using only the most common tag per word is not appropri-
ate. A common approach instead is to use Hidden Markov Models (HMM). As stochastic
processes, HMMs are useful when we can observe a sequence, such as a POS-tagged sen-
tence, but we cannot observe the process of hidden states which created that sequence. That
is, we cannot observe the grammar.
Let’s tag 5. properly, and add it to our training set as example 6:
06 Pronoun Verb Possessive Adjective Noun Noun
I enjoyed my morning run
The POS tags represent hidden states, and the transition probabilities are likelihoods of
transitions between hidden states, such as P(Noun → Verb), which is the likelihood that a
verb follows a noun. The emission probabilities are likelihoods that each word really could
be output from the indicated hidden state. For example, P(Verb|“I”) should equal zero in
English, as “I” can never be a verb.
Our tiny training set has 17 transitions, and the word “run” appears four times. So the tran-
sition probabilities would include*:
P(Noun → Verb) = 0.06 (occurred in 1/17 transitions)
P(Noun → Noun) = 0.12 (occurred in 2/17 transitions)
. . . and the emission probabilities would include (but not be limited to):
P(“run”=Verb) = 0.5 (occurred 2/4 times)
P(“run”=Noun) = 0.5 (occurred 2/4 times)
Once all of these probabilities are learned (counted) from the training set, they can be used
to calculate the likelihood of the sequences 7. and 8., below. This is done by multiplying the
transition and emission probabilities at each timestep (individual word) together. So al-
though “run” appears equally often as a noun and a verb, and thus the emission probabili-
ties are the same for this word for both parts of speech, once we also consider the transition
probabilities, we find the most likely (and correct) sequence is 8.:
07 *Possessive Adjective Noun Verb Verb Adverb
My morning run is fun
08 Possessive Adjective Noun Noun Verb Adverb
My morning run is fun
For longer sequences, these probabilities become much more complicated to calculate. But
algorithms such as the Viterbi algorithm [6] can be used to efficiently find the output se-
quence which maximizes the likelihood of the model.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
524 17 Natural Language Processing (NLP)
We can read this as: the probability of observing the sequence (w1, . . . , wm) equals the prod-
uct of the probabilities of observing the ith word wi after the preceding i − 1 words. This is
approximately equal to the probability of observing wi in the smaller context window of the
preceding n − 1 words. For example, below we see uni- and bi-gram model representations
of the probabilities for a simple sequence. Note that “BOS” stands for “Beginning of Se-
quence” and “EOS” for “End of Sequence”. As discussed above with respect to Naive Bayes
Classifiers, these probabilities can be ascertained using counts in a training dataset:
Puni-gram(w1, w2, w3, w4, w5) = P(w1) * P(w2) * P(w3) * P(w4) * P(w5) = P(BOS) * P(“I”) *
P(“like”) * P(“pizza”) * P(EOS)
Pbi-gram(w1, w2, w3, w4, w5) = P(w1) * P(w2|w1) * P(w3|w2) * P(w4|w3) * P(w5|w4)
= P(“I”|BOS) * P(“like”|“I”) * P(“pizza”|“like”) * P(EOS|“pizza”)
A trained statistical language model is useful for disambiguating between multiple possible
interpretations of an input, in cases of syntactic or lexical ambiguity. For example, there’s
nothing in the English grammar to prevent the phrase, “the chicken is ready to eat,” from
meaning the chicken is hungry and waiting for a meal. But we already know this inter
pretation is unlikely, even absurd, and our use of language would reflect that. A language
model’s probabilities will, too.
Equation 17.2
Here P(f|e) is called the “translation model” and represents the probability that the French
source string is the translation of the English target string. P(e) is the “language model”:
a model of the target language, English, from which we can ascertain the probability of
seeing that target language string.
Let’s clarify that with an example. P(e|f) is, as we know, the probability of seeing e, given f.
In terms of statistical machine translation, it is the probability of a source string (say, the
French phrase, “le chat”) producing a specific target language string (in this case, the
English phrase, “the cat”). This P(e|f) is proportional to the probability of “the cat” translat-
ing into “le chat”, multiplied by the probability of seeing “the cat” in English. For any given
f, the best translation e is that which maximizes:
Equation 17.3
In early, word-based machine translation, algorithms were trained on parallel language cor-
pora – pairs of sentences translated by humans – and learned to predict the most common
translation of individual words. Unfortunately, such datasets are costly and difficult to
produce. Furthermore, languages often contain phrases whose meaning is different from
the meaning of the individual words, making this an inappropriate approach. For example,
“real estate” should definitely not be translated to “domaine nouvelle” in French, or “echter
Nachlass” in German. For this reason, Phrase-Based Machine Translation was introduced.
The process is as follows:
1. Split texts into individual tokens (unigrams) and “phrasemes” of two or three tokens (bi-
and tri-grams). For example:
“real” (unigram), “real estate” (bigram), and “real estate agent” (trigram).
2. Translate the phrasemes probabilistically using word and phrase alignment tables. Such
tables consist of pairs of words or phrases paired between the source and target lan-
guage; the tables themselves are usually automatically detected as an intermediate step
in the overall process (see [7], for one illustration). In the following example, the first
translation would be chosen, as its probability is higher than that of the second:
3. P(“der Immobilienbranche”|“real estate”) > P(“echter Nachlass”|“real estate”)
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
526 17 Natural Language Processing (NLP)
Equation 17.4
This formula indicates that ht is derived by applying the nonlinear function f (which is typ-
ically some kind of RNN, such as an LSTM), to the current token, xt, and previous encoder
hidden state, ht-1. The context vector, c, is generated by applying another nonlinear function,
q, to the entire series of generated hidden states:
Equation 17.5
The encoder information flow is represented in the left half of Figure 17.6, where, at time-
step “C”, for example, the network receives the vector for “C”, plus the hidden state for “B”,
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
17.4 NLP Tasks and Methods 529
which itself will have been influenced by the hidden state for “A”. Encoding stops once the
encoder has processed the “EOS” token (recall from our introduction to statistical language
modelling that this stands for “End of Sequence”). The final encoder output will be a repre-
sentation of the entire input sequence, with dimensions equal to the number of hidden
units in the final encoder network layer.
Decoder: The “decoder” neural network takes the final encoder state and, also working one
step at a time, begins generating an output sequence based on learned conditional proba
bilities for combinations of possible outputs. Importantly, the hidden state representations
of the decoder’s own predictions at each step are fed back into the decoder, such that it can
use them for context while continuing to generate output predictions. This can be seen in
the right half of Figure 17.6: when the network outputs “W”, this is fed back into the net-
work to aid in predicting “X” (remember “X” could be anything, such as a POS tag or NER
label, an output word in a target language, or something else we want to map an input
sequence to). This use of previous predictions as clues for generating new predictions is
highly relevant for text processing, as language, (and so by extension the tags we may use
to annotate language) is contextual: For example, certain POS tag sequences are more or
less likely, or even impossible, and knowing what was just predicted is very useful.
Let us again look at this process mathematically. The decoder is typically trained to predict
the next output given c and all the previously predicted outputs, { , . . ., } (which
could be, for example, POS tags, Named Entity labels, or target language tokens). It does this
by learning a probability over the output y = (y1, . . ., yTy), using
Equation 17.6
That is, the probability of the target language sequence equals the product (π) of the con
ditional probabilities of all output tokens, each which are determined given the previous
outputs and the context vector. Practically, each of these conditional probabilities are mod-
eled as
Equation 17.7
where g is a nonlinear function, such as a (possibly deep) RNN, which outputs the probabil-
ity of yt, and st is the RNN hidden state.
The original Seq2Seq model was used for performing neural machine translation. A four-
layer LSTM was used to encode an input sequence, such as “I like pizza EOS”, into one rep-
resentation, before a similar LSTM, conditioned on that representation, then computed the
probability of the sequence “Ich mag Pizza EOS”. You can think of these two sequences as
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
530 17 Natural Language Processing (NLP)
replacing the “A B C” and “W X Y Z” examples in Figure 17.6, but note that in their imple-
mentation, the decoder was actually made to read the input in reverse: “C B A”. This was a
clever trick which was found to improve performance, most likely because it reduces the
average distance between an input token and its corresponding output token (depending on
the two languages and their word order rules, of course). In later improvements, bi-direc-
tional LSTMs were used. These consist of two LSTMs, one processing the input in a forward
direction, and the other, working backward; the outputs of both are combined in some way
before being passed on to the remaining layers of the network, such as a final activation
layer. The benefit of so-called “bi-LSTMs” is that they can learn from context both before and
after the current token, which is beneficial in many languages.
In addition to impressive translation performance, a practical advantage of such end-to-end
models is that intermediate steps like word alignment and language modeling are no longer
required.
Neural Attention
Despite their early successes, sequence-to-sequence models face a problem when process-
ing long sequences: the need to compress the entire sequence into a fixed-sized context
vector, making it an information bottleneck. Furthermore, information about early tokens
will be progressively lost, as new tokens are encoded. This is especially problematic in the
case of long-distance dependencies.
Neural attention was introduced to resolve this. Let us explore it by walking through the
so-called “Bahdanau attention” architecture [9], which was first applied to an encoder-
decoder model for neural machine translation.
The encoder stage is as previously discussed, with the encoder progressively encoding all
input tokens, including the final “EOS” token, into hidden states, which are non-linearly
combined into a vector c. Importantly, note that this vector is allowed to be of a variable
length.
In the sequence-to-sequence model with attention, the decoder re-defines the conditional
probability from Equation 17.6 as
Again, here x is an input sequence, g is a nonlinear function, and y1, . . ., yi-1 are the previous
predictions. But now we have si, which is the decoder RNN hidden state for timestep i, and
we have ci, which is a distinct context vector for each target word yi. What does this mean?
Recall from Equation 17.5 that in a typical sequence-to-sequence model, c (just c, not ci) is
created by applying a nonlinear function to a series of encoder hidden states. Similarly, in a
model with attention, each ci depends on a sequence of so-called “annotations” (h1, . . ., hTx),
which contain the encoder’s representation of the entire input sequence. The crucial differ-
ence is that ci includes a particular emphasis on the input token i and its neighbors. How is
this emphasis achieved? Instead of using a forward RNN to encode each hi, input is pro-
cessed by both a forward- and backward-RNN, which each produce a series of hidden states,
which are then concatenated. Thus, each hi summarizes the tokens preceding and following
i, which is a huge benefit for natural languages, given that both contexts can heavily influ-
ence meaning. And since RNNs represent recent inputs better than older ones, each anno-
tation hi will be focused on the corresponding xi and its neighbors.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
17.4 NLP Tasks and Methods 531
Returning to the context vector ci, it is computed as a weighted sum of these annotations hi.
The weight is determined by an “alignment model”, a, which takes the current input anno-
tation hj and the previous decoder hidden state and computes a score for how well the inputs
around position j and the outputs around position i are aligned. In other words, in the align-
ment model, aij reflects the probability that yi is aligned to (that is, is a translation of), xj.
igure 17.7
F
An encoder-decoder model producing the t-th target word, yt,
given the source sequence (x1, x2, . . ., xT) [9]
Thus, the alignment model learns to align words in a manner similar to the word and phrase
alignment tables used in earlier, statistical machine translation (see Section 17.4.2). Each aij
reflects how important hj is with regards to the hidden state si−1 in generating the next hid-
den state, si, and the output, yi. For example, in Figure 17.8 on the left we see a learned
alignment based on attention. The gold standard alignment is shown on the right.
The alignment model is a fully connected feed-forward neural network (FNN): a simple
architecture in which all neurons in one layer are connected to all neurons in the next layer.
This FNN is trained jointly with the model and produces an output for the current timestep.
This is of course a practical benefit, as the historical, intermediate step of creating word
alignments, whether by hand or statistically, is no longer required.
Note that this section has described only one possible attention architecture. The interested
reader may wish to compare it with another proposal from the same period, known as
“Luong attention” [10]. Later, we will also see an example of attention in the famous “Trans-
former model”.
Figure 17.8 Word alignments learned by an attention mechanism (left) with the gold standard
alignments (right) [10]
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
532 17 Natural Language Processing (NLP)
large and consisting of mainly zeroes. This is impractical, as the information density consid-
ering the amount of storage required is low. Another issue with some of these approaches
is that input words are treated separately, thus failing to take advantage of common rela-
tions between words and the meaning they lend one another.
For these reasons, NLP practitioners began using dense vector representations of the input,
called embeddings. The input may be sliced into individual characters, character n-grams
(sub-words) or individual words, giving rise to the terms character-, sub-word- and word-
level embeddings. We will stick to word-level embeddings for our explanation.
Word embeddings are numeric representations of words, which have been arranged in dis-
tributed, semantic space according to characteristics of each word and its neighbors. The
arrangement follows the “distributional hypothesis”, which says semantically similar words
should appear in similar contexts, and dissimilar words should not. For example, the vectors
for “tiger” and “lion” should be more similar than those for “tiger” and “spoon”.
Word embeddings can be learned jointly with a model for a specific task, to maximize their
utility for that task. In this approach, the vectors are initialized randomly and thereafter
updated – via updating the weights in a neural network – to better reflect a mapping be-
tween the input and output signal. Unfortunately, this requires a significant amount of
labelled data and can limit the model’s generalizability to new tasks and domains. Thus, it
is more common to use pre-trained word embeddings, which can be trained on masses of
readily available, unlabeled data. The pre-trained embeddings are used to encode the input
for the NLP task at hand, simply by looking up the existing embedding for each input word
in the training data. The benefit of this is that the existing embeddings transfer their pre-
learned syntactic and semantic knowledge to the new task, making this an early example of
transfer learning in NLP.
Two famous implementations of non-contextual word embeddings are Word2Vec and GloVe.
Let’s take a look at each.
Word2Vec
Word2Vec [11] are among the earliest pre-trained word embeddings. They are learned using
simple neural networks in one of two architectures: continuous bag-of-words (CBOW) and
skip-gram (Figure 17.9). The CBOW model is trained to predict a target word given a k-sized
context window of past and future words around it, without considering word order. Con-
versely, the skip-gram model has to predict a k-sized context window given an input word.
After each prediction, the word embeddings vectors are adjusted based on the prediction
error until the model learns to encode semantic and syntactic relationships.
Word2Vec was highly successful and popularized the use of pre-trained embeddings for
NLP tasks. Yet the use of a small, local context window could limit the subtlety of learned
word meanings and even cause semantically opposing words with similar typical contexts,
such as “good” and “bad”, to be clustered together. Thus, Global Vectors (GloVe) embeddings
were proposed.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
534 17 Natural Language Processing (NLP)
Figure 17.9 The two architectures proposed for learning Word2Vec word embeddings: continuous
bag-of-words (left) and Skip-Gram (right)
Figure 17.10 Example co-occurrence matrix for tokens in a simple, one sentence dataset
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
17.4 NLP Tasks and Methods 535
Next, a log-bilinear model is trained with a weighted least squares regression objective to
produce word vectors as follows:
where
V is the size of the training set vocabulary
Xij is the number of times word i occurs in the context of word j
wi and bi are the word vector and bias for i
and are the word vector and bias for j
f(Xij) is a weighting function which should assign relatively low weight to both very rare
and very frequent co-occurrences.
The final vectors will have the property that the dot product of any two words’ vectors will
equal (or be close to equal with) the logarithm of their co-occurrence probability. The goal
of the regression objective is to minimize this difference. Since the logarithm of a ratio
equals the difference of logarithms, these logarithms of ratios of co-occurrence probabilities
can be directly related to vector differences in semantic space.
A problem with both GloVe and Word2Vec (at least in their original implementation) is that
word-level embeddings fail to take advantage of morphological and word shape information,
such as suffixes and capitalization, which could be useful for tasks like Part-of-Speech Tag-
ging or Named Entity Recognition. This also means that, since it is not feasible to store
embeddings for a language’s entire vocabulary (or miss-spellings), they cannot handle OOV
words (out-of-vocabulary input words will be represented as zeroed vectors or in some other
principled way, such as using the mean of the known words’ vectors). This is a barrier for
generalization and real-world usage, such as in general web- and particularly social-media
text.
A greater problem, though, is that GloVe and Word2Vec are non-contextual: each word only
ever receives one embedding, regardless of its usage in various contexts. Thus, we cannot
model polysemy – when one word has multiple different, possible meanings – and use this
knowledge to resolve semantic ambiguities. This can make even cases as simple as the fol-
lowing, rather tricky:
1. “She likes to box” versus “She likes the box”.
2. “He likes to cook” versus “He likes the cook”.
In sentence 1., if we were to look only at the non-contextual embedding of “box”, it would be
hard to know whether to translate this into German as “boxen” (the substantive verb) or
“Box” (the noun). Of course, the surrounding context will help the model disambiguate, but
there may be other cases where this is not so clear. A similar ambiguity arises in 2.: should
we translate “cook” into “kochen” (the substantive verb) or to “Koch” or “Köchin” (the male
and female versions of the noun, respectively)? Contextual word embeddings dramatically
assist in these situations, as we will see with the ELMO model, below.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
536 17 Natural Language Processing (NLP)
ELMO uses character-level representations and whole sentence contexts in order to learn a
different embedding for each word in each of its contexts, and to be able to represent OOV
words by understanding subword units. Additionally, its three-layer architecture enables it
to capture non-linear relations between inputs and to create deep representations, which
can help in disambiguating polysemous words. The training process is as follows:
1. We start with a large, unlabeled text dataset, and use n-gram convolutional filters to cal-
culate context-insensitive character embeddings. Recall from our discussion on Word-
2Vec and GloVe that “context-insensitive” means each n-gram receives a single embed-
ding, regardless of the various contexts in which it appears.
2. The character embeddings are fed to L stacked bidirectional LSTMs (biLSTMs), which
calculate context-dependent word representations by jointly maximizing the log likeli-
hood of the word sequence both forward and backward. The LSTMs get their bidirection-
ality by concatenating the features of two unidirectional (left-to-right and right-to-left)
models into a bidirectional language model (biLM):
The Transformer
The Transformer [15] was designed to solve sequential data tasks, such as NLP, while han-
dling long-range dependencies. While many variations exist, this section describes the orig-
inal architecture, starting with the overall plan and then detailing the Transformer’s imple-
mentation of neural attention.
Input: Input sequences are represented numerically (such as via byte-pair encodings [16])
and fed to the model along with positional encodings. The latter are required for making use
of word order, since the Transformer architecture is non-recurrent (contrast this with an
RNN, which processes the input sequentially and implicitly maintains that order through-
out). So for learning a task like German to English translation, the encoder will receive the
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
538 17 Natural Language Processing (NLP)
German sentence (embedded in some preparatory, numeric manner) and positional embed-
dings. The decoder will receive the English translation (and positional embeddings).
Encoder: The encoder consists of N=6 stacked, identical layers, each of which contains a
multi-headed self-attention mechanism and fully connected feed-forward neural network
(FNN). The attention mechanism generates an encoding for each input token, which, in our
translation example, means generating a numeric vector for each token in the German input
sequence. This output undergoes layer normalization [17] before being fed to a fully con-
nected feed-forward network. The FNN output is normalized again and then fed to the next
encoder layer. This process generally repeats for all N=6 encoder layers, however, note that
residual connections around the self-attention mechanism and FNN allow either one to be
skipped at any point.
Decoder: Like the encoder, the decoder also consists of N=6 decoder layers, each containing
a self-attention layer and fully connected FNN. Again, residual connections are deployed
around each sub-layer and layer normalization is applied thereafter. The difference from the
encoder is that each decoder layer features an additional sub-layer, the “encoder-decoder
attention layer”, which performs multi-headed attention over the entire encoder output (in
our example, the encoded German sentence).
In the final decoder output layer, the FNN projects the incoming vector into a “logits” vector
of dimensions equal to the training data vocabulary size. A softmax probability is generated
for each possible output token, and then for whichever index in the vector has the highest
probability (the argmax), the corresponding token in the vocabulary is taken as the final
output. In our example, this would be the English word the Transformer has predicted as
coming next in the translation. The loss between this token and the true next word is then
calculated and backpropagated through the entire network, and training continues.
In Detail: Transformer Attention: Calculating Transformer attention begins with linearly
mapping the input tokens to three matrices: the queries (Q), keys (K), and values (V) matrix.
Then, each input token in query matrix Q will be compared, simultaneously, to all keys
within key matrix K and value matrix V. Each comparison between a query and a key results
in a similarity value, and softmax is applied afterward to all of these similarity values (recall
that the softmax function scales values to the range [0,1] and ensures that they sum to one).
The resulting scores will tell the model how much to attend to the corresponding value for
each key. For example, if we are encoding the sequence “The student passed her physics
exam because she studied hard for it”, then the token “student” is highly relevant to encod-
ing “she” (and should thus receive more attention), while “exam” is relevant for encoding
“it”. Having calculated these scores, we multiply them with V to produce a final vector as the
output of this attention layer.
Note that even from the first pass, the Q, K & V matrices will have been embedded as ran-
dom, numeric vectors. As with any neural network, the purpose of training is to repeatedly
adjust these embeddings until they represent each token’s semantics under different syn-
tactic and semantic contexts.
Scaled Dot-Product Attention: The formula for Transformer attention is denoted:
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
17.4 NLP Tasks and Methods 539
Putting this into words, the Transformer takes the matrix product of Q and (a transpose of)
K and scales this via multiplication with , which should prevent the softmax gradient
from becoming vanishingly small. This value is normalized via the softmax function and
finally multiplied by V.
Self-Attention: In the encoder self-attention layers, the Q, K, and V matrices are all derived
from the previous encoder layer. This enables the encoder to attend to all positions from the
previous layer for generating an encoding. The decoder self-attention layers work identi-
cally, taking the output from the previous decoder layer. The additional encoder-decoder
self-attention layer within the decoder works slightly differently, performing multi-headed
attention over the entire encoder output: It does this by taking its queries from the previous
decoder layer (that is, the queries are the previously emitted English tokens), and taking its
key and value matrices from the output of the encoder stack. Additionally, a small change is
made in calculating the scaled dot-product attention here: future tokens are “masked”, by
setting the inner products at those positions to -∞, so that the decoder cannot peek at future
tokens when producing its output.
Multi-Headed Attention: The role of a single attention head is to project its input into dis-
tributed semantic space (see the discussion on non-contextual embeddings, Section 17.4.4)
using a learned, linear transformation. The role of multi-headed attention is to provide the
attention layer with multiple “representation subspaces” in which to represent subtle differ-
ences of meaning based on context. Thus, each encoder and decoder layer features a multi-
headed attention layer with h=8 attention heads. The attention function is calculated in
parallel for each of these (as per the “Scaled Dot-Product Attention”, above), resulting in h
value matrices V. These output values are concatenated and multiplied with a trainable
output matrix, thus summarizing those h output matrices into a single matrix.
tence in a sentence pair follows the first or is a random replacement. The benefit of both
techniques is that the model can use the entire bi-directional context around each token, to
build that token’s context-aware embedding.
BERT was also bigger than the original Transformer: for example, the small version, “BERT
Base”, featured L=12 encoder-decoder layers and H=12 attention heads, while in “BERT
Large”, L=24 and A=16. Altogether, its novel features and innovative training regime helped
BERT set a new state-of-the-art in various NLP challenges. Furthermore, the code and pre-
trained models were open-sourced, allowing BERT to become a great democratizer of the
power of AI for NLP.
sentiment analysis. This kind of model requires labelled data, which usually comes at a
significant cost. A generative model learns a probability distribution over a dataset and uses
this to return the likelihood for a given output, such as a token sequence in language mod-
elling. This can be done in an unsupervised manner using easily available unlabeled data,
such as text scraped from the web.
The aim of the original GPT model was to use generative unsupervised pre-training (with a
language modelling objective) to learn a universal representation of language, then follow
that with discriminative fine-tuning. This should ease the data scarcity problem, while still
offering the benefits of transfer learning, where the general pre-training can provide a per-
formance boost for downstream tasks. In fact, this proved highly successful: on numerous
NLP tasks, GPT outperformed discriminative models with architectures specifically de-
signed and trained for those tasks.
Let us examine this process in a little more detail. In Section 17.4.4, under “Language Model
Pre-Training”, we saw how GPT uses a standard language modelling objective to conduct its
unsupervised pre-training. The language model itself consists of a multi-layer Transformer
decoder: that is, the decoder’s multi-headed self-attention and feed-forward neural network
layers are used to produce an output distribution over all possible output tokens. The model
itself is trained via stochastic gradient descent. The fine-tuning stage requires a labelled
dataset C, which consists of input tokens x1, . . . xn, and a corresponding label y. The input
tokens are passed through the pre-trained model, and the final Transformer block’s activa-
tions are fed to a linear output layer with softmax activation to predict an output label y.
Language modelling is introduced again as an auxiliary objective during fine-tuning, as this
was found to assist generalizability and accelerate model convergence.
What is interesting about this entire process is how it takes inspiration from the various
other architectures outlined above. For example, it follows BERT’s approach of using unsu-
pervised pre-training plus supervised fine-tuning, yet is similar to ELMO in the sense that
ELMOs representations are fed as inputs to a new architecture. However, unlike ELMO, it
uses Transformers and self-attention rather than BiLSTMs. And again, like BERT, GPT’s in-
puts can be adjusted to handle sentence pairs (and even triplets), in order to solve a variety
of tasks such as textual similarity and multiple-choice question answering.
GPT was first introduced in 2018. In 2019, GPT-2 was announced as a bigger and dramati-
cally more powerful version, capable of producing realistic-sounding, human-like texts. It
also excelled at zero-shot learning: a machine learning task in which a learned model must
make predictions for inputs that were never seen at test time (see also Section 17.4.4). In
fact, GPT-2’s creators, OpenAI, declined to open-source the project, citing concerns over the
potential for it to spread false information and clutter the web with textual noise. After sig-
nificant controversy over this decision, however, the model was eventually publicly released.
In 2020, OpenAI announced GPT-3, which gained immediate attention thanks to the stag-
gering 175 billion parameters of the full version. While GPT-3 exceeded the high-perfor-
mance benchmarks set by its predecessor, this came at a much greater compute cost. Once
again, the model was not released publicly, instead being made available first only to sub-
scribers and then exclusively to Microsoft. This, too, attracted some criticism, as it goes
against recent trends for open sourcing these incredibly powerful, pre-trained models.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
17.5 In a Nutshell 543
While each new GPT version made huge headlines in the NLP and AI worlds,
it wasn’t until November 2022, when OpenAI released ChatGPT, that these
models truly became household names. Providing a user-friendly chat inter-
face to the underlying model enabled OpenAI to reach hundreds of thou-
sands of new users from all walks of life, sparking a stampede of experimen-
tation and development based on this new kind of conversational language
technology. In case you missed it, Chapters 15 and 16 discuss exactly how
ChatGPT was trained, how Large Language Models like GPT lend themselves
to diverse Generative AI modalities, and how you can safely and successfully
use these technologies in both private life and production projects.
■ 17.5 In a Nutshell
References
[1] Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python.
O’Reilly Media Inc.
[2] Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, David
McClosky. “The Stanford CoreNLP Natural Language Processing Toolkit.” Proceedings of the 52nd
Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2014,
https://aclanthology.org/P14-5010
[3] Santorini, Beatrice. “Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd revi-
sion).” Technical Reports (CIS). University of Pennsylvania, Philadelphia, 1990, https://repository.
upenn.edu/cgi/viewcontent.cgi?article=1603&context=cis_reports.
[4] “Agglutination.” Encyclopædia Britannica. Encyclopædia Britannica, inc. https://www.britannica.
com/topic/agglutination-grammar.
[5] Bonn, Julia, and Nianwen Xue. “Proceedings of the Fourth International Workshop on Designing
Meaning Representations.” In Proceedings of the Fourth International Workshop on Designing
Meaning Representations. 2023.
[6] Forney, David. “The Viterbi Algorithm.” Proceedings of the IEEE 61 (3), 1973, http://dx.doi.
org/10.1109/PROC.1973.9030.
[7] Pal, Santanu, Sudip Kumar Naskar, and Sivaji Bandyopadhyay. “A hybrid word alignment model
for phrase-based statistical machine translation.” In Proceedings of the Second Workshop on Hybrid
Approaches to Translation, pp. 94–101. 2013.
[8] Sutskever, I., Vinyals, O., Le, Q. V. “Sequence to Sequence Learning with Neural Networks.” Ad-
vances in Neural Information Processing Systems, 2014, arXiv:1409.3215.
[9] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly
learning to align and translate.” arXiv preprint arXiv:1409.0473 (2014).
[10] Luong, Minh-Thang, Hieu Pham, Christopher D. Manning. “Effective approaches to attention-based
neural machine translation.” arXiv:1508.04025 (2005).
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
References 545
[11] Mikolov, Tomas, Kai Chen, Greg Corrado, Jeffrey Dean. “Efficient Estimation of Word Represen
tations in Vector Space.” arXiv preprint, arXiv:1301.3781 (2013).
[12] Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. “Glove: Global Vectors for Word
Representation.” Proceedings of the 2014 conference on empirical methods in natural language pro-
cessing (EMNLP), 2014, https://aclanthology.org/D14-1162.pdf.
[13] Radford, Alec, Karthik Narasimhan, Tim Salimans, Ilya Sutskever. “Improving Language Under-
standing by Generative Pre-Training.” Preprint, 2018, https://s3-us-west-2.amazonaws.com/open
ai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
[14] Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee,
Luke Zettlemoyer. “Deep contextualized word representations.” arXiv preprint, 2018, arXiv:
1802.05365v2.
[15] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łu-
kasz Kaiser, Illia Polosukhin. “Attention is All You Need.” Advances in neural information process-
ing systems, 2017, https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845
aa-Paper.pdf.
[16] Sennrich, Rico, Barry Haddow, Alexandra Birch. “Neural Machine Translation of Rare Words with
Subword Units.” arXiv preprint, 2105, arXiv:1508.07909.
[17] Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E Hinton. “Layer Normalization.” arXiv preprint, 2016,
arXiv:1607.06450.
[18] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. “Bert: Pre-training of Deep Bi
directional Transformers for Language Understanding.” arXiv preprint, 2018, arXiv:1810.04805.
[19] Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey,
Maxim Krikun et al. “Google’s neural machine translation system: Bridging the gap between hu-
man and machine translation.” arXiv preprint, 2016, arXiv:1609.08144.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
18
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
Computer Vision
Roxane Licandro
Figure 18.1 Visual Computing – An interplay of three major disciplines: Computer Graphics,
Computer Vision and Image Processing. (Image courtesy Roxane Licandro)
Related techniques work on image level to process images in such a way – that they are
more interpretable for a specified Computer Vision task. Image processing requires as input
an image and provides a processed image as output using techniques like:
image enhancement: saturation or brightness correction, noise reduction,…
image compression: JPG compression (lossy), PNG compression (lossless),…
feature detection: edge or interest point detection,…
image combination: image panorama stitching, multi-modal image registration,…
Image processing supports the extraction of suitable image descriptors for Computer Vision
related tasks and processes.
The applications of Computer Vision are for example…
industry related image processing systems: recognition and identification of objects on
a conveyor.
clinical systems: tumor delineation and size measurement in medical images for surgery
and treatment planning.
identification systems: mobile phones using face recognition for the owner’s identity
verification to unlock the phone.
image based retrieval systems: to retrieve similar images from a database which match
a query image.
systems in the research field of artificial intelligence: robot and computer systems,
which can perceive, understand and consequently interact with their environment like
self-driving cars.
In this chapter we will first introduce the versatile image acquisition techniques and under-
lying physical models (Section 18.2). Since Computer Vision focuses on developing and in-
vestigating techniques to reproduce human vision, we will also have a closer look at cases
when this system fails. We will talk about how images can be perceived differently by hu-
mans (optical illusions) and we will have a closer look at how computers can be tricked in
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
18.2 A Picture Paints a Thousand Words 549
the image capturing and analysis processes. In Section 18.3 we will provide a brief overview
of image processing techniques, which are frequently used to support Computer Vision
techniques. Image compression routines and computational photography techniques, are
also discussed. In Section 18.4 we will have a closer look at Computer Vision applications
and future directions in this research area and in Section 18.5 we will conclude with a dis-
cussion on how the cycle of vision closes, by briefly introducing techniques to make humans
understand what computers see to solve Computer Vision tasks.
The eye is an organ which reacts to light stimuli and enables the human to see. If visible
light arrives in the eye, information from the environment is perceived. The visual system
consists of several components which are depicted in Figure 18.2: The lense of the eye is
used to get a focused image of the environment, which is further projected up-side down
onto the light sensitive membrane (retina) at the back of the eyes. The retina consists of
photoreceptors which are excited by light and trigger the generation of neural impulses by
chemical and electric processes. Subsequently, these impulses are transmitted to brain ar-
eas for further processing. The retina consists of several layers of neurons, which are con-
nected via synapses. There are two types of light sensitive photoreceptor cells: rods (~100
million) and cones (~5 million). Rods work at a low level of light (night vision), cannot dis-
tinguish colors, but make us see grayscale images in low-light conditions, i.e. are responsi-
ble for monochrome vision. The cones are not as light sensitive as rods and are involved in
the color seeing process. Three different types of cones exist (trichromatic vision), which
respond to different wavelengths of the visible light (long-wavelength (L cones)): 564–580
nm, middle-wavelength (M cones): 534–545 nm, and short-wavelength: 420–440 nm (S
cones)) and are seated around the fovea centralis.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
550 18 Computer Vision
Figure 18.2 Schematic illustration of the human eye and its components. (Image courtesy Roxane
Licandro)
Color is perceived through the combination of stimuli of cone cells and the difference in the
signals received from the cone types. This process ensures color constancy, which results in
the effect that under varying illumination conditions, color is perceived relatively con-
stantly. The luminance perception of the human retina uses M and L cone cells combined
during daylight vision, which are most sensitive to green light. The fovea centralis lies in
the center of the macula lutea, the sharpest area of vision. Around 50% of nerve fibers prop-
agate impulses exciting at the fovea to the corresponding brain areas, the remaining 50% of
fibers forward impulses from retinal receptors [2].
Figure 18.3 Examples of optical illusions: Illusions regarding the length (left), regarding the gray
value (middle) and regarding the size (right). (Image courtesy Roxane Licandro)
These are only a few examples of optical illusions, which make the human vision system fail
or interpret images incorrectly (Please see [https://michaelbach.de/ot/] and [5] for more
interpretations and examples). This raises the question: Do we want machines to see like
humans? Thus, when creating Computer Vision systems, we have first to reflect on the ques-
tion: what should a machine see? And how can we make it see that? And which input does
the machine understand? Therefore, in the next chapter we will have a closer look at how to
capture images from the environment and what should be paid attention to in the imaging
process.
The development of the pinhole camera (camera obscura) in the 13th century was one of the
inventions which led to photography. At the time, painters used it to obtain a 2D projection
of a real world object or scene on a paper, which was geometrically correct, to be able to
paint more details. In Figure 18.4 on the left the basic principle of a pinhole camera is visu-
alised, which consists of a closed dark room with a pinhole on the front side and an image
plane on the opposite side. Light beams originating from an object point (apple), which go
through the pinhole, are linearly projected onto the image plane as a smaller up-side down
image of the visible scene. By using flipped mirrors the position and orientation of the im-
age on a piece of paper on a glass surface is adjusted [3].
Figure 18.4 The ideal pinhole camera model for perspective projection. (Image courtesy Roxane
Licandro)
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
552 18 Computer Vision
The underlying model of the ideal pinhole camera describes the mathematical relation be-
tween a three dimensional point and its corresponding projection onto the two dimensional
image plane. Since no lense is used, the camera aperture overtakes the role of focusing the
light and can be denoted as a point O. This camera obscura is the simplest acquisition de-
vice, which reflects the geometry of perspective projections exactly. This model is used for
the simple approximation of the projection of a 3D scene to a 2D image. The following as-
sumptions are made for the introduction of the equation for perspective projection: (1) The
center of projection O overlays with the origin of the camera coordinate system, (2) the op-
tical axis is oriented along the z-axis of the camera, (3) the image plane I(x,y) lies in front of
the center of projection to avoid a flip of the image. The schematic illustration for these as-
sumptions are provided in Figure 18.4 on the right. The focal length f denotes the distance
between the image plane and the center of projection (in digital cameras this would be the
distance between the lense and the photo sensor). The optic axis is the line going through
O, orthogonal to the image plane. Z denotes the distance between O and the object in z direc-
tion, X the distance in x direction and Y in y direction respectively. The perspective projec-
tion of the 3D point P(X,Y,Z) to the two dimensional coordinates x and y on the image plane
I(x,y) can be computed using the triangular relations visualised in Figure 18.4 on the right,
resulting in the following Equation 18.1 for perspective projections:
The perspective projection has the following properties: it is a non-linear projection, where
in general the three dimensional information gets lost. It is defined as a one to n mapping,
since several 3D points can be mapped onto the same position on the image plane, three
dimensional lines are projected onto two dimensional lines (exceptions are lines parallel to
the optical axis), distances and angles are not preserved and parallel lines are not projected
onto parallel lines (exceptions are lines parallel to the image planes) [3].
Image Sharpness
For being able to obtain a sharp projection, the pinhole would have to be as small as possi-
ble, leading to less light being able to pass through. This consequently would require an
increase of exposure time, which in reality is not feasible. To overcome this, lenses and
lense systems are used to improve the imaging quality, but consequently lead to a more
complex projection. Rays from an object are refracted by the lense and collimated into a
single focal point. If the object lies at a defined distance u away from the lense, a focused
image is projected behind the focal point at a distance v on the image plane. In Figure 18.5
these relations are visualised on the left. It can be expressed by the simple lense equation:
According to the definition of Equation 18.2, object points only with a specific distance v
from the camera/lense are imaged in a sharp way, while the rest is blurred. In reality this
distance is a range (also called Depth of Field (DOF)), which is defined by the distance be-
tween the closest and farthest object points which can be sharply visualised. The DOF is
influenced by the resolution of the camera sensor, the thickness of the lense and the size of
the aperture. The aperture lies in front of the lense, and is a circled opening which deter-
mines, depending on its diameter, the angle between rays and the image plane. The smaller
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
18.2 A Picture Paints a Thousand Words 553
the diameter, the fewer rays hit the image plane, the darker is the image for a given expo-
sure time, the higher is the DOF. Beside the aperture, the lense also influences the DOF: the
bigger the lense is, the higher is the amount of rays of an object point which hit the image
plane. Neighbouring object points outside the DOF range are visualized as overlapping cir-
cles, which consequently lead to an unsharp image. Thus, the range in which a sharp image
can be created is decreasing with increasing size of the lense, since the size of the overlap-
ping circles increases with increasing lens size. The resolution of the sensor also influences
the DOF. The higher the resolution of the sensor, the better the unsharp circles can be im-
aged, this means the DOF becomes smaller. Also with an increasing focal length of the lens,
the DOF decreases.
Figure 18.5 Schematic illustration of the projection principle for cameras with an ideal lens (left)
and visualization of the depth of field effect (right). (Image courtesy Roxane Licandro)
Figure 18.6 Schematic illustration of the Aliasing effect occurring in the case of an undersampled
signal. (Image courtesy Roxane Licandro)
If the sampling rate chosen is smaller than this frequency, that is, the signal is undersam-
pled, the so-called “aliasing effect” occurs. This is where the original signal is falsely repre-
sented as a low frequency signal. An example of this effect and a well sampled signal is
given in Figure 18.6.
Figure 18.7 View of the city of Perm from the railroad bridge across Karma. (Images taken from the
Prokudin-Gorskii Collection at the Library of Congress, published in 1910, Creator Prokudin-Gorskii
[1863–1944], https://www.loc.gov/pictures/collection/prok/).
The most popular technique for acquiring color images uses a Color Filter Array (CFA) (also
called Color Filter Mosaic (CFM)). Here, the pixel sensors have a collection (mosaic) of small
color filters which are positioned in front of these sensors. Since photo sensors detect light
intensity only, the filters enable the specification of the wavelength range (color) of the light
hitting the sensors. The most common filters used are Bayer filters, which were invented by
Bryce E. Bayer of Eastman Kodak. A Bayer pattern consists of 50% green light range sensi-
tive filters and 25% red and 25% blue light range sensitive filters. That is, every sensor pixel
has either a red, green or blue filter and is therefore capable of recording one color only. The
imbalance between the proportion of sensitivity per color is necessary due to the increased
sensitivity of the human eye to green light.
Figure 18.8 Schematic illustration of the demosaicing procedure of Bayer pattern based color im-
age acquisition. (Image courtesy Roxane Licandro)
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
556 18 Computer Vision
In Figure 18.8 schematic examples of Bayer pattern types are visualised, while for the pat-
tern GRGB the corresponding demosaicing routine is shown in detail. In the first step the
color channels (red, green, blue) are separated and subsequently in a second step the miss-
ing information is obtained by a defined interpolation scheme, which varies upon different
camera manufacturers [6].
After light rays (which may be filtered or unfiltered, depending on the acquisition proce-
dure) hit the sensor, the signal is digitalized to obtain a quantized value for every pixel (in
2D) or voxel (in 3D). A pixel/voxel is the smallest individual element in a 2D/3D digital im-
age and encodes the brightness measured by the photo sensor. A digital image is repre-
sented as a finite numeric set.
The storage requirement of a two dimensional digital image with 2B gray levels and c color
components of size N x M (where N is the number of pixels in a row, M the number of pixels
in a column) can be computed using Equation 18.3. Thus, for a bitmap image of size 1024 x
768 with 3 color components and 256 levels (28 Bits = 1 Byte) the image size is 2.36 MB.
Two dimensional images can be categorized into two types: (1) vector images or (2) raster
images (e.g. bitmap). Digital images have a fixed number of rows and columns (and slices,
in case of 3D images) of pixels and are represented as a finite set of digital values. For mak-
ing images accessible in a long term, image file formats were defined to be able to store
them in a standardized way. Here, we will briefly discuss the most popular image formats,
although more than hundreds exist.
Computer Graphics Metafile (CGM): The CGM data format was designed for 2D vector
graphics, raster graphics and text and it works for every operating system. A textual
source file is used to specify all graphical elements, which can be compiled in three dif-
ferent ways using binary encoding (for efficient data reading), character-based encoding
(if a small file size is required) or clear text encoding (if the content of the file should be
modifiable and accessible with a text editor). This file format is object oriented and con-
tains information for image rendering. Additionally, in a metafile information is provided
for the description and specification of graphical information and geometric primitives.
Windows Meta File (WMF): This file format was designed in the 1990s for Microsoft
Windows operating systems with the focus on making images transferable between appli-
cations and incorporating vector as well as bitmap components.
Owing to the large number of images required to train and setup Computer Vision systems,
image compression is often employed to reduce image size, leading to efficiencies in storage
and transmission. Image compression focuses on reducing redundancy in imaging data and
can be categorized into lossy or lossless compression approaches [3].
Lossy Compression
This compression type discards data in the compression process. It is built upon the trans-
form coding principle, meaning it transforms raw data into a domain that more accurately
represents the content. Repeated compression and decompression causes the images to
suffer from generation loss, also called progressive quality loss. For this reason, lossy com-
pression approaches are especially suitable for natural images (photographs). For such pho-
tos, some minor loss of fidelity is acceptable in order to achieve a substantial reduction in
file size. There also exist lossy compression techniques which seem visually lossless, since
they are capable of producing compressed images of imperceptible differences to the origi-
nal one.
Image formats that use lossy compression:
Joint Photographic Experts Group (JPEG) compression: This is the most popular lossy
compression standard for 2D digital images. The adjustment of the compression degree
allows one to balance storage size and image quality. Without visual impairment, JPEG
compression reduces an image down to a fifth of its original size. However, note that the
higher the compression degree, the higher the probability that artefacts (specifically
block formation) will occur. JPEG is a suitable compression strategy for natural images,
but not for computer graphics or medical data.
Discrete Cosine Transform (DCT) compression: This compression approach works by
expressing a sequence of many image data points as a sum of oscillating cosine functions
at different frequencies. Since the human eye is weak in detecting the strength of high
frequency brightness variations, DCT compression reduces the information amount in
high frequency components for reducing the image size on the one hand and to obtain a
visually lossless appearance of the compressed image on the other.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
558 18 Computer Vision
Lossless Compression
The defining property of lossless compression algorithms is that the original uncompressed
image can be exactly reconstructed from the compressed data. Thus, lossless compression
is often used for medical imaging, since loss of image information or the introduction of
artefacts (as in lossy compression) could lead to false conclusions in the diagnoses process.
This compression technique is also preferred, for example, for technical drawings, comics or
archival purposes. The primary lossless compression techniques, from least-to most-fre-
quently used, are Run Length Encoding (RLE), Huffman coding, and the most well-known
method Lempel Ziv (LZ) Dictionary-based Encoding. More details regarding these tech-
niques can be found in [3].
Image formats that use lossless compression:
Graphics Interchange Format (GIF): This format is a bitmap image format introduced
in the 1980s and still widely used, thanks to its wide support and portability. For a single
image 256 distinct colors can be encoded, equating to eight bits per pixel. It is suitable for
small images such as icons and simple animations.
Portable Network Graphics (PNG): The PNG format arose as a replacement of the GIF
format, since it does not require a patent license and is transferable via the internet. PNG
can encode colors in palette-based images with 24 bit RGB or 32 bit RGBA, full-color non-
palette-based RGB[A] images and grayscale images. PNG does not support non-RGB color
spaces (CMYK) and thus is not suitable for professional-quality print graphics.
Tagged Image File Format (TIFF): This image file format was created by Adobe Systems, who
wanted a standard format for images obtained by desktop scanners. TIFF is a flexible, adapt-
able file format for handling images (compressed with lossy or lossless compression) and
meta-data (image size, definition, image-data arrangement, applied image compression or
the image’s geometry) within a single file. Header tags can also be incorporated into the file.
Thus, TIFF is popular for photographers, graphics artist, and in the publishing industry.
We can differentiate between local and global features: local features describe a set of small
image regions (e.g. the mouth, nose and eye region in an image of a face), and global fea-
tures describe the whole image (e.g. the gray value histogram). For obtaining local features,
the first step is to detect interest points (or keypoints) and determine the scaling. In the
second step, the extracted features are used to compute a feature vector, which character-
izes the local visual appearance.
In this section we will briefly discuss specific image feature extraction techniques, span-
ning from edge detection and interest point detection to approaches of computational pho-
tography like image inpainting, image warping or HDR.
Edge Detection
Edge detection plays an important role in the field of Computer Vision. Feature extraction
using edge detection involves identifying points in a digital image at which the image
brightness has discontinuities. These sharp changes in image brightness reflect property
changes of the world and capture events of interest, which can be the following:
Discontinuities in depth
Discontinuities in surface orientation
Changes in texture or material properties
Variations in illuminations of a scene
The application of an edge filter to an image provides in the ideal case the boundaries of
objects or surface markings, represented by a set of connected curves. In Figure 18.9 an
example of edges which have been estimated by the Sobel edge detector is shown.
Figure 18.9 Example of detected edges (right) by applying a Sobel filter to the image on the left.
(Image courtesy Roxane Licandro)
Edge detection can be thought of as a filtering process, since it reduces the amount of data
that has to be processed by removing less relevant image properties. Edge detection ap-
proaches can be categorized into two main groups:
1. Search-based Methods: These approaches interpret edges as extrema of intensity
change in an image. This measure of edge strength can be accessed by computing the
gradient magnitude and subsequently search for local directional gradient magnitude
maxima by computing the gradient direction.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
560 18 Computer Vision
Figure 18.10 Examples of interest points (red crosses) detected by the Harris Corner Detection
approach (right). The original image is visualized on the left. (Image courtesy Roxane Licandro)
Most corner detectors are not only sensitive to corners: local image regions with a high de-
gree of variation in all directions (e.g. line endings) are also detected and called “interest
points”. To be able to extract corner points, local analyses have to be performed to determine
“real” corners. Examples of Corner Detectors are the Moravec Corner Detector and Harris
Corner Detector, while a common interest point detector is the Scale Invariant Feature
Transform (SIFT) [3].
Tone Mapping
Tone-mapping techniques were developed for displaying HDR images on devices with lower
dynamic range, such as print-outs, CRT, LCD monitors and projectors. It addresses the prob-
lem, by reducing the overall contrast, while still preserving local contrast (image details and
the appearance of color).
Photo Composition
Photo composition (also called photo montage) is a technique to create a composite image by
combining, cutting and joining different photographs. It enables the creation of an image
illusion by simulating a scene with different visual elements from separate sources. It is
often used in the entertainment sector in a process known as “green screening” or “blue
screening”: actors and other props are filmed in front of a uniformly green- or blue-coloured
screen, which makes it easier to extract those visual elements and apply further processing
to them, such as computer generated special effects.
Image Inpainting
Inpainting is a technique to reconstruct missing or deteriorated parts in images or videos by
using interpolation. The missing parts can be caused either by the absence of an object in
the actual natural scene (such as in aged, deteriorated paintings); by the introduction of
artefacts during the acquisition; or by artificially removing items from a scene. There are
various Computer Vision applications to solve this task with sophisticated algorithms:
Reverse deterioration (e.g. removing cracks and scratches in photographs, dust spots in
films)
Object removal or addition (e.g. removing stamped dates, adding occlusions for confiden-
tiality reasons)
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
562 18 Computer Vision
The aim of inpainting is to provide a modified image, in which the inpainted region is
merged seamlessly in such a way that viewers are not aware of the modification. In Fig-
ure 18.11 an example of image inpainting is provided using an approach proposed by
NVIDIA1. This approach is capable of inpainting missing regions (white blobs). Although
the inpainted regions show deviations from the original image, they provide a seamless re-
construction.
Figure 18.11 Inpainting technique developed by NVIDIA. First row: original image (left), image with
removed parts (middle), inpainted result (right). Second row: detailed views of original (right), dam-
aged (middle) and inpainted image regions (right). (Image courtesy Roxane Licandro)
Image Warping
This technique enables the manipulation of an image such that any visible shapes are sig-
nificantly distorted. A set of points in a source image are mapped (warped) to a correspond-
ing set of points in the warped source image, without changing the color. If the defined
mapping function is injective (for every point in the original image exists a mapping to a
point in the warped image), a reconstruction of the original image from the warped image
is possible. In the bijective case (for every point in the original image exists a mapping to a
point in the warped image, but also for every point in the warped image exists a point in the
original image) the obtained mapping is invertible. Examples for warps are:
Linear mappings (spatial displacements)
Projective mappings (perspective transformation or homography): do not preserve sizes
or angles. Applications are: image rectification (for removing image distortion), image
1
https://www.nvidia.com/research/inpainting/index.html [accessed 14th of September 2021]
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
18.3 I Spy With My Little Eye S
omething That Is . . . 563
registration (matching of two images for comparing these), camera motion computation
between two images of a scene.
Affine mappings (translation, geometric contraction, expansion, dilation, reflection, ro-
tation, shear and similarity transformations and combination of these): preserve straight
lines and ratios of distances between points lying on a straight line, but does not preserve
angles or lengths.
Figure 18.12 Illustration of image warping techniques. The source image is shown on the left, the
perspectively transformed image in the middle and the affine transformed image (scaled) on the
right. (Image courtesy Roxane Licandro)
Image Morphing
This technique has been developed to seamlessly change (morph) one image into another by
a defined transition function. Its application has become popular, for example, in fantasy
motion pictures or animations to depict people turning into other people or animals. The
technique is a mixture of transforming the source image to the target image using defined
key points and simultaneously fading out the source and fading in the target image. A more
sophisticated fading technique is called cross-fading, where different image parts are grad-
ually transitioned instead of the entire image.
Image Stitching
Image stitching (also called mosaicing) is a technique of computational photography to
produce panorama or high-resolution images out of a set of single acquisitions showing
overlapping views. For this process three steps are required: First, image registration is
performed by aligning matching corresponding features in a set of images. Second, image
calibration has to be performed to adjust the ideal lens model to the used camera lens model
to remove artefacts like distortions, exposure differences or chromatic aberrations (a color
distortion, which creates an outline of unwanted colors around objects with highly contrast-
ing boundaries to their backgrounds). Third, image blending is executed to obtain an output
projection incorporating the calibration and registration parameters computed. The focus of
these techniques lies in providing a panorama image, which appears to be acquired in one
step with no color differences, seam lines or distortions.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
564 18 Computer Vision
■ 18.4 C
omputer Vision Applications & Future
Directions
In recent years, versatile Computer Vision systems and corresponding techniques emerged.
We will have a closer look at image retrieval systems, which require powerful Computer
Vision techniques to find matching images in databases or on websites. We will briefly dis-
cuss recent advances in the field of image- and video-based object identification, classifica-
tion and tracking, and will summarize how these strategies are used in robotics and
self-driving cars. As the last part of this section we will dive into the field of medical Com-
puter Vision applications. For this field we will discuss the specific requirements and chal-
lenges scientists face, when they are developing a medical Computer Vision system and we
will have a closer look at automatic segmentation approaches, anomaly detection and trajec-
tory learning for disease risk prediction. In this section a brief overview of only a subset of
popular applications of Computer Vision can be given. If you are interested to read more, for
example about the field of 3D vision and depth imaging, or if you want to get more informa-
tion regarding the algorithmic details of the fields presented, we recommend the publica-
tions cited in this chapter or the following books:
Richard Szeliski, Computer Vision – Algorithms and Applications Ed. 2, Springer Inter
national Publishing, ISBN 978-3-030-34371-2, https://www.springer.com/gp/book/978
3030343712, 2022, https://szeliski.org/Book/.
Simon J.D. Prince, Computer Vision: Models, Learning, and Inference, Cambridge Press,
ISBN 9781107011793, 2012 http://www.computervisionmodels.com/.
Chi-hau Chen, Computer Vision in Medical Imaging, World Scientific Publishing Co. Pte.
Ltd USA, vol 2, ISBN 978-981-4460-93-4, 2014.
The most popular image retrieval systems we work with on a daily basis are internet or
image database search engines (e.g. https://pixabay.com/), which scan websites or image
data storages for potentially similar images according to color information provided, text, or
description of an image object. Computer Vision techniques are used to compute image
features or unique descriptors of found images, and determine how similar these are in
terms of defined metrics to the query image and corresponding features.
In the medical field, image retrieval systems are of great interest. Clinical systems store all
imaging and medical record data captured in daily routine. Especially in large hospitals like
Vienna General Hospital (AKH), around 77,000 stationary patients and 450,000 patients in
the ambulatories are treated annually, creating huge amounts of routine data every day 2.
2
Annual Report 2020 Medical University of Vienna https://www.meduniwien.ac.at/web/fileadmin/content/
serviceeinrichtungen/oeffentlichkeitsarbeit/jahresbericht/Jahresbericht_2020-MedUni.pdf [accessed 23rd
September 2021]
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
18.4 Computer Vision Applications & Future Directions 565
Computer Vision and machine learning can help radiologists to retrieve imaging and medi-
cal record data. In a case where a patient’s diagnosis cannot be clearly derived from the
acquired medical scan and/or symptoms themselves, an image retrieval system which has
access to the hospital’s Picture Archiving and Communication System (PACS), can help to
find visually-similar cases and corresponding medical literature of differential diagnosis.
contextflow SEARCH Lung CT from contextflow GmbH (https://contextflow.com/), currently
searches for 19 different disease patterns and nodules in lung Computer Tomography (CT)
images. Figure 18.13 shows the graphical user interface of contextflow SEARCH LUNG CT.
The current patient’s query image is shown on the top left. Radiologists can mark specific
regions of interest in the lung (red square) to receive similar cases from other patients in
contextflow’s Knowledge Base (middle panel).
Figure 18.13 Graphical user interface of contextflow SEARCH Lung CT, a medical image retrieval
system for lung CTs. (Image courtesy contextflow GmbH)
After clicking on a reference case, the corresponding image and anonymized patient infor-
mation is displayed along with similarity heat maps, visualized as overlays for the chosen
reference image, allowing radiologists to easy compare similar found regions. On the right
panel, descriptions, differential diagnosis information, and tips and pitfalls are displayed.
A further field where image retrieval systems are of importance are forensics. Shoe prints,
fingerprints or toolmarks are often found on crime scenes and are important evidence of a
crime. These can easily be recorded using image-based acquisition techniques and stored in
data storage systems. Especially image retrieval systems can help here to match the query
image with images in the database. The conditions for acquiring images on a crime scene
are not always optimal, since the potential evidence (e.g. a shoe print) is not always clearly
visible in its full extent, or may be covered with dirt or other stains. The focus of Computer
Vision systems for forensic image retrieval lies in providing robust computations of descrip-
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
566 18 Computer Vision
tors of the query object as well as a possibility to find matching objects of the same kind
even in a different representation format or appearance.
In Figure 18.14 a result of an automatic image retrieval system for footwear impressions,
which was developed at the Technische Universität Wien3, is visualized. This system takes
a query image and searches through thousands of images in a database of the Austrian po-
lice to provide similar shoe prints. The query image is shown on the top left and is marked
with a grey square. The remaining images show the closest results, estimated by the devel-
oped metric, where the images in the green box are the actual matches from the same shoe
(side note: the information regarding the actual match is only available for training and
evaluating the algorithm and was visualized here to demonstrate the potential different
shoe print appearances of the same shoe). Images in the database can come from other
crime scenes or be acquired in a defined setup. Thus, 100% visual matches are never achiev-
able – one of the main challenges in defining image based descriptors in this Computer
Vision application [7].
Figure 18.14 Example of an image retrieval system for forensic investigations. The query image is
marked with a grey box, the most similar images from a store of thousands of shoe print images
are marked with a red box, where the target shoe prints of the query are marked with a green box.
(Image courtesy Manuel Keglevic, TU Wien)
While the medical and forensic domains are examples of just two challenging fields, the
image retrieval systems have diverse other applications, such as in document analysis or
cultural object preservation. In a nutshell, the application of Computer Vision techniques
for image retrieval is strongly influenced by the underlying data, its quality, appearance
uniformity, reproducibility and object/region of interest to be matched.
3
This work was funded by the Federal Ministry of Agriculture, Regions and Tourism (BMLRT)
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
18.4 Computer Vision Applications & Future Directions 567
The next field of Computer Vision applications we will present are object recognition and
detection techniques, and the classification of these objects or the tracking and surveillance
over several acquisitions (that is, multiple frames, as in a video). These Computer Vision
techniques are currently deeply researched in, for example, the automotive industry, virtual
reality applications, or robotics. We would like to refer the reader also to Section 12.6.4,
where also Computer Vision approaches are introduced using machine learning strategies
and convolutional neural networks to analyze images. Here, we can only give a brief over-
view of possible application fields of Computer Vision.
As one of the popular object detection approaches, we would like to introduce YOLO (ab-
breviation for You Only Look Once). It is able to reliably detect objects (persons, animals,
vehicles) in images or videos captured in the wild (that means the data has not been ac-
quired according to a defined setup in a studio) or in art work. Also, with this challenging
data the approach is capable to perform this task in real-time. The output of this Computer
Vision system is a bounding box around the detected object and the corresponding classifi-
cation label. The approach combines deep learning, Computer Vision and image classifica-
tion techniques. More information and technical details are provided in [8].
In the automotive industry object detection found its way into advanced collision avoid-
ance systems, which combine radar or laser technologies with camera and corresponding
object detection frameworks. The aim of these systems is to detect emerging dangers such
as pedestrians, fast breaking cars or barriers on the lane. In this case the Computer Vision
system has to work in real time and, in case of danger, set actions immediately (such as ac-
tivating an emergency braking system). In recent years this concept has been investigated
further for the development of self-driving cars. Figure 18.15 depicts a self-driving minibus
from the Charité Berlin Mitte Hospital campus in Berlin, Germany (http://www.charite.de/).
It was an initiative by Charité, the Federal Ministry for Environment, Nature Conservation
and Nuclear Safety (BMUB) and the governmental public transit service (BVG). The test
phase started in the course of the research project STIMULATE (Charité, Peitz) in 2018 with
an operator on board, in case the system fails. Since 2019 the buses, equipped with several
sensors and camera systems, run completely autonomously on defined campus routes,
transporting students, patients and co-workers at a maximum speed of 45km/h.
Figure 18.15 Self-driving Minibus at Charité Berlin Mitte Hospital Campus. (Image taken from
https://pixabay.com/de/photos/fahrzeug-autonom-autonomes-fahren-4759347/)
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
568 18 Computer Vision
A survey of current expert systems and corresponding applications in the field of self-driv-
ing cars can be found in [9].
In robotics, the object recognition routines of the robot’s visual system are additionally
connected with action items (such as a mechanical arm) and other sensors (distance mea-
sure) to analyze the scene. For example, Max-AI (https://www.max-ai.com/) is an artificial
intelligence enabled robot that identifies recyclables and other items for recovery. Its Com-
puter Vision system integrates object detection techniques in combination with a deep
learning neural network and a robotic arm, trained to detect, classify, grab and sort waste.
It supports and speeds up the recycling process and should be a necessary future direction
of Computer Vision systems, to support existing efforts to reduce waste and thereby protect
our climate.
Medical Computer Vision focuses on the understanding and analysis of medical images. In
this specific field, as in classic photography, the visible spectrum of light is one way to ob-
tain images from the body or body parts (for example dermatoscopy images (a technique to
examine the skin using skin surface microscopy) or histological images (a modality to ex-
amine the microscopic structure of tissues and cells). Other imaging techniques, based on
x-ray absorption characteristics of tissues (computer tomography, radiograph, mammogra-
phy), magnetic properties of hydrogen molecules (e.g. Magnetic Resonance Imaging – MRI),
metabolic or radioactive tracer activity (e.g. Positron Emission Tomography) or the different
reflective behaviour of ultrasound waves in the human body, can be used to obtain scans in
2D, 3D or 4D. The challenge in this field lies in making structures outside and – even more
importantly – inside, a person visible, without harming the body by needing to open and
examine it (exceptions are for example, biopsies or intraoperative imaging). We can differ-
entiate between invasive imaging techniques, whose application to the human body leaves
traces and can only be performed to a certain extent without harming the organism (x-rays,
CT, PET) and non-invasive techniques like low field MRI or ultrasound, techniques preferen-
tially used for imaging sensitive patients (such as fetuses or infants).
In medical Computer Vision applications (image segmentation, registration, classification
or prediction) following dynamics form the major challenges [10]:
Interpatient variability: structures of interest appear differently in scans according to
the patient’s specific phenotype (observable physical appearance).
Intrapatient or longitudinal dynamics: developmental or pathological processes and
surgical procedures can affect the size, position or deformation pattern of the observed
region over time.
Modality dependent variability: structures are visualized differently between image
modalities (varying image contrast, partial visibility).
Imaging protocol and machine caused variability: depending on the protocols or ma-
chines used for imaging, variations in image contrasts, resolution, imaging artefacts and
varying field of view can all be expected.
The application of Computer Vision systems in medicine is versatile, where segmentation
plays an important role. It enables the automatic delineation of a tissue or an organ in scans,
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
18.4 Computer Vision Applications & Future Directions 569
which consequently enables the determination of its size, volume, location and change over
time, and thus the planning of further treatments and surgical interventions.
Figure 18.16 MR acquisition of the fetal brain in-utero (left) and corresponding annotations of
brain tissues (right). (Image courtesy Medical University of Vienna)
When it comes to training a Computer Vision system for a segmentation task, the target
labels will be annotations of the regions of interest, which are produced by a medical expert.
The training process involve the optimization of predicted segmentations by comparing the
system’s outputs with the target region, using metrics which assess the quality of the seg-
mentation and the overlap with the baseline. In Figure 18.16 a slice of an MR acquisition of
a fetal brain in-utero is visualized on the left side. On the right, for the same image, annota-
tions of different brain tissues in different colors are shown [11].
Also in recent years, deep learning based approaches are infiltrating the field of medical
Computer Vision and have resulted in versatile application possibilities (see [12] for a de-
tailed review). To make these techniques also applicable in the medical field, extensive
amounts of data and (depending on the task) also annotations from experts are required.
Obtaining segmentation baselines is time consuming, especially in the case of accurate de-
lineations of complex volumetric structures (like the cortical surface of the brain) and con-
sequently not feasible for large datasets. Thus, recently proposed segmentation approaches
focus on developing routines which are capable of learning segmenting structures while
being trained with scars annotations or even in an unsupervised way (without annotations)
only (cf. [13]). A recent review of medical image segmentation approaches is summarized in
[14].
Another important application of Computer Vision in medicine is image registration. This
technique enables the alignment of images or volumes (3D images) according to defined
landmarks or regions in medical scans. It provides a baseline for patient-specific longitudi-
nal analysis or population focused comparison. It is also a potential strategy to define refer-
ence coordinate systems (e.g. brain atlases, whole-body MRI atlases), to which images can
be aligned for comparison or for the estimation of deviations or anomalies regarding this
reference. A review of recent medical image registration approaches can be found in [15].
The images obtained in the clinical routine are in most cases bound to defined guidelines
and clinical protocols, which vary among clinical institutes, countries and machines. This
requirement must be explicitly observed in large multi-sided studies or when datasets are
merged. Here, image normalization strategies are required to avoid situations where the
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
570 18 Computer Vision
Computer Vision system learns the machine- or protocol-caused variability in the data in-
stead of being trained on the actual variability of the region of interest. Thus, a group of
Computer Vision applications in medicine focus on representation learning on the one hand
and on estimating a mapping between different kinds of data representations (domain ad-
aptation) using unsupervised approaches (see [16] for more details).
Computer Vision systems are also present in the operating room during surgical proce-
dures. Hyperspectral imaging (HSI) is a safe, non-contact and non-invasive optical imaging
modality perfectly suited for the clinic. By splitting light into multiple spectral bands far
beyond the conventional red, green, and blue colours that the naked eye can see, HSI is an
emerging technology that can provide critical, but currently unavailable, information for
real-time tissue differentiation during surgery. In particular, this technology can provide
both semantic (e.g. tumor vs. non-tumor) and functional (e.g. blood perfusion and oxygen-
ation saturation levels) tissue information which can help to increase surgical precision and
patient safety while optimizing resection during tumor surgery. However, HSI data is very
complex and requires advanced computer-processing and machine learning for its real-time
visualization and interpretation during surgery [23]. Following successful first in-patient
clinical feasibility case studies of a real-time intraoperative prototype system [24], Hypervi-
sion Surgical Ltd (https://hypervisionsurgical.com/) is currently working on converting this
novel technology into a commercial medical device for real-time surgical guidance (cf. Fig-
ure 18.17).
Figure 18.17
Picture of a real-time prototype hyperspectral imaging system
during an ethically-approved in-patient clinical study (Ebner et al.
2021 [24]). (Picture with CC-BY license)
There is also an application field in medical Computer Vision, which makes computers not
only see but also foresee which regions have a high risk of developing a disease. In Fig-
ure 18.18 an example of time dependent image-based prediction in multiple myeloma is
visualized [17] [10].
The left side depicts a slice of a magnetic resonance image acquired at the precursor state
of the disease, showing the femur with bone lesions (dark dots) and a detailed view of this
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
18.5 Making Humans See 571
bone region. One year later the patient was scanned again (middle) with observable bone
lesion growth marked in red. The prediction result for that same future time-point, but
based on the precursor state, is shown on the right. The approach is capable of locating po-
tential regions of lesion growth, but, since it is a prediction and not a segmentation task
(more challenging) exact delineations of the target region are not yet achievable.
Figure 18.18 Longitudinal prediction of bone lesion growth risk in multiple myeloma. (Image cour-
tesy Roxane Licandro)
To conclude this section we want to point out- that in medical Computer Vision, the humans
being imaged are always the focus of the procedure, and ethics and clinical priorities have
to be kept. This includes the careful planning of imaging studies, obtaining consent of the
patient for the procedure and for being able to use the data within a Computer Vision system,
the secure storage of and access to the data as well as the careful preparation and prepro-
cessing of it. The standards and required accuracy of medical Computer Vision systems are
usually higher than those of other Computer Vision fields, since false or introduced imaging
artefacts by the system can trigger false diagnoses or treatment estimations, which conse-
quently can cause – in an extreme case – the patient’s death. It is of great importance to
obtain reproducibility of results of the Computer Vision systems, in order to help medical
experts to understand the conclusions the system made. In terms of the evolution of complex
neural networks this has become a huge challenge as in all other Computer Vision fields. So
how can we make humans see and understand what a computer learned to see? This and
related aspects will be discussed in Section 18.5.
Figure 18.19 Schematic simplified illustration of the vision cycle – interaction of computers and
humans in the vision process. (Image courtesy Roxane Licandro)
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
18.6 In a Nutshell 573
And here we are, closing the vision cycle as depicted in Figure 18.19. One open question still
remains: if we let computers see like humans, or, consequently if computers understand
how humans see, can we still believe what we see? The key to responsibility and thoughtful
research will always be the humans in between.
■ 18.6 In a Nutshell
References
[1] D. Marr, Vision, MIT Press Cambridge Massachusetts US, 2010, ISBN 9780262514620
[2] D. A. Atchison, George Smith, Chapter 1 – The human eye: an overview, Editor(s): David A. Atchi-
son, George Smith, Optics of the Human Eye, Butterworth-Heinemann, Pages 3–10, 2000, ISBN
9780750637756, https://doi.org/10.1016/B978-0-7506-3775-6.50005-5
[3] R. Szeliski, Computer Vision – Algorithms and Applications Ed. 2, Springer International Publish
ing, 2022, ISBN 978-3-030-34371-2, https://www.springer.com/gp/book/9783030343712
[4] H.Y. Im, S.C. Chong, Computation of mean size is based on perceived size. Attention, Perception, &
Psychophysics, 71, 375–384, 2009, https://doi.org/10.3758/APP.71.2.375
[5] P. Wenderoth, Visual Illusions. In: Binder M.D., Hirokawa N., Windhorst U. (eds) Encyclopedia of
Neuroscience. Springer, Berlin, Heidelberg, 2009, https://doi.org/10.1007/978-3-540-29678-2_6356
[6] D. Alleysson, S. Susstrunk and J. Herault, “Linear demosaicing inspired by the human visual
system,” in IEEE Transactions on Image Processing, vol. 14, no. 4, pp. 439–449, 2005, http://doi.
org/10.1109/TIP.2004.841200
[7] M. Keglevic, R. and Sablatnig, Retrieval of striated toolmarks using convolutional neural net-
works. IET Comput. Vis., 11: 613–619, 2017, https://doi.org/10.1049/iet-cvi.2017.0161
[8] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You Only Look Once: Unified, Real-Time Object
Detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 779–788, 2016
[9] C. Badue, R. Guidolini, R. Vivacqua Carneiro, P. Azevedo, V. B. Cardoso, A. Forechi, L. Jesus, R.
Berriel, T. M. Paixão, F. Mutz, L. de Paula Veronese, T. Oliveira-Santos, A. F. De Souza, Self-driving
cars: A survey, Expert Systems with Applications, Volume 165, 2021, 113816, ISSN 0957-4174, https://
doi.org/10.1016/j.eswa.2020.113816
[10] R. Licandro, “Spatio Temporal Modelling of Dynamic Developmental Patterns”, TU Wien, March
2021. https://doi.org/10.34726/hss.2021.39603
[11] R. Licandro, G. Langs, G. Kasprian, R. Sablatnig, D. Prayer, E. Schwartz, “Longitudinal Atlas Learn-
ing for Fetal Brain Tissue Labeling using Geodesic Regression”, WiCV Workshop at the IEEE Con
ference on Computer Vision and Pattern Recognition, 2016
[12] A. Esteva, K. Chou, S. Yeung et al. Deep learning-enabled medical Computer Vision. npj Digit. Med.
4, 5, 2021, https://doi.org/10.1038/s41746-020-00376-2
[13] N. Tajbakhsh, L. Jeyaseelan, Q. Li, J. N. Chiang, Z. Wu, X. Ding, Embracing imperfect datasets: A
review of deep learning solutions for medical image segmentation, Medical Image Analysis, Vol-
ume 63, 2020, 101693, ISSN 1361-8415, https://doi.org/10.1016/j.media.2020.101693
[14] X. Liu, L. Song, S. Liu, Y. Zhang, A Review of Deep-Learning-Based Medical Image Segmentation
Methods. Sustainability, 13(3):1224, 2021, https://doi.org/10.3390/su13031224
[15] Haskins et al. 2020, “Deep Learning in Medical Image Registration: a Survey”, Machine Vision and
Applications, 31:8, 2020, https://doi.org/10.1007/s00138-020-01060-x
[16] N. Tajbakhsh, L. Jeyaseelan, Q. Li, J. N. Chiang, Z. Wu, X. Ding, Embracing imperfect datasets: A
review of deep learning solutions for medical image segmentation, Medical Image Analysis, Vol-
ume 63, 2020, 101693, ISSN 1361-8415, https://doi.org/10.1016/j.media.2020.101693
[17] R. Licandro, J. Hofmanninger, M. Perkonigg, S. Röhrich, M.-A. Weber, M. Wennmann, L. Kintzele,
M. Piraud, B. Menze, G. Langs, “Asymmetric Cascade Networks for Focal Bone Lesion Prediction
in Multiple Myeloma”, International Conference on Medical Imaging with Deep Learning (MIDL),
2019. https://arxiv.org/abs/1907.13539
[18] P. Linardatos, V. Papastefanopoulos, S. Kotsiantis, Explainable AI: A Review of Machine Learning
Interpretability Methods. Entropy, Volume 23, 18. 2021, https://doi.org/10.3390/e23010018
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
References 575
[19] E. Zakharov, A. Shyshey, E. Burkov, V. Lempitsky, Few-Shot Adversarial Learning of Realistic Neu-
ral Talking Head Models. https://arxiv.org/abs/1905.08233
[20] Judge H.B. Dixon Jr (Ret.), “Deepfakes: More Frightening Than Photoshop on Steroids”, The Judges’
Journal, Volume 58, Number 3, 2019
[21] U. A. Ciftci, İ. Demir and L. Yin, “How Do the Hearts of Deep Fakes Beat? Deep Fake Source Detec-
tion via Interpreting Residuals with Biological Signals,” 2020 IEEE International Joint Conference
on Biometrics (IJCB), pp. 1–10, 2020, https://doi.org/10.1109/IJCB48548.2020.9304909
[22] S. Agarwal, H. Farid; Detecting Deep-Fake Videos from Phoneme-Viseme Mismatches, Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp.
981–989, 2021
[23] L., Peichao, M. Ebner, P. Noonan, C. Horgan, A. Bahl, S. Ourselin, J. Shapey, and T. Vercauteren.
2021. “Deep Learning Approach for Hyperspectral Image Demosaicking, Spectral Correction and
High-Resolution RGB Reconstruction.” MICCAI Workshop on Augmented Environments for Comput
er-Assisted Interventions, Computer Assisted and Robotic Endoscopy, and Context Aware Operating
Theaters (In Print), September, 12. http://arxiv.org/abs/2109.01403
[24] M. Ebner, J. S.Eli Nabavi, X. Yijing Xie, F. Liebmann, J. M. Spirig, A. Hoch, et al. 2021. “Intraopera-
tive Hyperspectral Label-Free Imaging: From System Design to First-in-Patient Translation.” Jour
nal of Physics D: Applied Physics 54 (29): 294003. https://doi.org/10.1088/1361-6463/abfbf6
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
19
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
■ 19.1 Introduction
The aim of this chapter is to describe the standards in modelling and simulation, with
special focus on the description of different modelling methods and their usage. These will
be illustrated using a range of application examples, from infectious disease modelling
(COVID 19) and transportation simulation, which will highlight both model calibration and
discrete process simulation, to simulation of inventory policy. In order to show how data
science is integrated in the modelling process and in the interpretation of results, we will
begin with an overview of a modelling process in general. Then we will briefly describe
different modelling methods and their advantages and disadvantages. The subsequent sec-
tions will explain how to handle a model from parametrization and calibration to verifica-
tion, validation and, finally, to simulation experiments and scenarios that provide results.
Besides building the model, the running of simulation models is essential. A simulation
executes the model with a defined parametrization, enabling you to validate the logic of
your behavioral model. Analyzing the simulation results, their graphical interpretation, and
classical statistics, are all part of the realization of a modelling and simulation project. So
too is the explanation of simulation results based on a (high) number of simulation runs of
a model with stochastic parameters.
Some PDEs have exact solutions, but in general, numerical approximation solutions are
necessary. Which numerical standard method to use depends strongly on the underlying
described system, which is defined by a PDE, and the domain and required level of detail of
the research question. The finite difference method, for instance, works by approximating
the derivatives in the PDE and then, using a large number of incremental values of the in-
dependent variables, calculates the unknown function at each of those values.
The finite difference method is often regarded as the simplest method to learn and use. The
finite element and finite volume methods are widely used in electro technics and fluid sim-
ulation. Multigrid methods are also a standard method in application. In general, PDE mod-
elling and simulation can be seen as a separate working area and scientific discipline, which
is not a part of this chapter.
Reality
↓
System
Idealised System
Mathemacal Relaons
Non-causal Model
Causal Model
↓ Identification of parameters
Quantave Model
ings will be detected, and assumptions may be questioned. Beneath all the challenging and
possibly cost intensive tasks related to achieving reproducibility, one should keep in mind
the benefit of these efforts. There is a need to pay special attention to documentation, visu-
alization, parameter formulation, data preparation [10], verification and validation. For de-
tailed information concerning this process, the work of Popper gives further insights [11].
Understanding the lifecycle of the development process behind a modelling and simulation
project is essential for talking about reproducibility. The reason for this is that one needs to
know exactly what information is produced in which phase of the development process.
Combined with parameter formulation, documentation, and verification as well as validation,
understanding the lifecycle is crucial to produce reliable and usable results. It is through
this, that knowledge confirming hypotheses (rectification) can be gained or wrong hypothe-
ses (falsification) can be identified.
As modelling and simulation can be performed in several domains, depending on the field
one is working on, there may be huge collections of hypotheses due to missing information.
In some domains, for example modelling archeological transportation pathways or natural
disease progression of cancer patients where ethical aspects do not allow to observe un-
treated behavior of identified cancer cases, one has to mainly build on subjective observa-
tions and characterizations of the system instead of given parameters (“Black Box” Model-
ling, compare Figure 19.3).
In this case falsification of assumptions and hypotheses can be a good research question, for
example, to reduce the number of possibilities and the parameter estimations. What sounds
disturbing for domains of classical engineering might be helpful and interesting in the field
of health system research. So, one needs to stay flexible, open-minded, and not to forget to
think in all directions.
• Electrotechnique • Laws
Figure 19.3 From White Box Modelling, such as an ordinary differential equation of a well-known
physical process, to Black Box Modelling, where the detailed rules are not known.
When looking at functional relations used by the model the amount of a priori information
known can vary. In the worst case only the input and output functions are available. The
system requires that this input matches with the appropriate output. Without any further
information the mechanism behind the correlation is an estimated function. For looking at
the correlation between hypertension and the amount of sport a person does, the complete
equation system describing the impact of sport on the human body is not available; there-
fore it has to be estimated by a function or look-up table. The process that is happening in
the background is a complete black box (Figure 19.3).
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
19.4 Reproducibility and Model Lifecycle 583
On the other hand, calculating the stability of a broken bone that is fixed with nails is more
of a white box. Although some parameters describing bone structure are not known exactly
the interaction between the bone and nails can be described using exact physical functions.
As the last example shows Black-Box and White-Box are not two completely distinct model-
ling techniques but there is a fluent passage from one conception to the other. White-Box
models provide more insight into actual dynamics. Physical parameters and system behav-
ior can be changed and analyzed in detail. They are far closer to reality, explainable and
therefore better to provide transparency to the results. Their drawbacks are that they need
much more knowledge than Black-Box models, are more complex, more processing inten-
sive and the parameter identification can be quite difficult. Nevertheless White-Box models
should be chosen over Black-Box models whenever possible.
Lifecycle means the process of developing a model and implementing it, including its param
etrization, as a simulation in a well-defined simulation environment. To understand what
reproducibility signifies, it is essential to first look at this subject.
At first the basic constituents have to be defined, namely, phases, contained concepts and
resulting deliverables.
In general, a modelling and simulation project evolves rather iteratively, more like in a spi-
ral process than in a straight-forward manner. The model normally has to be redefined
several times, until it is ready to fulfill the preset goal. Importantly, one has to keep in mind
that a model can never depict reality one-on-one but only a (simplified, limited) portion of it.
To make abstractions and assumptions as well as defining boundaries is fundamental for
modelers to get a simple and over all computable view, which is yet also detailed and com-
plete enough to answer the study question. That is a reason for the consistent revision of the
process, until the right assumptions and abstractions are specified. The basic structure of
the lifecycle is listed below, and a generalized view of a modelling and simulation lifecycle
is shown in Figure 19.4, which is based on the work of Sargent [12] and Niki Popper [11]
and was slightly adapted by the author.
Computer Programming
and Implementaon
Presentaon
of Simulaon
Results igure 19.4
F
Generalized lifecycle of creating
a model
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
584 19 Modelling and Simulation – Create your own Models
The information you get is the basis for every modeling and simulation study. This input can
be manifold - it can, for example, consist of studies, databases, expert knowledge, or statis-
tical evaluations. We may differentiate between general and data information. While general
information is non-measurable and often subjective input, data contains quantifiable and
rather objective information.
Looking at the lifecycle one can observe that in each of its stages, data and/or general infor-
mation is transformed into an output. This output is used as an input by the subsequent
stages, as depicted in Figure 19.5.
Data and general information enter the model in the form of parameters through a transfor-
mation process. As the term “parameter” is used in different meanings across various dis
ciplines, keep in mind that the mathematical or computer science view is used here.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
19.4 Reproducibility and Model Lifecycle 585
In Figure 19.6 an extended version of a lifecycle is depicted which is based upon Balci [13,
14]. Here you can get more information about how and when the definition and transfor
mation of parameters, and their parametrization and calibration take place.
Communicated
Problem
Formulated Problem
Proposed Soluon
Technique
System Knowledge
and Objecves
Simulaon Results
Conceptual Model
Experimental Model
Figure 19.6 Detailed lifecycle of a simulation study reprinted by permission of the publisher
The modelling lifecycle starts with collecting general information and data. Both are trans-
formed into structural knowledge about the system, which happens by contributing to the
“System Knowledge and Objectives” phase. The data values themselves are not of impor-
tance in this phase of the lifecycle; the only essential part here is to determine which infor-
mation could be useful for the model. Having such an input it may, for example, be possible
to define first boundaries.
For the conceptual model, structural knowledge must be transformed into information that
the model needs or produces. This means that the modeler tries to establish dependencies
between the types of information the model will likely use. Concrete parameters are then
defined in the communicative models. These serve a specific audience and are guided by the
gathered general knowledge and the format of the data. This phase produces a well-defined
set of parameters that can answer the given problem. It is normally a minimal set of param-
eters, where abstractions and generalizations are already applied.
As soon as proper communicative models have been developed with a minimal set of param-
eters and outputs, the implementation can start. Afterwards, the (mathematical) notion of
parameters and outputs are transferred into programming-language specific constructs
and, thus, result in the Programmed Model.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
586 19 Modelling and Simulation – Create your own Models
At this point in the lifecycle, it is possible to run the simulation. While it is not necessary to
sweep through the whole range of possible parameter values, it is necessary to find reason-
able parameter values. For this reason, the model is now subjected to experimentation and
becomes the Experimental Model.
Depending on the type of parameter under consideration, there are two different techniques
for finding reasonable values:
Parametrization (known parameter values): In this case parameter values can be de-
rived from data values which are known a priori. If the data cannot be transformed in a
way that all parameter values can be inferred, further investigations are necessary. This
can be done by broadening the databases, gaining new data by conducting experiments
or others. If this is not possible one has to switch to
Calibration (unknown parameter values): These values must be estimated, assisted by
given data (gained through, for example, statistical evaluations, studies, previous simu
lation runs, and so on). After running the experimental model, the simulation output is
compared to a possibly known output data or constraints governing the output which can,
for example, be given by experts from the specific domain.
If the simulation model produces an output that appears to be reasonable, the calibra-
tion task is finished, and the resulting parameter values are used for further simulation
runs.
If the simulation output does not fit, the unknown parameter values have to be esti-
mated again and the experimentation process has to be reiterated, often supported by
mathematical optimization.
If calibration does not work, a redefinition of the model might be necessary or one of
the known parameter values may have to be questioned. In this case the modelling
process may have to start all over again due to a possibly changed knowledge about the
system.
One has to keep in mind that in some domains, such as social sciences, imposing con-
straints on what could be an output is already an essential contribution in research. In that
context one can argue that the purpose of modelling and simulation studies in some do-
mains does not primarily lie in the generation of results, but in an exclusion of unlikely
scenarios that would possibly enter the scientific thought process and lead to incorrect
conclusions. The definition of constraints is a positive side-effect of performing such a study.
It may even lead to the insight that a previously supposedly well-understood problem needs
to be re-examined and clarified.
The so-called qualitative modelling is in several cases already the benefit that is most im-
portant for the decision-maker in industry or research.
In case of healthcare modelling, the problem definition and research question is at best
defined by an interdisciplinary board. The benefit of defining the demand and point of view
is normally decoupled from data and available information, which is in general the basis for
choices of modelling technique . The chosen modelling technique results in feedback about
the data needed, the data itself iteratively influences the method and of course information
has to be collected if a hybrid model-decomposition or sequential coupling of different mod-
elling methods is needed to solve the problem. These processes are in reality iteratively
developed; therefore, the whole model implementation should be realized in modules, so
that high flexibility and the ability to be validated is guaranteed (Figure 19.7, [15]).
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
19.4 Reproducibility and Model Lifecycle 587
Feedback to model
pool
19.4.3 Documentation
Without documentation, reproducibility is hardly possible. There are three forms of docu-
mentation commonly used in the donation of modeling and simulation:
1. textual documentation,
2. visualization, and
3. the source code itself, which necessarily needs its own, thoroughly written and main-
tained documentation.
It is hard to find a way to make good documentation accurately and efficiently in a project
team. But there are several helpful techniques, such as always requiring that documenta-
tion be written in Basic English. There exist several guidelines, like the ODD (Overview,
Design Concepts, and Details) protocol [16].
Visualization is crucial besides textual information when trying to document and validate
simulation models and to make simulation models reproducible. The process of modelling,
and fundamental structural ideas concerning the model, can often be better presented via
visual concepts. Note that visual analysis can also support exclusion of certain hypotheses,
possibly even more so than proving with reference to some data.
Visualization includes the data analysis as well as modelling process and structure. In par-
ticular, the visualization of the modelling structure can be very helpful in the discussion
process in bigger teams, to gain information on dependencies of variables and feedback
loops.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
588 19 Modelling and Simulation – Create your own Models
The documentation of the source code in a modelling and simulation model is similar to
good practice in other software projects. Certain practices should be strictly adhered to,
including, the well-known basics like choosing proper names, the do one thing principle,
and the don’t repeat yourself principle.
Verification and validation are processes that need to happen in parallel to the development
of the model as well as the simulation. Their very important aim is to guarantee a targeted
development of the simulation study.
Verification answers the question “Is the model developed right?” which needs to be
asked in a modeling and simulation study. Verification tells us if a model is implemented
correctly.
Validation deals with the question “Is the right model developed?” which is another
essential research question that needs to be looked at. Validation addresses this task.
While verification and validation are usually performed by the development team itself,
a better idea is to have this part performed by an independent team consisting of people
familiar with modelling together with people connected to the field of study (for instance,
specialists in health care, infrastructure, production and logistics).
A slightly different formulation concerning the processes of verification and validation is
addressed by the following questions [17, 18]:
Is the model specification complete and is the model implementation correct? (Verifica-
tion)
Does the model resemble the system under examination and its behavior closely enough
for answering the questions of the study? (Validation)
In general, it is not possible to conclusively answer these questions with yes. One can rather
conduct several tests trying to falsify the correctness or the validity of the model. Each test
the model passes, adds to its credibility. Especially the process of validation should go on
while the model is in use, as each additional piece of knowledge and data can allow new
tests and, thus, new insights. For example, a model with the purpose of comparing the
standard treatment of an illness with a future disease management program should be
revalidated when the program is already in use and has produced enough data for a new
test.
Though the most important verification and validation tests naturally take place after model
development, it is wise to perform them after each iteration step of the modelling study, as
this approach can save a lot of time and money. Law [19] explicitly warns against just per-
forming validation tests after development of the model or – even worse – not at all if there
is not enough time and money left. Furthermore, independent verification and validation by
knowledgeable people who are not part of the model development team adds to the credi
bility of a model, as model developers have a natural interest in positive verification and
validation results [17].
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
19.4 Reproducibility and Model Lifecycle 589
19.4.4.1 Verification
Verification of simulation models does not differ from verification of computer software in
general. It is based on a complete and consistent specification of the model. Tests concern-
ing errors in the code and verification of fully specified model parts can take place before
the complete model specification is available.
There are two categories of verification tests: static tests and dynamic tests [18]. Static tests
analyze the structure and the code of models without execution, whereas in dynamic test-
ing, model execution and the generated results and behavior are used for verification. In
detail:
Static Verification Tests
Cross-Check: A skilled person that has not developed the model (or the concerned part
of the model) examines the (commented) code of the model with help of the specifica-
tion
Structured Code Walk-Troughs [19]: The implementation of the model is presented in
group meetings, which ideally consist of both members of the model development team
and external people. The developers present each part of the model in detail and ex-
plain the code. The audience can criticize and audit the correctness.
Structural Analysis: This approach uses a control flow graph of the model structure. The
graph is a representation of all paths which the model might traverse during its execu-
tion. An examination of the graph can reveal structural anomalies, such as multiple
entry and exit points, excessive levels of nesting within a structure and the use of un-
conditional branches [20].
Formal Methods: Formal methods try to prove the correctness (as far as implementation
is concerned) of a model [19]. When it can be done, it is extremely useful, but these
methods are often not feasible for reasonably complex models.
Dynamic Verification Tests
Double Implementation: Two independent implementations of a model have to produce
the same output with the same input and parameter settings. The teams which make
the different implementations should of course consist of different people.
Unit Testing: Each component of the model is tested individually. This is a bottom-up
approach, as one starts with the smallest building blocks (for example functions) and
tests increasingly larger structures.
Structured Debugging Walk-Throughs: This is appropriate when code contains bugs
which are leading to execution errors. In this test, the model execution of various test
cases is followed line by line. It allows the examination of the value of every variable at
each state. Thus, the audience can see at which code lines the model execution leads to
an error.
19.4.4.2 Validation
Even a correctly implemented model can be a bad representation of the investigated system.
Hence, validation is an important part of the model development process. In contrast to
verification (where there exist formal, exact methods in theory) there cannot be a perfect
validation. To give an example: in order to know whether a model produces exact values for
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
590 19 Modelling and Simulation – Create your own Models
level of detail is enough). Therefore, one has to test if there are omitted parts which
could have an influence on model outcome. The model passes the test if for no parts a
theory can be built which would explain such an influence (in this stage it is a struc-
tural test). If a part could still have a significant effect, the model builders have to inte-
grate it into the model and test the model with and without the part. The test is passed
if there are no differences in model behavior (therefore this is a test of model behavior).
■ 19.5 Methods
Based on the theory explained in the first part of the chapter, the next step is the definition
and description of standard modelling methods used in a broad range of real-world appli
cations. The theory, examples, and advantages and disadvantages are listed. The concept is
based on modelling method classification, starting with two macroscopic methods (Ordi-
nary differential equations, System Dynamics) followed by the most used microscopic tech-
niques (discrete event simulation, agent-based modelling).
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
592 19 Modelling and Simulation – Create your own Models
19.5.1.1 Theory
Differential equations are the most used method to describe dynamic models. They can be
divided into two classes:
Ordinary Differential Equations (ODEs)
Partial Differential Equations (PDEs)
An ODE is an equation relating a function of one independent variable with its derivatives
and other functions. An ODE system consists of several equations relating several functions
and their derivatives.
Modelling progress of state variables over time induces time as an independent variable
[23]. When talking about differential equations in the field of health systems research, HTA
(Health Technology Assessment) or economics, these are usually explicit ordinary differen-
tial equations. Hereafter the regarded equations have the form
where x is a vector of state variables.
Many simulation programs, although not mentioning ODEs directly, use them in the back-
ground. System dynamics, for example, provides a modelling concept and graphical inter-
face to implement models but in the background uses ODEs to calculate results.
Example
The first epidemic model explicitly analyzed with ODEs comes from Kermack & McKendrick
[24]. The population is split up into three groups: susceptible (S), infected (I) and recovered
(R) where recovered people remain resistant against infection. The parameter b is the infec-
tion rate, g the recovery rate.
Equations of the Kermack & McKendrick SIR model:
The work of Kermack & McKendrick provided the foundation for various other models, like
population dynamics, further state variables for inhomogeneous populations or disease
spreading dynamics.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
19.5 Methods 593
Advantages
ODEs provide a very mathematical and often challenging way of describing dynamic sys-
tems. Their advantages are that they can accurately describe physical laws and the use of
continuous time, which means events are not restricted to certain time points. Changes in
the system behavior can be functions of time or can happen because of relationships be-
tween state variables. There are various tools to analyze ODE systems and many mathemat-
ical methods to investigate system behavior.
Disadvantages
Huge ODE systems are very confusing. Enlarging the system can be very difficult and often
requires changes in the model structure. Adding the distinction between male and female
in a basic SIR-model, for example, not only doubles the number of states but requires every
single equation to be revised. The biggest disadvantage is that without mathematical knowl-
edge it is neither possible to create an ODE system nor to understand the equations and the
impact of small adjustments.
SD depicts systems as consisting of a set of levels (also called stocks; one can imagine them
as reservoirs or water tanks) which are connected by flows or rates (one might imagine
flows as pipes). Each level stores a numerical state. Flows change the values of states over
time (see the realization using Vensim simulation environment in Figure 19.8).
parameter
auxiliary variable
inflow rate oulow rate
Level 1 Level 2
flow oulow
inflow
Figure 19.8 Elements of the System Dynamics notation. A continuous flow goes from one level to
the other. There is both an inflow into and an outflow from the system. Inflow rate, parameter and
outflow rate are constant parameters. Single arrows show causal connections between quantities.
For more complex models the use of auxiliary variables is helpful. These are variables that
can be calculated algebraically from levels and other auxiliary variables. The use of auxil-
iary variables does not change the model behavior, but it adds clarity and helps in model
development.
Figure 19.8 shows the graphical System Dynamics notation of a simple system with two
levels. It contains both an inflow and an outflow. The flow (which could consist of material,
people, information or anything else according to the units stored in the levels) from Level 1
into Level 2 depends directly on the value of the former, and indirectly (via the auxiliary
variable whose definition could contain any sophisticated calculation that uses Level 2 and
the constant parameter) on the value of the latter.
levels, flows or auxiliaries) and fills in the equations. On the other hand it makes a qualita-
tive analysis possible and gives valuable insight. Furthermore, it is easy to extract feedback
loops out of the causal loop diagram.
auxiliary variable
+
+
flow + Level 2
+
igure 19.9
F
Positive (reinforcing) feedback loop for Level 2.
Plus signs label positive causal relations
Feedback loops are positive or reinforcing if the number of negative causal relations in the
feedback loop is even. An example for this is shown in Figure 19.9, where larger values of
Level 2 lead to a larger flow into Level 2. On the contrary, feedback loops are negative or
dampening if the number of negative causal relations in the feedback loop is odd. An exam-
ple for this is shown in Figure 19.10, where the flow out of Level 1 rises with the value of
Level 1.
-
Level 1 - flow
igure 19.10
F
+
Negative (dampening) feedback loop for Level 1.
The m
inus sign labels the negative causal relation
between flow and Level 1
Advantages
System Dynamics takes a global view of the system, which helps to focus on the important
dynamic relationships.
The methodology describes a structured process, from the identification of important
quantities and their causal relationships (causal loop diagrams) over the construction of
stock and flow diagrams to the specification of the equations.
The graphical notation is intuitive and easy to learn.
Skills in a programming language are not necessary.
Most System Dynamics models execute very fast (some in seconds) which makes it possi-
ble to do a lot of simulations and sensitivity analyses.
Mathematical tools for analysis of systems of differential equations are fully applicable to
System Dynamics models.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
596 19 Modelling and Simulation – Create your own Models
Disadvantages
System Dynamics – as a top-down approach – treats quantities inside one level (for exam-
ple patients) as homogeneous. If this is not a valid assumption one must introduce sepa-
rate levels, which can increase model complexity.
The approach cannot incorporate spatial relationships.
It is necessary to think in global terms and quantities. Often, thinking of the behavior and
reactions of individuals is more intuitive.
Quantities are continuous. This might be problematic if the number of entities looked at
is low.
The graphical notation can get quite confusing in complex models with many causal rela-
tionships (some simulators allow vector-valued quantities, which might help).
19.5.3.1 Theory
Discrete event simulation is one way of building up models to observe time based (or dy-
namic) behavior of a system. There are formal methods for building simulation models and
ensuring that they are credible. During the experimental phase the models are executed
(run over time) in order to generate results. The results can then be used to provide insights
into a system and serve as a basis for decisions.
The main characteristic of Discrete Event Simulation is that the system state does only
change at certain time points when events occur. Time moves from one of these events to
the next, the time in between is of no relevance.
Each event has a time of occurrence. If an event takes place it may cause changes to the state
of individual objects as well as the system itself. These changes occur right at the time of the
event or after a certain time delay, but not slowly over time as it may happen in continuous
simulation. Any changes happen within a certain time point.
The occurrences of events and the points in time at which they take place create the time-
line of the simulation run.
Discrete Event Simulation Specification (DEVS) is a widely used approach for the modelling
and simulation of dynamic discrete systems. The modern object-oriented DEVS worldview
regards active objects (entities) passing passive objects (stations) along a given path.
Event List
The occurrence of events and their order need to be controlled in some way. In discrete
simulation software programs this is usually done by a so-called event list or event chain.
This list contains all future events as well as some additional information like the time of
occurrence and optionally a priority. During the simulation run, new events are added to
this list. Only if all events are executed and the list is empty has the simulation run reached
its logical end. Of course, additional conditions for ending the simulation run can be set, and
then the simulation run will be stopped even if the event list does still contain future events.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
19.5 Methods 597
One big concern in Discrete Event Simulation is the handling of events that take place at the
same time. Usually, the event list contains all events in order of their occurrence. Events
that will take place at the same time are simply listed in the order they have been added to
this list.
To ensure the correct order of events, priorities have to be assigned to give a ranking for the
execution. This corresponds to the Select function in Definition 2, below.
where
X is the set of input values
S is the set of states
Y is the set of output values
is the internal transition function
is the external transition function,
where
is the total state set
e is the time elapsed since the last transition
denotes the collection of bags over X (a sets in which some
elements may occur more than once)
is the confluent transition function
is the output function
is the time advance function
where
are the sets of input and output values of the coupled model
D is the set of component references, so that for each is a DEVS model
For each is the set of influencer models on subsystem d
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
598 19 Modelling and Simulation – Create your own Models
otherwise
select: is a tie breaking function for simultaneous events; it must verify
select the set of components producing the simultaneity of events
19.5.3.3 Example
A typical event would be the entering of a queue in front of a server. Examples for discrete
event models are classic server-queue models like supermarkets or emergency rooms as
shown in Figure 19.11 [33]:
Queue 1
0
Doc 2
Entrance
Out: 0 Registraon
Reg Queue
0
Exit Doc 3
In: 0
Queue 2
0
Doc 4
A patient enters the emergency room and moves to the registration desk. If several patients
are already waiting, they have to remain in the queue in front until all patients that have
entered before them are done. Entering and leaving the queue are events that cause the
content of the affected objects to be increased or decreased and the patient to be moved from
one object to another. As soon as the patient is finished at the registration, several events
take place: the patient moves from the registration desk to the waiting area, so the state of
the server representing the registration switches back to idle. The content is 0: the server is
ready to accept the next patient in line. The waiting area is a queue in front of the casualty
ward; its content is increased by one as soon as the patient enters. If a doctor is free, the
next event is leaving the waiting area and entering the casualty ward. The server that rep-
resents the doctor is now set to “busy” until the treatment is finished and the next event –
leaving the casualty ward – takes place. Depending whether the patients require further
treatment, they move on either to the plaster room, the x-ray room or they are done and
leave the emergency room.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
19.5 Methods 599
Advantages
Discrete Event Simulation in general allows a very clear differentiation between the struc-
ture and dynamic behavior of a system. Reducing this dynamic behavior to a series of events
that take place at certain points in time makes this method of modelling a very useful ap-
proach for a wide variety of applications. The hierarchical structure of a system can be
easily described, and the concept of object-oriented modelling can be well integrated.
Disadvantages
The main problem with event-based modelling is the occurrence of events that take place at
the same point of time. This may quickly lead to a distortion of the course of events, based
on the wrong order of changes to the system. It is usually necessary to implement a method
to control the correct order of events to create a valid bepresentation of the system behavior.
19.5.4.1 Theory
Agent-based modelling is a relatively young discipline that has become possible with pow-
erful computers in the last decade of the 20th century.
What is an Agent?
An agent is an actor, an individual component that does something more or less inde-
pendently. It is not necessarily a simplified person.
Definitions in literature are diverging and often focused on usage of these agents in concrete
applications.
By combining the ideas in literature, an agent can generally be described as a model that
contains components that fulfil the following characteristics [34, 35, 36]:
Unique identification
Possession of individual properties that can change over time
Acting based on individual, adaptable rules
Possibility to communicate with the environment and other agents
Furthermore, it is necessary to provide an environment in which the agents exist.
Consequently, the freedom for developing agent-based models is gigantic and the variety of
different modelling ideas is enormous. This can be interpreted simultaneously as an advan-
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
600 19 Modelling and Simulation – Create your own Models
tage, with respect to flexibility of the modelling method, and as a disadvantage, with respect
to reproducibility and documentation of a model [37].
In contrast to cellular automata which were in principle developed for physical applications
and a basis for the historical developments, the origin of so-called agent-based models lies
within the social sciences. As an inconvenient result, there is no common base that allows
a formal definition of agent-based models. A lot of good literature is available providing
ideas and tutorials to establish an agent-based model for a given question [35, 38, 39]. Yet,
it is almost impossible to state a common denominator. Therefore, agent-based modelling
should rather be seen as a general concept than a modelling method.
Nevertheless, a short overview of the idea and the structure is given in the following part.
The three main characteristics of agent-based models are [34]:
Modelling of emerging behavior
Natural description of a system
Flexibility
Modelling of emergent behavior. The invention of cellular automata showed modellers that
even simple rules in such an automaton can lead to complex and impressive patterns. In
agent-based models such behavior can be observed even more prominently. Simple agents
with simple rules can result in a complex system behavior. While it is easy to describe these
rules it is hard and sometimes impossible to describe the resulting behavior precisely.
This is an important benefit because it gives us the possibility to model complex systems in
an easy way.
Natural description of a system. Agent-based models provide a natural description of the
real system. That means:
Agents in the model look like components in reality and act in a way that can be observed
in the real system.
Parameters in the model correspond with quantities that occur in the real system.
Flexibility. Especially important are these points:
Scalability: The number of agents can be increased without limitations, while the struc-
ture of agents is not affected.
Level of detail: The level of detail regarding how agents are modelled is generally not
restricted because it depends on the number and complexity of rules and attributes of
agents.
Flexible rules: Rules of agents can be adapted easily for the set-up and even during sim-
ulation without affecting the structure of other agents and of the environment.
Parameterization of agents can be set individually and can be changed any time during
simulation.
Technical Advantages
Possibility to model dynamic effects, whose dynamic behavior is only little-known, by
well-known rules.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
19.6 Modelling and Simulation Examples 601
Flexible modelling environment that provides scalability for size and details in a simple
way.
Direct usage of data (parameters, relations between details). Ideally, no adjustment is
necessary.
Extension of agent-based models is often less work than extension of other model ap-
proaches.
Advantages in Application
Knowledge about mathematical theories is not necessary.
Model structure is very clear. Non-modelling project partners (economists, medics, and so
on) understand the model structure and can provide important input to the model.
Technical Disadvantages
Non-consideration of agent rules or wrong agent rules may result in incorrect global sys-
tem behavior.
Data for agents (parameters, rules) are sometimes not completely available. Then a com-
prehensive and often expensive calibration is necessary.
There are very limited mathematical methods for analyzing agent-based models. Often,
statistical evaluation of results is the only possibility.
The initial creation of an agent-based model can require more programming work than
other models.
Simulations with many agents often require powerful computer systems and long run-
times; availability of clusters may be a restriction.
Dangers in Applications
Agent-based models look very “real”. The feeling of having a correct model must not re-
place comprehensive validation.
Danger of wrong usage and interpretation by non-modelers because the model looks so
“real”.
19.6.1 D
ynamic Modelling of Railway Networks for Optimal Pathfinding
Using Agent-based Methods and Reinforcement Learning
This dynamic agent-based model tries to find the most time-efficient routes for the defined
agents; they need to find their way through a network with minimal disturbance of other
agents. It is based on the railway network operated by the Austrian Federal Train Agency
(ÖBB). The non-blocked tracks of the railway network obey an historical timetable. Due to
their dynamic nature during a typical time period, classical path finding algorithms such
as A* are not sophisticated enough to allow agents to find viable paths through the system.
Thus, a planning component for the agents was introduced, which is based on DynaQ+,
a reinforcement learning method.
Reinforcement Learning
The goal is to learn behavior patterns by maximizing rewards based on the actions taken by
the agents. In each node (or, using the terms of Reinforcement Learning, “state”), exists a
number of actions which indicate the next station the agent will arrive and the probability
of how favorable the actions in this specific state are. The aim of Q-Learning is to find the
optimal state-action policy for each state. This is done by updating Q values of one state-
action pair using the Bellman equation, which takes into account the values of the resulting
next state.
Here, Q(St, At) defines the Q-value of an agent in state St and choosing action At . Obviously,
the Q-value on the left-hand side of the equation presents the new calculated value, while
occurrence on the right-hand side defines the old value before it is updated. The term in the
brackets defines the Temporal Difference Error, which is composed of the expected optimal
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
19.6 Modelling and Simulation Examples 603
reward and the current value. The learning rate α Î [0, 1] indicates how much the new value
affects the old one. DynaQ+, an extension of the well-known DynaQ specimen of reinforce-
ment learning algorithms, adds additional planning steps, which use already taken actions
under consideration of when they were performed in order to get to the optimal policy faster
and to enable agents to react to changes in their environment.
As always with tabular reinforcement learning methods, the table of Q-values grows with
the number of possible states. This means it is useful to approximate it with a neural net-
work, which leads us to Deep Reinforcement Learning. One of the main disadvantages of
this method is the dependence of the final result on the initialized Q-values. The structure
of the neural network is defined via layers, precisely an input H (0), output H (L) and n hid-
den layers. Due to the representation of the network as a graph, we are using Graph Con
volutional Networks, where the propagation layers are defined as
where H(l) indicates the l-th neural network layer, while W(l) is a weight matrix for them.
The term
indicates the normalization of the adjacency matrix  = A + I.
This neural network is trained with a pre-initialised Q-Network containing the shortest
paths through the infrastructure network. It thereby achieves a much faster learning rate
and, therefore, convergence for our reinforcement learning model.
Combining all these techniques allows us to teach our agent to find a path through the net-
work and to successfully arrive at its destination.
Purpose
The agent-based COVID-19 model aims to give ideas about the potential impact of certain
policies and their combination on the spread of the disease. In doing so, it supports decision
makers to correctly choose between possible policies by comparing the model outcomes
with other important factors such as socioeconomic ones. In order to fulfill this target, it is
relevant that the agent-based COVID-19 model validly depicts the current and near future
distribution and state of disease progression of infected people and their future behavior.
Overview
The system is based on the developments of DWH GmbH and Technische Universität Wien
in the field of infectious disease modelling and the developments of modular simulation
model concepts. The general point of view is depicted in Figure 19.14, which shows the ba-
sis of all infectious diseases: the population. Based on this population the social interaction
is a core part of the real-world behavior for spreading diseases. Based on these two models
the disease itself and additionally the economic and social impact have to be integrated.
The modular concepts require well defined interfaces. This takes time, but on the other
hand the modular blocks can be reused and can be validated one by one. Additionally, the
modules can be exchanged, for instance for other populations with very special social inter-
actions or if other economic aspects are in the focus [40].
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
19.6 Modelling and Simulation Examples 605
Economic Model
Populaon Model
β λ
Suscepble Infected Recovered
= −β = − λ = λ
To meet the real-world-needs a population interaction such as the one depicted in Figure
19.16 is integrated. In the model, regular contacts between agents occur via locations
(school classes, workplaces and households) and location-collections (schools, care homes),
while random leisure time contacts extend the standard contact network.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
606 19 Modelling and Simulation – Create your own Models
Figure 19.16 Contact network of agents in the agent-based COVID-19 model. Regular contacts
between agents occur via locations (school classes, workplaces and households), location-collections
(schools, care homes), while random leisure time contacts extend the standard contact network.
Infection
infected=true
susceptible = false
severity ∈ (undetected,
incubation latency
mild, severe, critical)
duration duration
Events
Branches
Infectious
infectious infectious=true Transition
duration Parallel process
Time delay
Finish Incubation
Update Infectiousness
infectiousness(t) yes
severity = undetected
no
ICU Bed
icu = true
icu
→ rehospitalisation
Normal Bed (critical)
icu = false
rehospitalisation
→ recovery
(critical)
Recovery / Removal
infected=false
quarantined = false
hospitalised = false
confirmed = false
severity = null
Figure 19.17 State chart of the patient pathway of a person-agent in the agent-based COVID-19
model. Only those state variables that are changed by the corresponding event are labelled, all
others remain at the current value. The initial state of all infection-specific state variables is false
or null, except for “susceptible”, which is initially true.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
608 19 Modelling and Simulation – Create your own Models
Stochasticity
Basically, all model processes, including the initialisation, include the sampling of random
numbers. Therefore, Monte Carlo simulation is applied, results of runs are averaged and
also their variability is assessed. Aside from being time-consuming to smoothe, the stochas-
ticity of the model is actually its key strength. It allows one to model heterogeneity and
skewness of the infection-network, which distinguishes the model from classic macroscopic
approaches. This specifically refers to the means by which contacts are modelled: Since the
person-agent’s contact rate is initially drawn from a Gamma distribution, the contacts sam-
pled via Poisson distribution result in a so-called Gamma-Poisson mix, which is by defini-
tion Negative-Binomial distributed. This strategy allows us to directly parametrise the skew-
ness of the contact network in accordance with published information on the dispersion
factor of COVID-19 clusters.
Parametrization
With respect to parameterization, we distinguish between model input and model parame-
ters. Classical model parameters specify scalar or array-typed model variables that are ini-
tialised at the beginning of the simulation and, if not changed by certain model events, keep
their value for the entire simulation time. Examples are the infection probability of the
disease, the age-dependent death rate of the population, or the distribution parameters of
the recovery time.
In contrast to model parameters, the model input consists of an event-timeline that de-
scribes at which point in time a certain incident changes the behavior of the model. This
incident usually refers to the introduction of a policy, like the closure of schools or the start
of tracing but may also refer to instantaneous changes of model parameters which are re-
lated but cannot be directly attributed to policies, such as the increase of compliance among
the population to increased hygiene recommendations.
Calibration
Clearly, there is no valid data available for direct parameterization of the base infection
probability parameter, which is the most fundamental of the three factors that decide about
a transmission in case of a direct contact. First of all, this parameter is hardly measurable
in reality and moreover strongly depends on the definition of “contact”. Consequently, this
parameter needs to be fitted in the course of a calibration loop.
The calibration experiment is set up as follows:
We vary the parameter “infection probability” using a bisection algorithm.
For each parameter value, the simulation, parameterized without any policies, is executed
ten times (Monte Carlo simulation) and the results are averaged.
The average time-series for the cumulative number of confirmed cases is observed and
cropped to the value as at the initial upswing of the epidemic curve: to be specific, all
values between 200 and 3200. In this interval the growth of the curve can be considered
as exponential.
The cropped time-series is compared with the corresponding time-series of real measured
data in Austria.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
19.6 Modelling and Simulation Examples 609
Both time-series are compared regarding the average doubling time of the confirmed
cases. The difference between the doubling times is taken as the calibration error for the
bisection algorithm.
Model Implementation
The simulation of ABMs such as the specified agent-based COVID-19 model is a huge chal-
lenge with respect to computational performance. Because the model cannot be scaled
down, almost 9 million interacting agents just for Austria need to be included into the model
to simulate the spread of the disease in the entire population.
These high demands exclude most of the available libraries and software for ABM, including
AnyLogic, NetLogo, MESA, JADE, or Repast Simphony [42-46]. Most of these simulators
cannot be used, as their generic features for creating live visual output generates too many
overheads.
Thus, it was decided to use a self-developed agent-based simulation environment ABT
(Agent-Based Template [41]), realized in 2019 by dwh GmbH in cooperation with TU Wien.
The environment is implemented in JAVA and specifically designed for supporting a repro-
ducible simulation of large-scale agent-based systems.
Model Output
The outcomes of the model are time series with a daily time basis. They consist of aggre-
gated numbers describing the current nation- and/or region-wide spread of the disease as
well as numbers depicting the contact behavior of agents. These include, for example, the
cumulative number of confirmed cases, the number of currently active asymptomatic cases,
the total number of daily newly infected 10- to 30-year-old females, the total number of daily
contacts for school children, or the average number of secondary infections per agent.
The calculated results of the model, together with additional scenarios and extensions, were
used as a decision support for the Austrian COVID consortium and are, together with other
data, the basis for the COVID rules for the public.
Forecast
Consider the current stock information, the supplier plans and delivery replenishments,
which may be pulled by the customer from the stock at any point in time: There could be
instances when the pull from the customer increases without being forecasted, resulting in
a stock-out situation. Without VMI modelling, the supplier would have been held responsi-
ble for a failed delivery. To better assign responsibilities for any kind of stock violation, a
root-cause enabling VMI performance measurement approach is used which follows the
scheme below.
True
WP ≥ WPTarget None
False
True
FA ≥ FA Target Supplier
False
True
BFA ≥ FA Target Supplier
False
Excepons
Exist
False True
Customer Supplier
Figure 19.19 Configuration for determination of the root-cause during VMI Performance
Measurement
One of the main KPIs used is the overall weekly performance, WP, which is compared to
target weekly performance .
NV, OS, US, and SO represent no-violation, over-stock, under- stock and stock-out. To reflect
the severity of the violation, certain weights are further added to the formula.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
19.6 Modelling and Simulation Examples 611
Taking this scheme as a basis, a Deep Reinforcement Learning (RL) approach was taken to
calculate optimal replenishment quantities.
A compact description of Reinforcement Learning can be found in the use case ‘Dynamic
modelling of railway networks’, as well as in Chapter 8.
For solving the underlying Markov Decision Problem in this case, a deep Q-Network (DQN)
that combines RL with deep neural networks was taken. The respective action space, state
space and reward function were defined as follows:
State space: One unique state which is a normalized value between −1 and +1.
FSP is the anticipated stock position, Zw/zwF the anticipated maximum/minimum target
stock level, MF the mean value of these and DTMF the anticipated distance to the mean.
Action space: This space is discrete and has 9 (0 → 8) possible values, where the replenish-
ment policy is a function and m is the magnitude of action. As an example,
a replenishment amount of 7500 units corresponds to the value a=4 and m=2500.
Reward function: This uses the current stock position CSP as a main driver and assigns a
value between –1 and +1 as reward/penalty. Zw/zw is the maximum/minimum target stock
level, M the mean value of these and DTM, the distance to the mean.
Different scenarios were used to evaluate the performance of this model. Scenario 1 used
real data and randomly selected product type and customer. For Scenario 2 and 3 training
sets representative of the variables in Scenario 1 where generated. Both used the Martin-
gale Method of Forecast Evolution as a method for calculating random demand and forecast
and Scenario 3 enhanced it by introducing sporadic rise and fall for these variables. All
these scenarios were tested on real data.
It was shown that it is possible to optimize the replenishment policy by implementing this
DRL algorithm. The most significant result was that the percentage of “no-violation” inven-
tory status improved significantly (from 43% to 99% in Scenario 3). This came with in-
creased transportation costs due to a rise in the number of total shipments.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
612 19 Modelling and Simulation – Create your own Models
19.6.4 F
inding Feasible Solutions for a Resource-constrained Project
Scheduling Problem with Reinforcement Learning and Implemen
ting a Dynamic Planing Scheme with Discrete Event Simulation
Figure 19.20 Example of a dependency tree for schedule generation. The colors signify various
processes that happen in a specific area of the train. Squares are processes, diamonds are mile-
stones.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
614 19 Modelling and Simulation – Create your own Models
A working plan assigns a start-time tstart(p) Î [Tstart, Tend] and workers Ri(p) Í Ri for every
i Î RT, to every p Î P. The working plan is feasible if, with
tend(p) = tstart(p) + dur(p) the time when process p is finished,
P open(t) = {p Î P | t < tstart(p)} open processes at time t,
Prun(t) = {p Î P | t Î [tstart(p), tend(p))} processes that are currently being worked on
at time t,
Pdone(t) = {p Î P | tend(p) ≤ t} processes that are finished at time t,
the following constraints hold at all times t Î [Tstart, Tend]:
(1)
Prun(t) : |Ri (p)|≥ ni (p) (2)
Ri : |{p Prun(t): r Ri (p)}| ≤ r(t) (3)
(4)
(5)
(6)
Among all feasible plans, we want to identify those that are optimal with respect to the fol-
lowing goals:
Goal 1: Minimize
which is to say, minimize the makespan. This means the end-time of the last task that will
be processed.
Goal 2: Find such that b1 + · · · + b|RT |, or some cost function depend-
ing on the number of resources, is minimal, and such that for all t and all i Î RT, we have
That is, we want to reduce the number of resources per type in a way that still allows finish-
ing all activities in the predefined makespan [Tstart, Tend].
As already mentioned, these types of problems are computationally expensive to solve
directly. During the research phase for finding a suitable representation of the problem, we
employ constraint programming to solve smaller variants of it. Nevertheless, scalability
issues quickly arise.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
19.6 Modelling and Simulation Examples 615
Reinforcement learning is applied for the target of minimizing the makespan of the prob-
lem, while a separate tree-based search is used for solving the resource minimization prob-
lem.
Reinforcement Learning
Reinforcement Learning is a rapidly developing field of artificial intelligence, which em-
ploys various techniques to iteratively learn how to maximize some problem-specific reward
function. We’ve already seen two specific examples of reinforcement learning in Section
19.6.1 and Section 19.6.3, and a detailed yet general discussion can be found in Section
12.8. So to summarize the concept here in simple terms, there is an environment with a
predefined state, some predefined actions on this state, and a reward for actions taken. To
maximize the reward, a so-called “agent” has to learn to balance the need for exploration
of the state/action space with the exploitation of known actions and rewards.
In this case the system is rewarded for building a feasible schedule. The reward is higher if
the makespan is lower. Motivated by the reward function described in [53], the model learns
to generate a number of feasible and optimal schedules.
Tree-based search
After calculating the minimal makespan, the available resources in multiple dimensions are
gradually reduced until a solution is found that still satisfies the maximum makespan re-
quirement, but needs the least amount of resources. Thanks to the tree-based search ap-
proach, the number of iterations needed can be drastically reduced.
Figure 19.21 Sample Chart of occupied locations during a schedule. The X-Axis signifies the date,
the Y-Axis the location.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
616 19 Modelling and Simulation – Create your own Models
Simulation
In real life, uncertainties have to be dealt with quite often. Processes may be finished later
than planned, resources are missing on short notice, and all sorts of obstacles arise, which
cannot be planned for beforehand. Thus, one of the goals in this example is to accommodate
a prospective planner with performance figures on the stability of schedules. To do this, we
create a discrete event simulation, which incorporates such stochastic variations, to mea-
sure the impact of disturbances of the system. This allows us to estimate the impact of single
processes, and so find bottlenecks in the system. Surprisingly, we find that the bottleneck
is often not the resource, but rather the occupied location, which causes significant delays
in the schedule.
■ 19.8 In a Nutshell
Modelling and simulation provide additional strategies and methods for modern
data science; they add techniques to integrate real world system behavior and
therefore a broad spectrum for scenario calculation and answering “What if . . .”
questions. Modelling is no encapsulated method, it uses various data
techniques for parametrization, calibration, validation as well as for scenario
definition.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
References 617
References
[1] Bridges, T. J., and S. Reich (2006). Numerical methods for Hamiltonian PDEs. Journal of Physics A:
Mathematical and General, 39(19), 5287–5320. https://doi.org/10.1088/0305-4470/39/19/s02
[2] Nandakumaran, K., and P. S. Datti, Partial Differential Equations: Classical Theory with a Modern
Touch, Cambridge University Press, 348, May 2020.
[3] Großmann, C., and H.-G. Roos: Numerische Behandlung partieller Differentialgleichungen. Teub-
ner, 3. Auflage, 2006.
[4] Leveque, R. J.: Finite Volume Methods for Hyperbolic Problems. Cambridge University Press,
2002.
[5] Hackbusch, W.: Iterative Lösung großer schwachbesetzter Gleichungssysteme. Teubner, 1991.
[6] Murthy, D. N. P., Page, N. W., and E. Y. Rodin, Mathematical modelling: a tool for problem solving
in engineering, physical, biological, and social sciences. Oxford: Pergamon Press, 1990.
[7] Bender, E. A., An Introduction to Mathematical Modelling. New York: Wiley, 1978.
[8] “Zielgerichtete mathematische Modellbildung – Werkzeuge für die Praxis”, in: Dynamische Pro
bleme-Modellierung und Wirklichkeit, Univ. Hannover, 1996, pp. 1–29.
[9] Zeigler, B. P., Theory of Modeling and Simulation. New York: Wiley, 1976
[10] Freire, J., Bonnet, P., and D. Shasha (2012). Computational reproducibility: state-of-the-art, chal-
lenges, and database research opportunities. In: Proceedings of the 2012 ACM SIGMOD Interna-
tional Conference on Management of Data (pp. 593–596). https://doi.org/10.1145/2213836.2213908
[11] Popper N., 2015 Comparative Modelling and Simulation – A Concept for Modular Modelling and
Hybrid Simulation of Complex Systems Systems (Doctoral dissertation). Vienna University of
Technology, Vienna, Austria.
[12] Sargent, R. (2010). Verification and validation of simulation models. In: Proceedings of the 2010
Winter Simulation Conference (pp. 166–183). Baltimore, MD.
[13] Balci, O. (1994). Validation, verification, and testing techniques throughout the life cycle of a sim-
ulation study. Annals of Operations Research, 53(1), 121-173. http://doi.org/10.1007/BF02136828
[14] Balci, O. (1997). Verification, validation and accreditation of simulation models. In: Proceedings of
1997 Winter Simulation Conference (pp. 135–141). Atlanta, GA, USA.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
618 19 Modelling and Simulation – Create your own Models
[15] Breitenecker, F., Emergency Department – Follow-up Treatment – ARGESIM Comparison 6 – Defi-
nition. SNE Simulation Nodes Europe. 1992; 2(3):30.
[16] Grimm, Volker & Berger, Uta & Bastiansen, Finn & Eliassen, Sigrunn & Ginot, Vincent & Giske, Jarl
& Goss-Custard, John & Grand, Tamara & Heinz, Simone & Huse, Geir & Huth, Andreas & Jepsen,
Jane & Jørgensen, Christian & Mooij, Wolf & Müller, Birgit & Pe’er, Guy & Piou, Cyril & Railsback,
Steven & Robbins, Andrew & Deangelis, Donald. (2006). A Standard Protocol for Describing Indi-
vidual-Based and Agent Based Models. Ecological Modelling. 198. 115–126. 10.1016/j.ecolmodel.
2006.04.023.
[17] North, M. J., and C. M. Macal, Managing Business Complexity: Discovering Strategic Solutions
with Agent-Based Modeling and Simulation. New York: Oxford University Press, 2007.
[18] Sargent, R. G., “Verification and Validation of Simulation Models,” in: Proceedings of the 2007
Winter Simulation Conference, Washington, D. C., USA, 2007, pp. 124–137.
[19] Law, “How to Build Valid and Credible Simulation Models,” in: Proceedings of the 2009 Winter
Simulation Conference (WSC), Austin, TX, USA, 2009, pp. 24–33.
[20] Weinstein, M. C., et al., “Principles of Good Practice for Decision Analytic Modeling in Health-Care
Evaluation: Report of the ISPOR Task Force on Good Research Practices-Modeling Studies,” Value
in Health, vol. 6, no. 1, pp. 9–17, Jan. 2003.
[21] Balci, “Verification, Validation and Testing,” in: Handbook of Simulation: Principles, Methodology,
Advances, Applications, and Practice, J. Banks, Ed. Hoboken, NJ, USA: John Wiley & Sons, Inc.,
2007.
[22] W. Wing-Yi Chan, A Survey on Multivariate Data Visualization. Hong Kong: Department of Com-
puter Science and Engineering, Hong Kong University of Science and Technology, 2006, pp. 1–29.
[23] Ayres, F., heory and Problems of Differential Equations. New York: McGraw-Hill, 1952.
[24] Kermack & McKendrick (1927) A contribution to the mathematical theory of epidemics. Proceed-
ings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical
Character. The Royal Society 115(772): 700-721. Available at: http://dx.doi.org/10.1098/rspa.1927.0118.
[25] Forrester, J. W. (1958). Industrial Dynamics – a Major Breakthrough for Decision Makers. Harvard
Business Review, 36(4), 37–66.
[26] Yearworth, M. (2014). A Brief Introduction to System Dynamics Modelling (p. 15). University of
Bristol. 24 October 2014
[27] Freeman, R., Yearworth, M., Angulo, M., and T. Quested (2013). Evidence building for waste pre-
vention: understanding the causal influences that result in waste. Paper presented at the 31st In-
ternational Conference of the System Dynamics Society, Cambridge, Massachusetts USA.
[28] Pejic Bach, Mirjana & Tustanovski, Emil & Ip, W. H. & Yung, Kai & Roblek, Vasja. (2020). System
dynamics models for the simulation of sustainable urban development: A review and analysis and
the stakeholder perspective. Kybernetes. 49. https://doi.org/10.1108/K-04-2018-0210.
[29] Brailsford, S. C., “System dynamics: What’s in it for healthcare simulation modelers,” 2008 Winter
Simulation Conference, 2008, pp. 1478–1483, doi: 10.1109/WSC.2008.4736227.
30] Einzinger, P., 2014. A Comparative Analysis of System Dynamics and Agent-Based Modelling for
Health Care Reimbursement Systems (Doctoral dissertation). Vienna University of Technology,
Vienna, Austria.
[31] Rahman, N. (2014). A System Dynamics Model for a Sustainable Fish Population. International
Journal of Technology Diffusion, 5(2), 39–53. doi:10.4018/ijtd.2014040104
[32] Zeigler, B. P., “DEVS Today: Recent Advances in Discrete Event-Based Information Technology,” in
11th IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer Tele-
communications Systems, 2003. MASCOTS 2003., Orlando, FL, USA, 2003, pp. 148–161
[33] Rahmi, S. M., “C6 Emergency Department: Follow-up Treatment - Taylor ED,” Simulation News
Europe SNE, vol. 10, no. 2–3, p. 33, Dec. 2000.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
References 619
[34] Bonabeau, E., “Agent-based modeling: Methods and techniques for simulating human systems,”
Proceedings of the National Academy of Sciences, vol. 99, no. 3, pp. 7280–7287, May. 2002.
[35] Macal, C. M., and M. J. North, “Tutorial on Agent-Based Modeling and Simulation, Part 2: How to
Model with Agents,” in: Proceedings of the 2006 Winter Simulation Conference, Monterey, Cali-
fornia, pp. 73–83.
[36] Wooldridge, M., “Agent-based software engineering,” IEE Proceedings - Software Engineering, vol.
144, no. 1, p. 26, 1997.
[37] Bicher, M., Urach, Chr., and N. Popper. GEPOC ABM: A Generic Agent-Based Population Model for
Austria. In: Proceedings of the 2018 Winter Simulation Conference, pp. 2656–2667, Gothenburg,
Sweden, 2018. IEEE.
[38] Railsback, S. F., and V. Grimm (2012). Agent-based and individual-based modeling: a practical in-
troduction. Princeton Univ. Press, Princeton. OCLC: 811181165.
[39] Epstein, J. M. (2012). Generative Social Science. Princeton University Press, Princeton. OCLC:
956983748.
[40] Miksch, F., Mathematical Modeling for New Insights into Epidemics by Herd Immunity and Sero-
type Shift. ASIM Fortschrittsbericht. doi: 10.11128/fbs.20 (2016).
[41] Mossong, J., Hens, N., Jit, M., Beutels, Ph., Auranen, K., Mikolajczyk, R., Massari, M., Salmaso, St.,
Scalia Tomba, G., Wallinga, J., et al. POLYMOD social contact data (2017)
[42] Beate Jahn, Gaby Sroczynski, Martin Bicher, Claire Rippinger, Nikolai Mühlberger, Júlia Santa
maria, Christoph Urach, Michael Schomaker, Igor Stojkov, Daniela Schmid, Günter Weiss, Ursula
Wiedermann, Monika Redlberger-Fritz, Christiane Druml, Mirjam Kretzschmar, Maria Paulke-
Korinek, Herwig Ostermann, Caroline Czasch, Gottfried Endel, Wolfgang Bock, Nikolas Popper,
and Uwe Siebert. Targeted covid-19 vaccination (tav-covid) considering limited vaccination capac-
ities—an agent-based modeling evaluation. Vaccines, 9(5), 2021.
[43] Juan Ignacio Latorre Jimenez. EUROSIM 2019 Abstract Volume. In: EUROSIM 2019 Abstract Vol-
ume. ARGESIM, 2019.
[44] Lauer, St. A., Grantz, K. H., Qifang Bi, Forrest, K. J., Qulu Zheng, Meredith, H. R., Azman, A. S., Re-
ich, N. G., and J. Lessler. The incubation period of coronavirus disease 2019 (COVID-19) from pub-
licly reported confirmed cases: estimation and application. Annals of internal medicine, 2020.
[45] David Masad, D., and J. Kazil. Mesa: an agent-based modeling framework. In: 14th PYTHON in
Science Conference, pp. 53–60, 2015.
[46] Makoto Matsumoto and Takuji Nishimura. Mersenne twister: a 623-dimensionally equidistributed
uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simu-
lation (TOMACS), 8(1):3–30, 1998.
[47] Vendor Managed Inventory (2021): Wikipedia, https://en.wikipedia.org/w/index.php?title=Vendor-
managed_inventory&oldid=1054860363
[48] Bullwhip effect (2021): Wikipedia, https://en.wikipedia.org/w/index.php?title=Bullwhip_effect&ol
did=1054327373
[49] Afridi, Nieto-Isaza, Ehm, Ponsignon, and Hamed. “A DEEP REINFORCEMENT LEARNING AP-
PROACH FOR OPTIMAL REPLENISHMENT POLICY IN A VENDOR MANAGED INVENTORY SET-
TING FOR SEMICONDUCTORS.” In Proceedings of the 2020 Winter Simulation Conference
[50] Artigues, Resource-Constrained Project Scheduling.
[51] Hartmann und Briskorn, “An Updated Survey of Variants and Extensions of the Resource-Con-
strained Project Scheduling Problem”.
[52] NP-Hardness: Wikipedia, https://en.wikipedia.org/wiki/NP-hardness
[53] Zhao u. a., “A Deep Reinforcement Learning Approach for Resource-Constrained Project Schedul-
ing”.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
20
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
Data Visualization
Barbora Antosova Vesela
Humans have a natural need to understand the environment that surrounds them. Any
contact with this external environment is mediated by basic senses such as sight, hearing,
smell, taste and touch. Each of these senses has a unique ability to receive a specific type of
information. The transmission of information is unevenly scattered between them: at most,
up to 80 %, visual perception is involved [30].
In an effort to capture this information, individual data is collected by observing, measuring
or recording action. The data is used to describe an observed phenomenon, object or p roperty.
Data visualization combines both collected data and the advantages of visual perception. It
creates a graphical representation of data using various types of charts, maps, diagrams
and histograms, which help the observer to faster and better understand the observed
phenomenon, object or property. It represents an analytical approach.
Nowadays a trend of digitalization is permeating our society. Although we are inundated
with a large quantity of data, we often see only fragments of individual events and must put
them together to obtain an overview of the situation. It is becoming increasingly difficult to
distinguish important from unimportant information. It is even impossible to see a whole
image of the situation.
Information is processed in our brain differently in conscious versus subconscious pro-
cesses. At a conscious level, deliberation is based on serial computation with the use of
logical, rational and syntactic rules. On the other hand, subconscious mechanisms use more
parallel computation. People often attribute their and others’ decisions to just simple argu-
ments, but in fact, behind these decisions lie complex integrations of small pieces of infor-
mation, gathered during entire life experiences [1].
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
622 20 Data Visualization
When a certain topic has been quantitatively researched, data visualization offers insights
into that issue and helps us to process the information better. Even large, high-dimensional
data can be processed in an easily understandable way by the human brain [2]. With less
effort we capture the information in a shorter time. Visualization brings us new informa-
tion, allowing us to see hidden relationships and representations. It forces us to ask better
questions and helps us make better decisions. It tells us a story of yesterday and today, and
predicts a story for tomorrow.
Data visualization is not just nice to have, it is a must. Even a simple chart can help us more
quickly understand a scenario and better decide the next action items. On the other hand,
sophisticated data visualization can be a masterpiece. Its main purpose is to attract atten-
tion and deliver the information. This requires a deep dive into a problem we are trying to
solve; it requires we think broadly about the problem set in the specific situation and envi-
ronment from different perspectives. The combination of a creative, logical and critical
thinking is necessary. Both options, a simple chart and sophisticated data visualization, can
be comparably powerful.
■ 20.1 History
The effort to capture quantitative information into a visual form has a long history, starting
with the first cartographic maps and graphics in statistics belonging to the ancient period.
The predominant interest was in astrology and cartography. Due to the need to orientate,
navigate and explore the surrounding environment, the first cartographic maps were
created. The work of Claudius Ptolemy was particularly important: in his “Guide to Geogra-
phy”, he introduced a manual for compiling maps with the help of astronomical coordinates
mostly processed in tabular form. His mathematical model of sonar system, where the Earth
is stable and other bodies orbit it, has been considered correct for fifteen centuries [5].
The oldest known example of capturing variables graphically comes from the 10th century,
and is reproduced in Figure 20.1. This anonymous graph shows the inclinations of the orbits
of the planets as a function over time. The background is covered by a grid, the vertical axis
represents the location, and the horizontal axis represents the time. This graph is a part of
a manuscript which seems to have been used in monastery schools [6 – 9].
A further advancement of data visualization occurred in the 16th century. This was a period
full of invention, research and exploration, considered as the beginning of modern science.
With scientists were working in different areas simultaneously, diverse fields such as natu-
ral science, mathematics, engineering, astronomy, navigation and geography were devel-
oped. New techniques and instruments for observation and measurements were invented.
Particularly important was the development of triangulation, a technique which involves
applying multiple data collection techniques in order to analyze the results of any given,
individual study. Its aim is to remove the weaknesses of individual methods, which, if used
separately, would cause certain aspects of the topic to be overlooked [10, 8, 11]. During this
period, the first trigonometric tables by Georg Rheticus and the first modern cartographic
atlas “Theatrum Orbis Terrarum” (Theatre of the World) by Abraham Ortelius were created.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
20.1 History 623
The atlas contained the majority of the then-available knowledge about and maps of the
world. It was the first attempt to gather geographic knowledge in one volume [8, 9, 12].
Figure 20.1 An anonymous graph from the 10th century showing the inclinations of the orbits of
the planets as a function of time [7]
The 17th century saw continued interest in measurements of physical variables such as
time, distance and space, leading to further developments in astronomy, map making, nav-
igation, mathematics, probability theory and statistics. Theories of errors of measurement
and estimation were developed, and studies of population, agricultural land, wealth and
goods were made. The first visual representation of statistical data, created by Michael
Florent van Langren and titled “1644” shows estimates of the difference in longitude be-
tween Toledo and Rome. It contains the names of the astronomers who provided the estima-
tion, and an arrow representing the true distance (16° 30’), which was not known at that
time. This visualization is considered as the earliest example of the principle of “effect
ordering for data display” [8, 9]. See the depiction in Figure 20.2.
Figure 20.2 “1644”, by Michael Florent van Langren, showing estimates by various astronomers of
the difference in longitude between Toledo and Rome [8]
“1644” was not the only data visualization first to occur in the 17th century. In this time, we
also see: the first graph of a continuous distribution function from Gaunt’s table based on
the bills of mortality, the first representation of a theoretic curve connecting barometric
pressure to altitude, and the first weather map capturing winds [8].
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
624 20 Data Visualization
The greatest flourishing of the field was done in the 18th and 19th centuries, nowknown as
a period of new graphic forms. In cartography, isolines, contours, and isogons to show mag-
netic declination were all developed, in order to visualize physical properties in addition to
geographic data.
Also for the first time, geometric objects such as squares and rectangles were used to com-
pare the areas of European states. New methods such as curve fitting and interpolation were
developed. Meanwhile, the invention of three-color printing helped attract more attention to
the rapidly developing field of data visualization.
During this period, the line graph, bar chart and pie chart were all invented by Willian
Playfair, considered to be the father of information design. He used two vertical scales for
two different variables: population and taxes. He created “The Commercial and Political
Atlas”, which includes 43 time series plots and one column chart. Figure 20.3 shows one of
these time series plots.
Figure 20.3 William Playfair’s time series graph providing information about exports and imports of
England to and from Denmark and Norway between 1700 to 1780 [13]
Other new creations from this time include histograms and scatterplots were created. Maps
were collected in atlases showing economic, social, medical and physical topics. An example
of medical visualization is the dot map created by Dr. John Snow, showing deaths due to
cholera clustered around the Broad Street water pump in London, see in Figure 20.4 [14].
Printed coordinate paper was patented by Dr. Buxton.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
20.1 History 625
Figure 20.4 Graphical depiction of cholera deaths clustered around Broad Street pump in London.
The pump is highlighted with blue color [14]
Also important was the work of Charles Joseph Minard, who created multiple useful visual-
izations, the most famous of which being a graphic describing the progress of Napoleon’s
soldiers in Russia. It represents a complex map using various variables such as tempera-
ture, geography, historical context, and loss of life of Napoleon’s soldiers at a time and loca-
tion. See the depiction in Figure 20.5. Minard was the first to use bar charts as proportional
symbols in the legend for better understanding the map [15, 16].
Figure 20.5 Minard’s graphical depiction describing the progress of Napoleon’s soldiers in Russia
[19]
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
626 20 Data Visualization
Florence Nightingale invented a polar area chart, so called “rose diagram”, see in Figure 20.6.
Luigi Perozzo created one of the first 3D data representations that shows the relationship
between three variables. The visualization shows age among the Swedish population. The
horizontal axis represents years, the vertical axis shows the number of people, and the third
coordinate shows age groups [18, 9]. See the depiction in Figure 20.7.
Figure 20.7 Three-dimensional data depiction created by Luigi Perozzo showing ages among the
Swedish population [18]
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
20.2 Which Tools to Use 627
The first half of the 20th century can be considered as a modern dark period. The invention
of new graphic forms slowed. Focus was on the popularization of data visualization, which
became mainstream. Graphical methods crucially contributed to new insights and theories
in natural sciences.
The second half of the 20th century was affected by the rapid development of computers.
With the creation of the first high-level computer language, Fortran, computers began to be
used for processing statistical data. It was possible to collect and store larger volumes of
data, process it easily, and create visualizations quickly. The new libraries specialized for
work with data were created. At this point, computer science research began.
Later in this period, researchers like John W. Tukey in statistics and Jacques Bertin in
cartography developed the field of information visualization. The first full color computer
visualizations, and new interactive graphic software, were created. Important was the work,
“The Visual Display of Quantitative Information”, written by Edward Tufte. Methods for vi-
sualization of discrete and categorical data were reinvented. Various new, multidimensional
reduction techniques such as biplot and correspondence analysis were developed and older
techniques such as principal component analysis, multidimensional scaling and discrimi-
nant analysis were used. New visualization methods such as the association plot, mosaic
plot and sieve diagram were developed [9, 8].
Nowadays, data visualization penetrates into various fields such as medicine, natural
sciences, sport industry, economy, business and others. The processing of data has become
a big topic.
We have to know our data. Often, we deal with heterogenous data obtained from various
sources captured in different times. This can lead to a pipeline of errors, starting with a
misleading data visualization and continuing with information misinterpretation, ending
with making ill-informed or simply wrong decisions. To avoid this situation, we need to be
careful and thorough when gathering this data.
We must also decide how to manage our data. Various tools may support various ways of
working with data. A special interest must be paid to data protection, uploading inputs and
downloading outputs. That is to say, data protection is a must. Visualization tools have to
support a possibility to log in and allow access only for authorized users. The application
should provide an option to assign users various roles with different levels of right based on
their needs and power. Often, read only and edit functions are available.
Uploading inputs is often supported in three main ways. One way is to upload our data
directly to the application. This method is simple, but if we deal with a large amount of data,
it can be slow. The tool providers also often prohibit this method. Furthermore, this option
does not provide full protection and control over data.
The second way is to upload our data to a cloud. It is a space on multiple, remote servers,
delivered on demand with just-in-time capacity and costs. The data storage is managed and
operated by a cloud computing provider as a service. On the one hand, this method saves on
hardware costs, but on the other hand, it increases dependencies on third party companies.
Nevertheless, it is becoming relatively popular [3].
The third way is to store data in our own servers and hardware. This method is relatively
expensive, but it provides full control and protection over data. It brings the company inde-
pendence from third party providers.
The tools usually support various input formats, and we must ensure that the format we
deal with is among them. Otherwise, we need to convert our data to the supported format.
It is important to know which output formats the tool offers, and choose an appropriate one
for our purpose.
We should consider how we would like to use the application. Basic tools can be suitable for
short time simple solutions, while for long-term solutions, where continuous data collection
takes place, more complex tools with support functions for scalability and iterative process-
ing are needed.
The collection of a large amount of data is time consuming. Data scientists enjoy to see the
results unfolding progressively. Often collection step goes together with the processing of
currently acquired available data. It brings the opportunity to view the current computa-
tional status and start making some decisions before the whole computing is finished [4].
Support for the integration of visualization tools in an analytical workflow is necessary [4].
Otherwise, we will need to invest more time, energy and money to adjust our current work-
flow.
A tool may be used differently by individual users and teams. A large number of companies
emphasizes teamwork collaboration. The tool should react to this and support the function-
ality of working in a team.
Developers welcome the possibility, which some applications allow, to slightly modify them
and extend them according to their needs. This option enables developers to set up the tool
exactly for their purpose.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
20.3 Types of Data Visualizations 629
Visualization tools offer different levels of assistance: from the basic functionality to the
more complex. The tools often offer an interactive environment which allows faster orienta-
tion in a visualization.
In addition, advanced applications could provide recommendations for users regarding
which graphical depiction to use for the data or which questions should be asked. The tool
could not only analyze data from the past and today, but also give predictions for the future,
based on the processing of historical data supported by an implementation of artificial intel-
ligence and other machine learning algorithms in its workflow.
In summary, Figure 20.8 shows six main areas which should be addressed during the selec-
tion of a data visualization tool, and suggests questions that can be asked during this
process.
Figure 20.8 Six main areas and related questions to consider when selecting a data visualization
tool
At the end, the kind of tool we choose is up to us. It is necessary to test a tool in advance and
find out which suits us the best. We take into account whether we will use the tool for a long
or short time period.
testing data; in modelling, to view a neural network’s performance and final evaluation. The
possibility to explore and visualize data increases the number of companies which use the
power of business intelligence to drive their business. It helps them understand customer
behavior, identify areas where improvement or modification of a product is needed, and
predict future trends in a market.
For various applications, different data visualization types are used. In the following, we
provide a short insight into the elementary types of visualizations.
The scatter plot is a graphical depiction of quantitative data, which are represented by
points in a space described by Cartesian coordinates. Each axis corresponds to the value of
one variable. Usually, the horizontal coordinate is determined as an independent and verti-
cal as a dependent variable. Thus, typically, a scatter plot works with two variables. How-
ever, additional variables can be added in the form of different colors, hues, shapes or sizes
of the data points. Beside the individual data points, a scatter plot allows us to explore pat-
terns in data, such as missing data and outliers. Sometimes a trend line, mathematically
calculated as the best fit to the data, is added to the plot to show the nature of variables’
relationships to one another: this can be linear or nonlinear, positive or negative, weak or
strong. See examples of scatter plot in Figure 20.9.
Text
Figure 20.9 Left: a scatter plot showing two variables without a relationship. Right: a scatter plot
with a red trend line showing a strong negative linear relationship between variables
A line chart, as shows in Figure 20.10, is a graphical depiction using points representing
quantitative data connected by line from left to right, showing local change in value. Typi-
cally, the independent variable with continuous progression measured in regular intervals,
such as units of time, is placed on the horizontal axis; the dependent variable of interest is
represented on the vertical axis. The chart shows a trend in data. It also allows the display
of multiple data sets, using a multiline option in which each data set corresponds to one
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
20.3 Types of Data Visualizations 631
line. Often, the lines are differentiated by color or hue. It is also possible to add uncertainty
or variability of data to the plot.
Text
Figure 20.10 Left: a line chart between two variables. Right: the line chart with uncertainty added
As depicted in Figure 20.11, an area chart is a special type of line chart displaying quanti-
tative values of data over an interval or period of time, using colored or textured area below
the line, placed in Cartesian coordinate grid. The x coordinate should be always equal to
zero. The graph provides a quick impression about the trends and quantity of data over
time, and is designed to attract attention [22].
Figure 20.11
The figure shows an example of Area chart
Column and bar charts are the graphical depictions of quantitative information with the
help of columns or bars, which are placed in relation to a horizontal and vertical coordinate.
Whereas a bar chart is oriented horizontally, the column chart is oriented vertically. The
length of each column or bar represents the proportional number of values in a correspond-
ing measured class. Both charts are used to visualize categorical, ordinal and nominal data,
and two or more values are compared. A special case is a double column and double bar
chart, which can be used to show multiple sets of data in one plot. There are two variants:
side-by-side and stacked. Side by side is powerful in the comparison of two data sets,
whereas the stacked variant is used to show the total of two data sets [20]. See examples of
a column chart in Figure 20.12 and a bar chart in Figure 20.13.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
632 20 Data Visualization
Figure 20.12 Left: a side-by-side column chart, useful for comparing two data sets. Right: a
stacked column chart, often used to show the total of two data sets
Figure 20.13 Left: a side-by-side bar chart, useful for comparing two data sets. Right: a stacked
bar chart, often used to show the total of two data sets
20.3.4 Histogram
A histogram, such as the one depicted in Figure 20.14, is a graphical representation of con-
tinuous data measured at intervals using bins. It shows the shape (skewness, symmetricity)
of the distribution of data for a single quantitative variable such as age, shades of grey in an
image, and so on. Only one statistical variable is presented on the horizontal axis. The
vertical axis typically shows the number count or percentage of occurrences observed in
data. The height of the bar corresponds to the number of data points falling into the corre-
sponding range of the bin. Often, less variation in the height of the bars on the vertical axis
indicates more variability on the horizontal axis [23, 24].
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
20.3 Types of Data Visualizations 633
Figure 20.14 Histograms with different data distributions. Clockwise from top left: symmetric, uni-
form, right skewed, and bimodal
Compared to the bar chart, between the histogram bins there is not a gap, and the size of all
bins is similar.
Depicted in Figure 20.15, the pie chart, or “circle graph”, is a plot providing a fast overall
impression of categorical data with a lack of detailed information. It is a plot used to display
data with the help of sectors in a circle. It shows only positive values. The number of sectors
corresponds to the number of categories and the sector size is proportional to the number
of values in the measured class. The sector size is calculated according to Equation 20.1.
Equation 20.1
Slices should be sorted according to their size, from the largest to the smallest, where the
largest should begin at 0° degrees, at the top. The graph can be enriched with text placed in
each sector, with the name of category and its frequency. Pie charts do not allow one to show
data changes over time [21].
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
634 20 Data Visualization
Figure 20.15
An example pie chart with text labels showing the infor-
mation about the Group names and size of the sectors
The box plot, also called a “schematic plot”, provides information about the spread and
symmetry of data distribution using the median, approximate quartiles (25th and 75th
percentile), minimum and maximum values of data points [25, 26]. The plot can be refined
to also show unusual data points, such as outliers. Box plots helps us explore data and iden-
tify hidden patterns; for example, it can highlight asymmetric and irregularly shaped data
distributions and show the extreme outliers [25, 26]. An example is shown in Figure 20.16.
Figure 20.16 An example box plot, in which the red line corresponds to the median; the horizontal
lines represent the 25th and 75th quartiles; the horizontal black lines show the minimum and
maximum; and the circles depict outliers
A heat map, such as that shown in Figure 20.17, is a graphical depiction typically used to
highlight a specific range of values of data indicating the observed feature. It allows us to
explore how a phenomenon varies and is clustered over space.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
20.3 Types of Data Visualizations 635
Figure 20.17 An example heat map, from which you can deduce that the most numbers of samples
were correctly associated with its corresponding category
The clustered heat map works with a data matrix, where individual cells are displayed as
rectangular tiles of a uniform size, filled with a specific color shades and hues based on
their data values. The columns and rows of data matrix usually correspond to the columns
and rows of the heat map [27, 28].
A tree diagram is used to map the structure of hierarchical data based on a branching sys-
tem, using a parent-child relationship. Each node can have zero or more children and data
on the same level. The siblings, should not overlap. Tree diagram enables multiple varia-
tions of layout. See example of tree diagram in Figure 20.18.
A A is a parent of B & C
There are, of course, many more kinds of data visualizations available than we have time to
cover here. Figure 20.19 depicts just a few of these, including the Venn Diagram, Radar
Chart, and Pyramid Graph.
Pyramid Graph
Pyramid Graph shows hierarchical
structure of data.
Value of Y
Group A
Group B
Group C
Group D
Value of X
Figure 20.19 A selection of additional visualization types which we will not cover in this chapter
The variable data types can be divided into the following categories:
Qualitative (categorical): This is data that can be divided into characteristic groups, or
categories. These could include, for example, hair color (“blond”, “brown”, “black”) and
education level (“basic”, “high school”, “university”).
Quantitative (numerical): Quantitative data can be counted. Thus, it could include vari-
ables such as age (10 years, 24 years, 36 years, 75 years), height (173 cm, 180 cm,
154 cm), weight (15.0 kg, 66.4 kg, 103.8 kg), number of children in a household (0, 1, 2,
3), price of an item (€ 5, € 8, € 10) and so on.
Discrete: Discrete data is represented in whole numbers, as integers. This results in a gap
between data points, since it is not possible to divide discrete values. Thus, the number of
children a family has (0, 1, 2, 3), or the number of hospital visits a patient made in a year
(0, 1, 2, 5), are possible examples.
Continuous: Continuous data is counted in decimals, as floats. Theoretically, between
data points is no gap, as it is always possible to divide the space between two values into
infinitely smaller parts. Body temperature, for example, need not be measured in whole
degrees (36.5 °C, 37.0 °C, 35.8 °C), and thus would be an example of continuous data, as
would height (173.2 cm, 180.0 cm, 154.2 cm) and weight (15.4 kg, 66.0 kg, 103.1 kg).
Binary: This is data which can be classified into just two categories. One of the sources of
this type of data is the closed question, to which the answer can only be “yes” or “no”.
Binary data values can have a quantitative or qualitative character. For example, life sta-
tus (“alive”, “dead”) or (1, 0), health status (“healthy”, “ill”) or (0, 1).
Nominal: Nominal data is naturally disordered, with no sequence between the catego-
ries. The data values can have a quantitative or qualitative character. For example, hair
color (“blond”, “brown”, “black”), aircraft model types (“helicopter”, “airplane”, “drone”,
“balloon”), number of children (0, 1, 2, 5).
Ordinal: Ordinal data is naturally ordered and discrete, meaning there is a sequence be-
tween the categories. The data values can have a quantitative or qualitative character. For
example, the level of vertebra fracture (“mild”, “moderate”, “severe”) or (1, 2, 3), level of pain
(“mild”, “moderate”, “severe”) or (1, 2, 3) would all fall under the category of ordinal data.
Interval: Interval data is continuous data which has no true zero point such as time.
Ratio: Unlike interval data, ratio data has true zero point. It can include measures such
as height (0.3 cm, 1.8 cm, 30.4 cm) and weight (15.0 kg, 66.4 kg, 103.8 kg).
The variable data types and their relationships are shown in Figure 20.20.
Data Types
Quantitative Qualitative
As shown in Figure 20.21, for each specific type of data, certain kinds of data visualizations
are more appropriate than others.
Quantitative Qualitative
Discrete
Scatter Plot
Continuous
Bar Chart Heat Map
Line Chart
Column Chart Tree Map
Histogram
Pie Chart Venn Chart
Boxplot
Funnel Chart
Funnel Chart
Waterfall Graph Figure 20.21
Mekko Chart
Radar Chart
Appropriate data
Pyramid Graph visualizations per type
of data
The right data visualization to use can also be selected also according to the intention of the
communication. Figure 20.22 provides a comprehensive overview.
What is communicated?
Heat Map Column Chart Box Plot Tree Map Area Chart
Figure 20.22 Appropriate data visualizations according to the message being communicated
plex data, which can be used to gain knowledge. Depending on its intended purpose, it can
be used to explore trends over time, depict the distribution of a dataset and find hidden
features.
This can be achieved by a simple uniform design, where each item presents a clear and
direct piece of information. Optimal data density, which can be achieved by avoiding over-
plotting, has to be considered: The amount of data should not overlap to the point where it
is difficult to see the correlation between variables. All data should be clearly readable. This
also applies to any supporting text, whose size and font style has to be appropriately
selected. See examples of overfitting in Figure 20.23 for a scatter plot, Figure 20.24 for a line
chart and Figure 20.25 for an area chart.
Wrong Correct
Text
Figure 20.23 Left: an example of overfitting in scatter plot; right: a more appropriate display
(notice how the correct result can be achieved by reducing the point size)
Wrong Correct
Correct Correct
Figure 20.24 Top left: overfitting in line chart. For better readability, it is often recommended to
display only one area per chart, as is shown in the remaining examples
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
640 20 Data Visualization
Wrong Correct
Correct Correct
Figure 20.25 Top left: overfitting in area chart. For better readability, it is often recommended to
display only one area per chart, as is shown in the remaining examples
Additional attributes such as colors, textures, hue and geometric symbols can be added, to
attract the audience’s attention and help them to better understand the topic. If the presen-
tation is to include multiple visualizations, the same graphical style should be kept through-
out. In general, depiction should not distort the data.
Often it is beneficial to highlight the data points of interest, so that the audience is drawn
to them immediately. This can be applied for example in a scatter plot for important data
values, or in a line chart for the points with the maximum and minimum values. Examples
this are provided in Figure 20.26.
Correct Correct
Figure 20.26 Correct examples of a scatter plot. Left: showing the trend line in red; right: high-
lighting important points
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
20.5 Tips and Tricks 641
When creating visualizations for unordered variables, it is beneficial to follow the principle
of effect ordering for data display. This principle says that the unordered data should be
sorted according to the effect that is supposed to be showed. Similar items are placed
together, which makes the data visualization globally more coherent [29]. Figure 20.27
shows this principle being applied to a column chart.
Wrong Correct
Figure 20.27 Left: unsorted columns in a column chart, leading to a confused message. Right: the
same column chart, but following the principle of effect ordering for data display
You are also welcome to use labels in visualizations. For example, in the case of a pie chart,
people can have difficulties comparing the size of the slices. Thus, it can be beneficial to add
the text containing the size information, as depicted in Figure 20.28.
Correct
Figure 20.28
Comparisons in pie charts can be made easier by
adding labels for group name and size
Whereas a zero baseline in a bar or column chart, or a histogram, is required, in a line chart
it may be optional. In cases of comparison, relative values instead of absolute should usually
be used. This gives space for proper interpretation, and correct conclusions can be drawn.
The width and number of the columns in a histogram should be checked according to the
level of detail you wish to present. Whereas having fewer, wider columns provides less de-
tailed information, and may cause some important patterns to be hidden, having many
narrow columns allows you to show more details, can make it difficult to distinguish the
noise from the valid data. In an ideal case, such as that shown in Figure 20.29, a trade-off
between the level of detailed information and the size and number of columns in histogram
is achieved.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
642 20 Data Visualization
Wrong Wrong
Correct
Figure 20.29 Example histograms with differing number and sizes of bins. Top left: having many,
narrow bins reveal more detailed information but can include noise; top right: having fewer wider
bins provide less information; bottom: a histogram correctly balancing the level of detail presented
In the case of a line chart for showing a small amount of data, we show measured data as
points connected by a line. As Figure 20.30 shows, using point markers ensures that, when
datapoints are missing, this is visually clear. This prevents the confusion and negates the
risk of people making incorrect assumptions.
Wrong Correct
Figure 20.30 Left: an incorrectly displayed line chart for the case of small data where some
datapoints may be missing. Right: a corrected version, where data values are displayed as points
connected by a line
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
References 643
In addition to these tips and tricks, each data visualization task has to be considered and
planned individually, in order to communicate the required message for the specific audi-
ence in the correct manner.
■ 20.7 In a Nutshell
Data visualization brings vital insights into data. It is about exploring the data
features and helping us find hidden patterns. It allows better and faster under-
standing of a topic, and effective communication of that topic to new audiences.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
644 20 Data Visualization
References
[1] W. Singer, “The ongoing search for the neuronal correlate of consciousness,” Open MIND, vol. 36,
p. 36(T), 2015, doi: 10.15502/9783958570344
[2] L. van der Maaten, G.E. Hinton, “Visualizing High-Dimensional Data Using t-SNE,” Journal of
Machine Learning Research, vol. 9, p. 2579–2605, 2008
[3] “What is Cloud Storage? | AWS.” Amazon Web Services, Inc. Accessed September 2, 2021. https://
aws.amazon.com/what-is-cloud-storage/.
[4] G. Andrienko et al., “Big data visualization and analytics: Future research challenges and emerg-
ing applications,” CEUR Workshop Proc., vol. 2578, no. February, 2020.
[5] “Klaudios Ptolemaios – Wikipedie.” Wikipedie, Otevřená Encyklopedie. Last modified August 25,
2005. https://cs.wikipedia.org/wiki/Klaudios_Ptolemaios.
[6] H Gray Funkhouser. 1936. A note on a tenth century graph. Osiris 1 (1936), 260—262
[7] “Data Visualization.” Wikipedia, the Free Encyclopedia. Last modified December 18, 2005. https://
en.wikipedia.org/wiki/Data_visualization.
[8] W. H. Inmon and D. Linstedt, “A Brief History of Data,” Data Archit. a Prim. Data Sci., pp. 39–44,
2015, doi: 10.1016/b978-0-12-802044-9.00007-6.
[9] “Data Visualization: History and Origins.” Think Insights. Last modified August 17, 2021. https://
thinkinsights.net/digital/data-visualization-history/.
[10] “Triangulace (metodologie) – Wikipedie.” Wikipedie, Otevřená Encyklopedie. Last modified Au-
gust 9, 2006. https://cs.wikipedia.org/wiki/Triangulace_(metodologie).
[11] “The Technology, Science, and Inventions of the 16th Century.” ThoughtCo. Accessed Septem-
ber 2, 2021. https://www.thoughtco.com/16th-century-timeline-1992483.
[12] D. Roegel, “A reconstruction of the tables of Rheticus ’ Canon doctrinæ triangulorum (1551) HAL
Id : inria-00543931 A reconstruction of the tables of Rheticus ’ Canon doctrinæ triangulorum,” no.
1551, 2021.
[13] K. Francis, M. Jacobsen, and S. Friesen, “The Use of Graphics to Communicate Findings of Longi-
tudinal Data in Design-Based Research,” Proc. 2015 InSITE Conf., no. November 2017, p. 928, 2015,
doi: 10.28945/2240.
[14] T. Koch and K. Denike, “Essential, illustrative, or ... just propaganda? Rethinking John Snow’s
Broad Street map,” Cartographica, vol. 45, no. 1, pp. 19–31, 2010, doi: 10.3138/carto.45.1.19.
[15] “The Underappreciated Man Behind the “Best Graphic Ever Produced”.” National Geographic.
Last modified March 16, 2017. https://www.nationalgeographic.com/culture/article/charles-mi-
nard-cartography-infographics-history.
[16] “Analyzing Minard's Visualization Of Napoleon's 1812 March.” Thoughtbot. Accessed Septem-
ber 2, 2021. https://thoughtbot.com/blog/analyzing-minards-visualization-of-napoleons-1812-march.
[17] “Florence Nightingale Understood the Power of Visualizing Science.” Science News. Last modified
May 13, 2020. https://www.sciencenews.org/article/florence-nightingale-birthday-power-visualiz-
ing-science.
[18] R. Rau, C. Bohk-Ewald, M. M. Muszyńska, and J. W. Vaupel, “The Lexis Diagram,” no. 2001, pp.
5–10, 2018, doi: 10.1007/978-3-319-64820-0_2.
[19] “Charles Joseph Minard.” Wikipedia, the Free Encyclopedia. Last modified January 18, 2006.
https://en.wikipedia.org/wiki/Charles_Joseph_Minard.
[20] “Bar Chart.” BetterEvaluation. Last modified October 31, 2014. https://www.betterevaluation.org/
en/evaluation-options/BarChart.
[21] “Pie Chart.” BetterEvaluation. Last modified October 8, 2014. https://www.betterevaluation.org/en/
evaluation-options/piechart.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
References 645
[22] “What is an Area Graph, How Does an Area Graph Work, and What is an Area Graph Good For?
— Storytelling with Data.” Storytelling with Data. Last modified April 9, 2020. https://www.story-
tellingwithdata.com/blog/2020/4/9/what-is-an-area-graph.
[23] R. L. Nuzzo, “Histograms: A Useful Data Analysis Visualization,” PM R, vol. 11, pp. 309–312, 2019,
doi: 10.1002/pmrj.12145.
[24] L. Boels, A. Bakker, W. Van Dooren, and P. Drijvers, “Conceptual difficulties when interpreting
histograms: A review,” Educ. Res. Rev., vol. 28, no. September, p. 100291, 2019, doi: 10.1016/j.
edurev.2019.100291.
[25] M. Krzywinski and N. Altman, “Visualizing samples with box plots,” Nat. Methods, vol. 11, no. 2,
pp. 119–120, 2014, doi: 10.1038/nmeth.2813.
[26] D. F. Williamson, R. A. Parker, and J. S. Kendrick, “The box plot: A simple visual method to inter-
pret data,” Ann. Intern. Med., vol. 110, no. 11, pp. 916–921, 1989, doi: 10.7326/0003-4819-110-11-916.
[27] N. Gehlenborg and B. Wong, “Points of view: Heat maps,” Nat. Methods, vol. 9, no. 3, p. 213, 2012,
doi: 10.1038/nmeth.1902.
[28] L. Wilkinson and M. Friendly, “History corner the history of the cluster heat map,” Am. Stat., vol.
63, no. 2, pp. 179–184, 2009, doi: 10.1198/tas.2009.0033.
[29] M. Friendly and E. Kwan, “Effect ordering for data displays,” Comput. Stat. Data Anal., vol. 43, no.
4, pp. 509–539, 2003, doi: https://doi.org/10.1016/S0167-9473(02)00290-6.
[30] P.Kolář, “Posilování stresem Cesta k odolnosti,” Universum, pp. 61, 2021
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
21
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
“Data is the raw material of the 21st century” is a common statement in C-level floors of
large companies. Many CEOs are paying increased attention to this topic. Companies that
ignore the trend of generating value from data with analytical processes risk jeopardizing
their existence.
Therefore, many companies face the question of how they can set up a sustainable, compa-
ny-wide data strategy. This new orientation can sometimes challenge existing business
models. Sometimes it even means that the data strategy can fundamentally change a com-
pany, requiring completely new business models.
For many companies it’s not easy to get started, since they don’t have the experience of
Google or Facebook, which have been mining petabytes of data for many years. This chapter
shows from a business perspective how a company can build a sustainable data strategy.
In this chapter, a model is presented, which consists of the three areas “Technology”, “Orga-
nization” and “Business”.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
648 21 Data Driven Enterprises
■ 21.1 T
he three Levels of a Data
Driven Enterprise
Figure 21.1
The intersection between Business,
Technology and Culture
Data in the enterprise context is multidisciplinary. This is not just a technical challenge but
also an organizational and cultural one. In order to increase data usage within an enter-
prise, leaders need to change corporate policies and the culture. This is explained in more
detail in Section 21.2, “Culture”.
Section 21.3, “Technology,” explains the economic aspects of data platforms in more detail.
Technical aspects are not discussed here, as in principle, the entire book deals with techni-
cal aspects.
Section 21.4, “Business,” discusses certain basic requirements for implementing data-
driven projects. Specific use cases are not mentioned here, as you will find a comprehensive
presentation in Chapter 16, which is structured according to industries and areas.
■ 21.2 Culture
“Each strategy lasts until the first contact with the enemy.
After that, there is only one system of substitutes.” – Helmuth Graf von Moltke
A key aspect of a corporate strategy is the organization of its units. The following sections
focus on the core aspects:
1. Corporate strategy for data
2. Culture and organization
The first part, Corporate Strategy for Data, is primarily about the maturity model and how
to create a data strategy. The second part, Culture and Organization, looks at corporate
development. The essential point here is that the two aspects do not have to be carried out
iteratively, but can run parallel and concurrently.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
21.2 Culture 649
Every hike begins with a position fix. If you want to climb a mountain, you have to make
sure you have suitable hiking maps (in digital or non-digital form) with you and know your
position. Similarly, if one wants to establish a data strategy in the company, it is essential to
know the current state first. Therefore, a current state analysis is normally carried out
during strategy development. With the help of this analysis, it is possible to determine the
maturity level of one’s own organization. In the following, the four maturity phases are
presented.
Phase 1: Disorganized
Strictly speaking, a company is already analyzing data when it stores information in Excel
spreadsheets and uses this data as the basis for discussions in meetings. However, we do not
want to take the view of information in spreadsheets as an analytical process here.
When we talk about companies in Phase 1, we are talking about those in which deci-
sion-makers usually make decisions based on gut feelings. In meetings, they may discuss
what is written in Excel spreadsheets and what should be done as a result. However, it is not
uncommon for decisions to be less strategic, and instead, made on an ad-hoc basis. Often
they are based on compromises.
There is nothing wrong with this corporate culture per se. However, if you want to position
analytics in this phase, you have to make sure that the mentality of the people involves
changes. Decision makers must learn to trust data analysis. When management under-
stands that more can be achieved with a structured approach than with ad hoc decisions,
the foundation is laid for Phase 2.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
650 21 Data Driven Enterprises
Phase 2: Experimentation
If we look at Big Data as a discipline for efficiently analyzing mass data, many companies
are in Phase 2.
Management has already understood that complex analytics can generate value. Individual
decision-makers may also already have ideas about how their own companies could benefit.
But there is still a lack of confidence. Even though numerous consultants have already
pointed out the urgency to act quickly, executives still hesitate to make major investments,
well aware that data programs can be expensive.
Instead of investing a lot of money in hardware and programs, employees are given the
space to experiment on their own initiative on the back burner, in the hope that greater
value will become apparent here, which might then justify an investment.
Depending on the size of the company, it can also happen that several people or depart-
ments want to take on this issue. Accordingly, conflicts of responsibility can arise. In large
companies, it often happens that individual departments work on overlapping projects with-
out knowing about each other.
Often, data initiatives are heavily IT-driven because many IT staff are curious and eager to
try out new technologies. Sometimes, however, there are also efforts within various depart-
ments to work on the data topic, and individual departments may enter into competition. To
a certain extent, this is desirable, as it generates a lot of ideas. However, if this happens
without strategic orientation, it can be rather counterproductive and block projects in the
long term.
It is important for the data strategist not to get caught between the two fronts. Instead, they
should aim to bundle the interests of the people involved, and pragmatically direct the
energy to where enough value is generated with data so that the company management
releases further investment capital. This is the only way to reach Phase 3.
Phase 3: Operationalization
While Phase 2 is still about discussing the value of data-driven businesses, Phase 3 is already
clear on how a business can benefit from data. In Phase 3, proof of concepts (PoCs) have
already been created to determine how the company can generate added value from data.
What companies lack in Phase 3 is the operationalization of the solutions. In other words,
the results that were determined in PoCs must now become operationally usable solutions
and services for the department or for the company’s customers. This step is often a very
complex one, since PoCs are often very simple and essential parts of the PoC have to be
extensively reworked for operational use.
Let’s look at operationalization with some examples. Self-driving cars have been on the road
for years and are accident-free in a controlled environment. But it will be some time before
we are able to trust these cars to handle all situations at least as well as human drivers.
Think here specifically of chaotic traffic situations like a rush hour in a major city.
In a PoC in a controlled environment, it is not rocket science to identify test persons via
machine learning when they enter a store. But making a global and automated solution out
of it, in which customers also receive personalized offers and all data protection require-
ments are met, is another matter.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
21.2 Culture 651
The effort required to turn PoCs into real operational applications is often underestimated.
Many believe that transferring results from PoCs to production is just a formality. However,
many companies fail at the very first operationalizations because the complexity is high and
many problems can arise, from data governance to infrastructure.
Phase 4: Automation
In the automation phase, the company has an operational platform. New analytical use cases
follow a process that is partly automated and through which success can also be measured
As before, PoCs and subsequent operationalizations can be complex and can also get out of
control. Data science is a structured process to generate new knowledge, it will never work
without the risk of failure.
The essence of Phase 4 is that companies understand the process and the value that is gen-
erated from data.
The mindset is also already there at this point. The company perceives itself as being
data-driven. No one questions the value and benefit of the transformation anymore. The
teams around big data and analytics are constantly creating added value for the company.
The Current State analysis is the actual starting point for a future data strategy. Here, the
maturity level in the areas defined at the beginning, “Technology”, “Business” and “Cul-
ture”, is determined. The analysis is typically carried out using the questionnaire technique
and is intended to provide a comprehensive picture of the state in the company with regard
to data. The following questions are asked in the three core areas:
Technology
Type of technologies and databases used
Data volumes in the systems and (daily) growth
Type of existing “data marts” and their data
Security and authorization systems for data and policies
Use of the data and type and frequency of access to them
Governance systems for data
Standards and systems around data quality
Systems that produce data
Organization
Business units that deal with data, such as
– Employees in the company and their know-how for data, and
– Departments dealing with data
Business
Current projects
The future project portfolio
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
652 21 Data Driven Enterprises
The business environment (e.g. competitors) and the company’s own assessment in
comparison to it
Problems that exist in the handling of data
The time it takes to access the data
Use of external data
General maturity level of the department to use data for decisions in daily work
Depending on the size of the company, this analysis can be more or less extensive. If one
considers a small company with a centralized corporate structure, the analysis will be very
quick to perform. However, the authors have a lot of experience in working with large inter-
national corporations with decentralized structures. Here, the respective country organiza-
tions with their respective departments must be considered.
After the completion of the analysis, the evaluation and derivation of the most important
strategic levers follows.
Project Selection
Let’s look at the value proposition of Advanced Analytics and Big Data. These serve to
reduce costs,
help make quicker decisions,
open up new markets and generate new business models.
It can often be helpful to try to identify specifically in which areas the company is operation-
ally blind. For example, a telecommunications provider traditionally makes money from
subscribers. By using Big Data, completely new business areas can come to light. However,
those who are always focused on driving up subscriber numbers may overlook what other
revenue streams may be available to the company.
Common questions are:
How can we use a customer journey to determine what motivates customers to cancel
(churn)?
How can we use data to ensure that our machines fail less often (predictive maintenance)?
The next step is to identify the potential data sources. The attention should not only be on
existing data sources, but also include potentially new ones. There should be no barriers to
the thought process.
The result should be a matrix of existing data sources and their relation to possible use
cases. Companies that have numerous or complex data sources can also purchase a Data
Catalog.
It can be helpful to start with a first high-level exploration of the data in this phase in order
to find information about the data quality.
The final step is project selection. Concepts of classical project selection play a role here.
The following points should be essential in the consideration:
Costs vs. Benefits of the Project: The ratio of costs to expected revenue increases is
calculated. This results in either cost savings or new business models.
Feasibility: Projects can be complex and lengthy to implement.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
21.2 Culture 653
Now that the corporate strategy has been established, it is time to look at the culture within
the company and the organization. Whole books are written about corporate cultures, but
here we want to focus “only” on the topic of data. This chapter will outline that, but let’s get
one thing straight: data is driven by the business, so it’s important that the unit for data is
not hidden in some deep hierarchy in IT. Simply because there is no value to be added by
doing so.
The Team
In any case, creating and implementing the data strategy is a large team effort that requires
different roles. We now look at these roles from a business perspective, having already
looked at the platform perspective in Chapter 3, which includes information on the tasks of
the operational roles Data Engineer and DevOps Engineer.
Business Analyst: This role comes directly from the business environment of the com-
pany. A Business Analyst knows the company well and also the individual functions and
divisions. She determines which projects and measures make sense in the company and
what added value they bring. Analysts can also be assigned to individual departments. In
banks, for example, there are Risk Analysts.
Domain Expert: Often the most underestimated role in any data program is the Domain
Expert. This person knows the domain inside and out from an operational perspective. In
retail, it may be the ladies and gentlemen who have been sitting at the register for 20
years sifting through merchandise. In a telecommunications operation, it may be a Net-
work Engineer who understands every detail of a radio access network. Domain Experts
are the people who can provide feedback on whether and how a data program’s business
ideas are feasible.
Data Scientist: We have already discussed the tasks of a Data Scientist in Chapter 2.
Therefore, we will only look at the difference between a Data Scientist and a Data Analyst,
since both analyze data.
A Data Scientist focuses on generating new knowledge by analyzing data, whereas an
Analyst only analyzes data to make a business decision. A data scientist generates new
knowledge by deriving future decisions from past data and helping the company to be-
come better. For example, he makes predictions about how business areas can develop in
the future. A Data Analyst, on the other hand, simply describes the current situation. The
data she analyzes could consist of, for example, the basis for an annual report with finan-
cial figures.
BI Engineer: BI Engineers are visualization professionals. They know how to turn boring
data into exciting graphics that create real “aha” experiences. The role of BI Engineers
can also be seen as a bridge between Data Science and the business. BI Engineers usually
use self-service BI tools and help users to prepare data in an appealing way.
Business Data Owner: Business Data Owners are a kind of data accountant. They make
sure that governance rules are followed. A business data owner constantly drives data
quality and thus plays a major role in helping data scientists do their job better. In this,
the Business Data Owner evaluates the input data and checks it against quality metrics.
They should have a very good understanding of how data producing systems work and
why data is in a particular format. In manufacturing operations, this could be machine
data, for example. The Business Data Owner understands how the machine works and is
therefore a good judge of why data is produced and in what form. Their importance is
often underestimated, but they belong in a data team like controllers belong to account-
ing. The business data owner has the overall responsibility of the data in the respective
business department.
Internal Organization
One issue that is often discussed is the question of the reporting lines. Since data programs
are almost vital to the survival of many companies, the reporting function should be as high
up in the hierarchy as possible. In large companies, this is the Chief Data Officer. This
person leads the data strategy and is essential when it comes to transforming the company
into a data-driven enterprise. This is primarily a person who is entrusted with strategic
decisions.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
21.2 Culture 655
Some companies see analytics as part of IT. This is not advisable unless IT solutions are the
core products of the company. Data Engineers should remain in the line of the CIO, while
Data Scientists are to be found within the business departments. The background is that the
goals of Data Scientists are not the same as those of IT departments. IT departments aim to
ensure stable operations. An IT fulfills its purpose when it enables other departments to do
their job smoothly. A Data Scientist, on the other hand, produces knowledge that ideally
opens up new business areas.
Nor should an analytics department be part of a BI department, since the BI department is
usually several levels below the management. Conflicts of objectives also arise here. BI
departments aim to optimally handle business processes for a department with the help of
analytics. The task of a Risk Analyst is to provide a risk assessment, but the task of a data
scientist could be to analyze whether information can be gained from the risk assessments
of the analysts and the linkage with other data sources that help to strategically put the
company on a new footing. The Risk Analyst usually does not have this creative freedom.
Only in very few cases will a data program completely break through the corporate struc-
ture. In most cases, it will adapt to existing cultures. Often there is also a combination of
centralized and decentralized structures. A central team defines the framework within
which the decentralized teams can operate, but does not control the decentralized teams
(see CDO department).
1
https://agilescrumgroup.de/spotify-modell/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
21.2 Culture 657
This model of tribes makes it possible to establish a central unit, but at the same time bring
it closer to the business. However, this also requires that both the IT managers and the
managers in the business departments change their mindset in a cooperative direction. In
Tribes, the actual reporting line is in IT, whereas functional management is the responsibil-
ity of the business. The respective elements are described below:
Tribe: The tribe is the highest order in the agile setup. Here, there is usually a tribe
leader who only performs technical tasks to a limited extent. This person acts more as a
strategic HR manager. Core tasks are employee reviews, training, development and the
assignment to the right squat. The Tribe Leader coordinates closely with the business
departments and learns about the strengths and weaknesses of the individual members
of the Tribe. The Tribe Leader has a large number of direct reports, typically ranging from
around 30 to several hundred.
Squat: A squat is comparable to a Scrum team. The squat is assigned to a business func-
tion and covers its business cases. A squat can be temporary or long-term, depending on
which use case is to be implemented. A squat consists of all the functions that are neces-
sary for a use case. A Squat is usually led by a Squat Leader (Scrum Master).
Chapter: As individuals work in the squat, they are often not arranged per their usual
subject matter. This can lead to Data Engineers or Data Scientists often acting as “lone
wolves”. This problem is solved by chapters, where the respective professionals are
grouped together.
Figure 21.4 represents this setup.
The illustration shows a fictitious tribe in a telecommunications company. Here there are
various business functions such as “Marketing”, “Network”, “Products” and “Media”. For
each of these functions there is a squat that implements projects together and under the
ownership of the business. All four Squats make up the Tribe. It is important that each
squat can act independently, which not only affects prioritization, but also technology selec-
tion. A key enabler for this is the “Data Mesh”, which is described a little later in this chap-
ter. However, in order for there to be uniformity of technologies, it is important to ensure
that the respective functions are exchanged across chapters.
Clear Responsibilities
Since data is multidisciplinary, this also creates some problems. One of the core problems
is the question of who is responsible for what in the company. This question often leads to
long discussions which waste time and resources. A remedy here can be a RACI matrix,
which simply regulates the responsibilities around the topic of “data”. RACI stands for:
Responsible: Person(s) or departments that are responsible for the actual implementa-
tion. This is, for example, the use case team, which consists of data engineers and data
scientists.
Accountable: Person or department that is mainly responsible for the result (e.g. the
product owner, budget manager, ...)
Consulted: Person(s) or department(s) who have important information about the
product. They should be consulted, but not responsible for the outcome. This could be, for
example, Legal, Privacy or Security.
Informed: Person(s) or department(s) who have the right to information about the prod-
uct.
Here, the respective points, such as who is responsible for the operation of the platform,
who is responsible for the data quality, and so on, are presented in a list. In the horizontal
dimension, the respective stakeholders are listed. These are usually:
Business: The particular business function of the company, such as marketing, finance,
and purchasing
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
21.2 Culture 659
In the long term, each of these decentralized data sources - which can already be called a
data mesh - should offer its own APIs. This achieves harmonization without comprehensive
data integration programs. You only bring the silos closer to standards and thus improve the
data quality in the long term and sustainably.
One problem that all large companies have to deal with is the so-called “skills gap”. Many
key IT positions are often difficult or impossible to fill. Large, established companies also
have this problem, although it affects them less intensely, as many employees tend to opt for
large corporations. This looks good on the CV and corporations usually pay much better
than small to medium sized companies.
Nevertheless, the skills gap affects all companies. Therefore, it is not enough to launch large
internalization initiatives. It takes much more than that: building skills on the one hand and
networking existing employees on the other.
Figure 21.5
Skill Development: Groups
Figure 21.5 shows the three main groups relevant for skills development in the enterprise.
The respective groups are described below:
All Employees: In this group it is important to create an online learning program for all
employees of the company. Only the basics should be taught here, ranging from use cases
around data to explanations of common terms. The goal is that employees, such as those
in controlling, helpdesk or similar, are familiar with the basic concepts of data. The aim
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
21.2 Culture 661
is intentionally not to reach a technical level, as such a goal could scare off this group of
people. This group is very important, as there is potential here for employees to become
more interested in the data topic and thus be considered for re-skilling.
Managers: In this group, managers in particular should be addressed. Depending on the
level, courses should be offered either virtually or as classroom training. Lower and mid-
dle management should be taught about the tools and possibilities of data. Often it is
middle management that is afraid of digitalization and thus blocks these topics. The
courses should take away the fear of the subject and bring data-driven thinking to this
group. For top management, a different approach should be taken. Ideally, a one-day work-
shop that needs to be planned long in advance. The workshop should primarily empha-
size the business value of data. But it is also important to mention some basic terms and
technical challenges, such as data governance.
Experts: Very important is also the further training of the experts. There are several
reasons for this: On the one hand, you have to make sure that the most important experts
are always up to date. On the other hand, training offers bind these experts to the com-
pany. This group also includes potential experts. These are, for example, software devel-
opers or mathematicians who would be able to work with data. It is often easy to offer
them re-skilling. It is important in the group of experts that they receive targeted train-
ing, like opportunities to attend specialist conferences or take courses on new tools.
Extensive skill development within the company is essential in any case. This section can
only provide an overview. Ultimately, it is important that the programme is developed in
close cooperation with HR. In many large companies there are specially appointed learning
managers for this purpose.
Community Development
Many companies have experts on the topic of data. The cooperation of these is often only
ad-hoc and very unorganized, as they often work on projects in the specialist departments.
This can lead to duplications. Furthermore, synergies are not used. These problems can be
solved by an organized community within a company. A role for community management
can be established at a central point, such as the CDO department. In those companies
where this was the case, large sums could be saved because projects could be reused and
technical platforms as well.
Typically, different formats are used in a community:
Monthly video conferences: various projects are presented here with technical details.
It is advisable to make the whole thing virtual so that other locations can also be inte-
grated.
Annual conference: it is important that all the people involved meet in person once a
year. Since it involves travel, it is advisable to make it a two-day event. This way you can
also include an evening event, which provides a relaxed atmosphere.
In addition to conferences, it is also important to establish internal discussion platforms.
This keeps the people involved in contact with one another outside of the monthly video
conferences. It has been shown that an annual conference is also an important “booster” for
this.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
662 21 Data Driven Enterprises
■ 21.3 Technology
This section deals with the business-relevant aspects of the technical platform. We have
already dealt with the technical details of platforms in Chapters 2 and 3.
Business users are usually less interested in the technical details, such as which framework
is used and in which programming language features are implemented. From a business
point of view, it is relevant whether a platform also has the necessary features to implement
new analytical use cases. Offered systems should have all enterprise features, such as
governance features.
Many big data solutions are based at least in part on open source products. For a long time,
business users in particular argued that free software could not be integrated into the com-
plex IT landscapes of large companies that had evolved over decades. They lacked the enter-
prise components that only proprietary solutions could provide.
Those days are long gone. Behind open source distributions are companies with commercial
interests. These companies extend the open source core with exactly those enterprise
features that their customers need. The business model of making money from open source
distributions by making them “enterprise ready” is also not new. Companies like Red Hat or
SuSE have been successfully running this business model with Linux for years.
The open source business model also exploded the idea of system integrators being able to
be considered universal service providers. Even the largest software houses now use open
source components. Accordingly, the question of whether open source or not hardly arises
any more.
21.3.2 Cloud
The shared vision of the authors of this book is that cloud services will have a status similar
to electricity in the near future. There may be individual organizations, such as hospitals,
that have a generator in the basement to stay in operation during a power outage, but the
norm is to purchase electricity and not produce it yourself.
With on-premise solutions, seven-figure sums often have to be commissioned in order to
start a Big Data program. Many managers often do not want to take responsibility for this
and therefore prefer to start with a lightweight solution in the cloud. In addition, cloud pro-
viders are offering more and more enterprise features that on-premise solutions do not.
Therefore, there is no way around the cloud.
Another argument in favor of the cloud is that it offers precisely the features that are suit-
able for a fail-fast approach and thus enables a process such as “agile analytics". The elas-
ticity and scalability of cloud solutions creates a flexibility that an on-premise solution
cannot offer.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
21.3 Technology 663
Cloud providers also know that companies will not migrate completely to the cloud over-
night. Concepts such as the hybrid cloud, in which companies move only parts of their
landscape to the cloud, are booming accordingly.
Choosing the right platform and deciding on a vendor is harder today than ever before.
The days when companies could trust a universal solution provider to deliver one big appli-
ance that runs everything are over, thanks to the diversity on offer. IT leaders are faced with
a myriad of different solutions and product combinations that must also be aligned with a
cloud strategy.
Problems with technologies are usually in the details. They don’t come up in sales conver-
sations. In addition, vendors not only offer a platform, but sometimes also associated pack-
ages, such as industry-specific data models, which are not always compatible with all soft-
ware.
Changing trends in the offerings of various vendors can also be observed time and again.
Some vendors that used to advertise their products with terms like ‘DWH’ or ‘Hadoop’ years
ago now only talk about an analytical platform without going into details about what exactly
is behind it.
Your core task as a decision maker is not to commit to a technology decision too early. Tech-
nology is a tool to meet your business needs. Some technologies support you better, others
worse. The better you know your goals and requirements, the easier it will be to choose the
right technology. Therefore, explore functional and non-functional requirements. What are
the performance expectations? For which existing operational systems should there be API
support? What does the data look like that needs to be processed?
Another important point is to know the existing IT landscape of the company. From this, you
can also estimate expenses and the readiness for a technology change. Be sure to motivate
your employees: Many are afraid of change or have had bad experiences with one technol-
ogy or another, which might even block a migration.
Vendor lock-in and its impact is also a major topic. If you decide in favor of technology X
today, how strongly are you making yourself dependent on them in the future? As a rule,
you want to be flexible.
Chapter 2 described how data warehouses were often sold by system integrators as an
expensive proprietary data solution, consisting of hardware and software. The data lake is
an alternative with an open source core that is sometimes supplemented by proprietary
software from vendors. The key argument here is cost and scalability.
A data lake replaces data warehouses only in the sense of proprietary databases, but it does
not replace the concept of data warehousing . Modern data lakes also include capabilities to
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
664 21 Data Driven Enterprises
store data in a similar form as in the traditional DWH, and they also allow SQL queries.
That’s why we cite the data lake as a reference for the data platform here in this book.
Data Lakes are often built based on a holistic view of data. A data lake thus serves as a cen-
tral repository for all of the company’s data. Data Lakes also have a comprehensive authori-
sation and security concept and offer data catalogues in which company users can access
the respective data sources.
The biggest danger of the data lake is the risk of a data swamp. If no one knows who owns
the data, and sometimes it is not even clear which data is located in individual directories,
the value of a data lake decreases accordingly. A data swamp usually occurs when data is
loaded into the data lake in an uncontrolled manner. You can prevent a data swamp with the
following measures:
Mindset building: Make it clear to all your employees that loading data into a data lake
in an uncoordinated way is frowned upon. Create data labs where employees can play
around with the data.
Governance: Mindset building is often not enough. To be on the safe side, you can also
introduce governance rules that ensure a managed process. We went into this in more
detail in Chapter 6.
Data Catalog: Uploaded data is indexed by a Data Catalog and thus data is also captured
semantically, see also Chapter 6.
In large companies, there is also the danger that there is not only one data lake, but many.
Communication is the best weapon against redundancy here.
In most industries, IT is an internal service provider that enables the business to generate
value through its services. Whether it’s a financial advisor, a grocer, or any other industry,
there is always a computer system somewhere that stores data or runs applications that are
essential to the value chain.
Even if we locate data science and analytics in business departments rather than in IT, data
initiatives mean an upgrading of the IT department, since data engineering often accounts
for a significant share of the effort. Power shifts and culture changes are feared. For exam-
ple, a production manager used to be the master of his plant. The statement, “I’m not going
to let others tell me how to work” can often be heard in production companies that are trying
to use data to produce smarter. This cultural change must be taken into account in order to
successfully implement the data strategy.
Once a company has built a data lake and fully implemented data governance and security,
they can move on to more advanced things that allow experimentation with real, ano-
nymized data.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
21.3 Technology 665
Users of a data science lab can typically only analyze subsets of data. Data in a data science
lab is usually masked and hashed to make sensitive data unrecognizable. Data science labs
often only exist for a certain period of time, such as during a specific marketing campaign.
After that, the labs are terminated or relaunched. This is why we also speak of time-limited
and unlimited data science labs.
Depending on the use case, different security and governance requirements have to be
considered.
Another key challenge is the fact that data science labs are often operated by humans. In
contrast to data preparation, this is not an automated process, but a manual one. Therefore,
it is necessary that access mechanisms are implemented in detail. Depending on the Data
Science Lab, different access mechanisms (control levels) must be implemented, as shown
in Figure 21.6.
Figure 21.6
How a data science lab is setup
Since 2019, one architecture for data in particular has been discussed very intensively: the
data mesh. Strictly speaking, this is not so much an architecture, but rather an “approach”
to data. First described by Zhamak Dehghani2 In her blog, the Data Mesh has become very
popular. The key idea is that decentralization of data is accepted above all: one of the core
concepts for this is the division into different domains and a clear ownership in the busi-
ness.
The Data Mesh is basically a microservice architecture for data. It shows that any know-how
about a specific data domain resides in the department responsible for it. For example, a
marketing department understands what the particular marketing data means much better
than an IT department ever can. In the past, this was also the reason for the many data silos
that have formed in companies.
2
Original post: https://martinfowler.com/articles/data-monolith-to-mesh.html
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
666 21 Data Driven Enterprises
With the help of data lakes and data warehouses, attempts were made to master the silos.
However, most of these attempts failed or could only be solved by using a lot of resources.
The data mesh approach takes up the idea of distributed data without necessarily having to
develop silos. Rather, the strengths of decentralized data are to be enriched by certain ele-
ments so that silos can be prevented.
Architectural Quantum
One of the fundamental elements underlying the data mesh is the theory of the “architec-
tural quantum”. This is a borrowing from microservice architecture. One should not build a
system too complex, but rather, create it with the smallest possible architecture for the
respective use case. In other words: the use case should determine the technology, not vice
versa.
With the data lakes of the past, one usually created a very complex product that was sup-
posed to cover all conceivable aspects of data. Often, however, the use cases were missing.
If you then looked at the use cases, you might find that the data lake was greatly oversized.
The architectural quantum takes a different approach: the architecture should be kept as
simple as possible such that the use case can still be solved.
This viewpoint is very much in favor of the public cloud, because many data platforms can
be used there on a turnkey basis. With the architectural quantum, one uses these platforms
and takes care of the implementation of the use cases. The focus thus moves away from the
development of the data platform, and onto the creation of value with data.
Data as a Product
Another very important component of a data mesh is the approach that data should be
traded as a product. This means that it should meet the quality standards that are also set
for the company’s own products. If the company is an automobile manufacturer, the compa-
ny’s data should be of the same quality as the cars themselves, for example.
Here, a very clear ownership comes into play: the respective department that produces the
data is also responsible for its quality. Since only the data producers can determine the
quality, this is of central interest. Logically, there must be company-wide, uniform quality
standards, which are usually regulated by a central office (the CDO).
Likewise, data products should be made available internally within the company. This
means that data should be shared with other departments in compliance with the respec-
tive guidelines and standards (such as the GDPR). This can be done, for example, through
data markets or data catalogs.
However, the data mesh requires clear governance: without uniform standards, quality
assurance and data catalogues,it will only promote further silos. Therefore, we need to focus
more on data governance.
■ 21.4 Business
At the highest level of Data Science maturity, a company is able to execute agile analytics
processes in iterations that quickly produce results.
Models usually start with a hypothesis that has to be confirmed or negated using static
procedures on data. For this purpose, data is prepared, analyzed and the results presented.
One of the most important developments in the next few years will certainly be the sale and
purchase of data. Many companies have been looking around for additional data sources for
a long time. One hopes for a better view of things or the customer. A variety of data markets
are currently emerging, which are populated by data producers, data markets and data
buyers.
Data Markets
In the examples described above, one question already arises: How can the data be offered?
On the one hand, there is the possibility of developing complete, delimited products. In this
case, however, one must ask oneself a variety of questions about the secure distribution of
this data. Since these are often companies whose core competence is not software develop-
ment, they often find this difficult. IT and software development is usually just an “unpleas-
ant, expensive cost center”.
To get around this problem, the major cloud providers have launched their own data mar-
kets. Amazon Web Services, for example, offers its own data catalogs with purchasable data,
as does Microsoft. This already makes the sale of data much easier.
Another development are decentralized data markets. Here,sellers offer their data and buy-
ers can obtain it. This way, one does not become so dependent on a provider, and the data is
only available to the companies that participate in the data market. In this constellation,
buyers and sellers often know each other in an economically close relationship.
Many of these decentralized data markets are just emerging, and some very innovative
products are being developed to enable these data markets. One example of this is the
German startup Nexyo3.
We divide an analytical use case into phases from idea to operationalization. These phases
are:
Idea: Someone has an idea for how to gain added value with data. This idea covers at least
one of the three Big Data pillars (cost reduction, faster decisions, new markets). A busi-
ness value must be formulated for the idea, otherwise it will not be heard and the chances
of implementing it are low.
Prioritization: Usually there are other ideas in parallel to one idea. In prioritization, a
company focus is set.
Exploration: An idea that is used for implementation becomes an opportunity. One starts
to collect data, to check their data quality and to formulate hypotheses. During the explo-
ration, the business value is also concretized and the corresponding processes for the
protection of personal data are handled.
Agile modeling: Models are formed, tested, and evaluated. We will go into this step in
more detail in the next section.
Retrospective: Does it pay off to roll out the results company-wide? What are the lessons
learned and how do they affect other use cases?
Operationalization: This is where the model becomes an operational solution that is
rolled out globally.
3
https://nexyo.io/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
21.5 In a Nutshell 669
The supreme discipline is then similar to what is already done in BI-self-service analytics.
The goal here is that business users (who come from marketing, finance or other areas, for
example) can perform simple data analyses as a workflow. This means it is not necessary to
deal with programming languages such as Python or R. However, for self-service analytics,
many basics such as a comprehensive security and authorization concept must first be in
place. It is also necessary that data sources are extensively documented so that business
users can pull together the right data sources. Finally, you need tools that are easy to use -
as easy as you would expect from Microsoft Excel. However, most tools for this are still in
their infancy and none has yet emerged that is as easy to use as Excel itself.
Well-known tools for self-service analytics are Tableau or Power BI. Web notebooks such as
Jupyter are also mentioned again and again in this environment.
■ 21.5 In a Nutshell
Creating High-Performing
Teams
Stefan Papp
“When nobody knows what should be done, everyone says: Something has to be done.”
Helmut Qualtinger (translated from German)
“Our company exists on the market not because of the management, but despite the management.”
Anonymous engineer
“Sleep my friend and you will see. That dream is my reality. They keep me
locked up in this cage. Can’t they see it’s why my brain says rage?”
Sanatorium, Metallica
■ 22.1 Forming
Let us start with a scenario during Halford’s data transition to address conflicts that could
arise during projects. The presented challenges may sound familiar to some readers. Many
jobs are not as straightforward as presented to students during their education. Data pro
fessionals with their first jobs often expect to work most of their time using the skills they
learned during their studies, such as modeling data, extracting features from it, and quali
fying the results. For some, the harsh reality is that they get stuck in corporate bureaucracy
during their daily job routine: filling out forms to get access to data, explaining why data is
needed, justifying how they spend their time in meetings, and thinking about how to satisfy
their boss that they’re still making at least some progress.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
672 22 Creating High-Performing Teams
This chapter introduces pragmatic ways to address challenges and end stalemates in data
transition projects through better communication, changing structures, and applying best
practices. Confronting challenges often requires courage; sometimes it feels like poking a
wasps nest. Therefore, at the end of this chapter, we present a vision of how rewarding it can
be in a high-performing team when all conflicts have been sorted out.
■ 22.2 Storming
22.2.1 Scenario: 50 Shades of Red
Bob was using Halford’s meeting room as an intermediary office to write emails when the
data scientists Mahsa and Rashid walked in. A tall, young man whom Bob had not seen be
fore entered after them.
“I am Klausi,” he said with a strong German accent, extending his tattooed arm to shake to
Bob’s hand. He had long, reddish dreadlocks, and what looked like a permanent smile. “I’m
a student and a user interface design artist. I analyzed the existing user interfaces and have
some pretty fascinating ideas for creating new vibes in this company. I —”
“I’m sorry,” Rashid interrupted. “I forgot to book this room. Mahsa and I have an important
meeting. Maybe you’d like to join us, Bob? It starts in 15 minutes.”
Before Bob could even agree, Rashid started setting up a projector, launching into a ram
bling speech as he worked. “IT could only spare some laptops from the Stone Age for us to
do some work,” he said, rolling his eyes. ”And without admin rights, of course. I couldn’t
install anything I need for my daily job. So, I brought my private notebook for some demos.
I figured I’d get here early to make sure the beamer works this time. I couldn’t present my
slides last time; there wasn’t a single adapter in this old school office. Who has VGA input
only these days? Anyway, I bought myself an adapter on eBay. I’ll get this ancient relic of a
beamer to work this time.”
“You weren’t the only one with problems,” Mahsa complained. “It was such a challenge to
get this appointment. So many people were unavailable. Others declined meeting requests
at the last minute. This is the fifth attempt to get everyone together.”
“What is the meeting about?” Bob inquired. “Do you have a written agenda?”
“We shared our thoughts on the internal chat group,” Rashid said.
“Are you sure that people have read them? Are people aware of what you are planning to
do?”
“Ehm . . . I guess so. Why else would they accept a meeting?”
Bob shrugged. “Everyone has to book work hours somewhere. Attending meetings legiti
mizes spending time with something, especially if they don’t have much to do elsewhere.
It’s not guaranteed that everyone who attends a meeting is also interested in its outcome.
Who did you invite?”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
22.2 Storming 673
“The product managers, Martin and Roza, and the nerds from hell. We finally need a proper
data platform to analyze our data. Our private notebooks have reached their limits, and I am
tired of excuses that they are ‘working on it.’”
At that moment, Roza and Martin, the product managers, entered the room. Both uttered a
quick greeting and hurried to take a seat at the furthest end of the meeting table. Both
opened their notebooks and started typing. Klausi eagerly went to greet them, but both only
returned a quick “hi,” without looking up.
Five minutes later, Cesar and Bill from IT entered. Cesar was lecturing Bill about something
technical and barely paused to nod to the others in greeting. Bob could not decrypt anything
into a common language that a non-IT professional would understand. When Klausi ap
proached them, he earned a ‘Clint Eastwood stare’ from Cesar. Klausi shrugged, kept on
smiling, and finally sat down, drumming excitedly on the table.
The product managers continued to stare at their notebooks, saying nothing; Cesar kept on
lecturing Bill, who kept on nodding. The data scientists checked the time on their smart
phones, waiting for the final visitor.
Five minutes late, the IT manager, Peter, finally arrived. Without saying a word, he sat down,
pulled the cookies on the table closer to him, and started to munch. He gobbled down one
after another, creating small piles of crumbs before him. Cesar finally ceased his lecture,
and opened his notebook.
Mahsa launched immediately into her presentation, talking so fast Bob found it hard to fol
low.
“Here, we want to show some slides of what’s kept us busy the last few weeks. After our
interviews, we concluded that we wanted to analyze our client data first to see if we could
improve anything. We ran into data quality issues. Multiple systems source our data, and
many clients are stored in each system multiple times. We hoped to get a golden data set to
get a unique view of the client, and then we want to give them personalized offers based on
their past behavior and —”
“Stop!” Cesar interrupted with his strong Spanish accent. “You have permission from
InfoSec?”
“Don’t worry about that,” Rashid waved away Cesar’s complaint. “We are cautious.”
“No, no, no . . . you still need permission,” Cesar insisted. “Who gave you access?”
“Fred, from Marketing. He shared his credentials with us, since he has super user rights to
almost every data source. He insisted it was okay.”
“Ay, ay, ay. . . I’ll have to investigate this very thoroughly,” Cesar warned. “This will have
consequences.”
The data scientists slumped back in their chairs. Cesar pulled out an energy drink, opened
a small paper notebook, and began jotting down notes, staring suspiciously at his two col
leagues.
Peter raised his hand but was distracted by someone passing by.
“You want to say something, Peter?” asked Bob.
“No. I just saw Jacqueline from the front desk passing by. I told her to bring me some coffee
and more cookies. . . But I also have a question, now, as you ask. Will we finish in time? I
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
674 22 Creating High-Performing Teams
must be early in the cantina today. Today is Viennese Schnitzel Day. The cook and I have a
long history. And we have had a feud since the day he wanted to give me Schnitzel made
from pork. Since then, we have been at war. I even wrote to HR to complain. Everyone knows
Viennese Schnitzel must be from veal. If I am late now, the cook will ensure that they give
me only a Kinderschnitzel, as we call a very small piece in my home country.”
There was an awkward silence, until Rashid broke it.
“Maybe we should discuss our infrastructure issues immediately to speed things up then.
We collected the data from various sources and used our notebooks to assess some findings.
But now we need a more professional approach. We tried to use this corporate data ware
house, but too many features are missing. We propose, therefore, to push some data to a
cloud-based data lake and run some jobs there.”
“Cloud?” Peter said loudly. “This company has a ‘no cloud strategy’. I thought I made myself
clear.” His balding forehead had begun to turn a color Klausi could have identified as coral
red.
Klausi raised his hand. “May I suggest something?” But Rashi answered first, addressing
Peter directly.
“For some data use cases, we collect unstructured data in files, and we need to run machine
learning use cases on top of it. How do you think I get this data into a data warehouse that
requires structured data and cannot handle files?”
“Don’t lecture me,” Peter said, his face slowly coloring to match his forehead. “I was already
programming when you were still in diapers. And back then, this was still real program
ming with real programming languages. Everything you will need is being built right now.
My team is working on it. You will get your platform. I designed it. It just needs some time
to be fully implemented. That’s all.”
“We talked with some data analysts who have been here longer than us,” Rashid said. “Ac
cording to them, your team has worked on new data platforms for years. Nobody has seen
an architecture diagram or presentation, let alone a demo.”
“I am a very technical person. As such, people will not understand my technical designs.
They misunderstand when I show them something drawn on paper, and then I have to ex
plain everything. I stopped doing this. Good things need time. It is done when it is done. I
will inform you.”
Mahsa crossed her arms, her face reddening too. Klausi might have called this Pantone red.
“We have been here for three months. We tried a lot. We suffered through a lot of challenges.
But in the end, we are limited by what the company gives us. We cannot meet the goals if
you only talk about Schnitzel.”
“Do you want me to lay out the red carpet for you?” Peter retorted. “I have limited resources,
and I am an engineer, not a magician.”
“Excuse me,” Bob said. “I understand that the requirements come from the business. And it
is the business’s problem if things are not finished in time. IT systems must enable business
processes. Martin and Roza, you are production managers. What do you need?”
Martin looked up from his notebook for the first time. “I am not a technical guy. I cannot
contribute to this discussion. I’m fine if I get the functionality we agreed upon; I’m not in
terested in the details.”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
22.2 Storming 675
“But you should be more involved,” Bob said. “In the end, it’s your product. And if the data
scientists have no platform, how can they finish your product?”
“Well. . .” Martin paused. “I talked with the users again. And the users are also not so sure
about. . . you know, everything is very complex. Honestly, now, we have so many other prior
ities. I want to finish my old stuff before I take on new things. But look, I have to go to an
other meeting now. Let’s talk again in a month. I should have more time then.”
While Martin packed up his stuff, Roza also finally spoke up. “I built my product on a new
data platform by the vendor Speedy Gonzales. His platform, Andale, is simply amazing.
I want to create my product on top of it. I have some data engineers from the vendor on-site
now. I can ask them if we could host some of your data on our platform.”
Peter bit his lip at this. His face was really crimson now as if hellfire were boiling inside his
head.
“I still want to say something,” Klausi interrupted, louder and more seriously than he’d
spoken so far.
“Well go on then!” Peter yelled.
“My specialty is aligning human-machine interfaces with human feelings. My questions
revolve around the following: How do our users feel? What does it mean to them when they
see something that needs to work.”
He paused as everyone stared at him. The smile was long gone from his face. “All I wanted
to say was that we should ask ourselves what our users want. Is that not what every com
pany is supposed to do?”
Peter shook his head and then stared angrily at Roza. “What is a non-approved system doing
in my IT domain, and why am I learning about it in this meeting?”
“The vendor provided a test system for us. So, it is outside of your responsibilities.”
“I am responsible for all corporate IT, including some alien systems. Stop whatever you are
doing right now.”
“So far, the vendor has given us access to one of his systems. He takes full responsibility for
the POC. I will approach you when we need to integrate Andale into the rest of our IT sys
tem. We have an ambitious timeline, so I expect the full support of IT during integration,”
Roza said, staring Peter down the whole time.
“There is no IT outside my IT in this company. And the IT is me. I am not integrating any
thing I do not know about. Every IT topic is my responsibility. What if we are being hacked?
This POC must stop now.”
“If security is your only concern,” Roza said, “why don’t we return to electronic typewriters?
They’re more secure, after all. Or how about smoke signals? They’re even safer.”
I need to pull the plug before they kill each other, Bob thought. He stood up. “Ladies and Gen
tlemen, how about we postpone this discussion? Let’s go for lunch early and talk another
day.”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
676 22 Creating High-Performing Teams
Bob met with Alice for their regular status meeting the next day. Before he could greet her,
Alice approached him with a pressing question.
“What was going on in that meeting yesterday? First, Peter, then Rosa an hour later; both of
them stormed into my office demanding that I fire the other immediately. I also received
emails from two of our data scientists saying they would quit if such meetings continued.
How can we cool down this situation?”
“You’re right, it’s bad,” Bob said, “but I’ve often seen such storming phases with other
clients. In transition projects, employees from different departments suddenly need to col
laborate closely. Some individuals have built up their own biased perceptions about their
deserved status and role within the company over the years. Clashes are preprogrammed if
multiple, status-driven people who think they deserve to run the place are suddenly forced
to team up. Transformation projects rarely cause conflicts; instead, they unveil existing
frictions. Some teams manage to sort things out. However, lasting cold feuds may be even
worse than employees fighting openly. If key stakeholders for a successful company transi
tion avoid or even sabotage each other, you end up in a stalemate. Full costs, but no prog
ress.”
“I am already doing so much more than I should in my role as a chairwoman. I do not want
to deal with interpersonal issues, too. Are you talking with HR about this? Isn’t it their job
to resolve conflicts?”
“HR can help to address issues between individuals. However, the source of these conflicts
is mostly a lack of clarity. Only executives have the power to resolve that situation. Start by
ensuring that there are no conflicting goals. One person’s yearly performance target might
be to make a company more secure, while another’s could be introducing innovation. To be
more innovative, the company needs to open up; to be more secure, the company needs to
be more closed. You see where I am going?”
“I understand,” Alice said, opening a notepad. “But what can I do?”
“You need to ensure that parties whose interests might diverge are still coerced to find a
middle ground. Be transparent about company goals and give parties all the tools to be suc
cessful. But be ruthless if employees start acting like characters in ‘Game of Thrones’.
Some executives hire ‘rockstars’ who are supposed to solve all problems, but do not give
them enough leverage to address organizational issues. The first rockstar fails to transform
the company; he might even become the scapegoat for most problems after he leaves. The
next rockstar is hired, and she faces the same problems as her predecessor. Don’t fall for
this trap! Some leadership responsibility cannot be delegated.”
“What else did you discover?”
“The data scientists do too much on an organizational level. They set up meetings to negoti
ate for getting access to cloud computing resources. I met Mahsa and Rashid, both compe
tent employees who could spend their time with more sophisticated problems in their do
main than with administrative tasks.
Being agile is more than working in iteration. It also means facing organizational chal
lenges. A Scrum Master’s job is to remove challenges for a team and take care of adminis
trative details. In good agile processes, in a daily standup, there are always three questions:
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
22.2 Storming 677
What did I do yesterday? What will I do today? And what prevents me from doing my work?
The answers to the last question are the Scrum Master’s work packages. Let data profession
als solve data problems while Scrum Masters take care of organizational issues.”
“I understand. What else?” Alice asked, her pen hovering expectantly over the notebook
page.
“I also found your product managers too disassociated. Martin seems to not care much about
the product, and Roza puts a lot of hope in an external vendor. Both should work closer with
the team.”
“I know. Martin is overworked and has recently faced changes in his private life. He often
complains that colleagues try to pull him into meetings. I fear he might burn out, so I ask
everyone not to put additional pressure on him.”
“Every product needs a Product Owner who defines its functionalities. Martin, for sure, does
not need to be involved in technical detail discussions, but he needs to provide a vision for
the team. And he cannot pull himself out by saying he is not interested in technology. In that
meeting, it seemed he was even doubting the product.”
“How do I resolve that?”
“When a team feels that an overworked Product Owner is not interested in a product, the
product will fail. If Martin is overworked, find someone to whom he can delegate work. If
Martin has difficulties delegating work, put him into leadership training focused on improv
ing that skill. Some employees who do great hands-on work fail to lead others; they believe
that becoming a leader is a natural reward for excellent operative work, and that it entitles
them to tell others what they need to do. But leadership is far more than that. Employees
who are being promoted must understand that leadership is a skill that needs to be learned,
like every other. And if they continuously do poorly at it, they do not deserve to be leaders.”
“How can I ask people to attend a leadership seminar if they already do not have enough
time to do their regular work?”
“Maybe you know the story of the lumberjack and the blunt axe. A lumberjack is struggling
to cut down a tree with an old, rusty axe. One of his friends approaches him and asks, ‘why
don’t you sharpen that thing? You will be done much faster.’ And the lumberjack replies,
‘I can’t. I have no time. I have to chop down this tree.’”
“Hmm,” Alice sat back in her chair, nodding slowly. “I will talk with Martin about this. What
about Roza? She is new to our company. We hired her because she is highly ambitious and
pushes for results. Although she might occasionally upset some colleagues, she does not
give up easily. I hesitate to restrain her, even if some members of her team told me she
sometimes might push for results too hard.”
“She wants results and faces an IT that blocks innovation. However, bypassing a disliked
department by not disclosing her talks with an external vendor is not an option. You might
want to define a RACI matrix to clarify individual and team responsibilities. I will explain
the details of what RACI means later. ”
“And what about Peter? He was really furious. I do not understand this; Roza pays this
external vendor from her own team’s budget. Why is he so bothered?”
“Put yourself in the shoes of an IT expert! Many remember painfully running some night or
weekend shifts to fix unexpected problems after a sudden change request. It might be sum
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
678 22 Creating High-Performing Teams
mer, your colleagues are having fun in their free time, and you sit in the office, hoping and
praying that an IT platform is finally fixed again. Occasionally, you might have no clue about
the root causes of a problem. And while trying to decrypt strange error messages, you al
ways have a Damocles sword over your head in case you do not get rid of the issues in time.
Think about it, and then you will understand why they ask to ‘never change a running sys
tem.’
You have two cultures: One party wants to deploy changes as fast as possible, and the other
wants to prevent them. The DataOps model tries to address this problem. We should talk
about this, too, in more detail.”
“Alright, let’s sum up for now.” Alice began circling keywords on her notepad as she contin
ued. “You mentioned agile processes, RACI Matrix, and DataOps as processes to improve
Halford. Let’s get started! Feel free to add more matching topics if they help me help my
team perform better.”
■ 22.3 Norming
22.3.1 Change Management and Transition
The decisions of the executive management shape an organization’s reality, including how
data departments are integrated into an existing organization. Different hierarchy struc
tures lead to different work cultures. Let’s compare some strategies to integrate teams to
establish a data-driven approach. Then we can explore some possible outcomes of the strat
egy decisions.
Company A hires a CDO and builds up a data team. The CDO reports to the CEO.
Company B hires data professionals with a Head of Data who leads them. The Head of
Data reports to the CIO, who reports to the CEO.
Company C assigns data scientists to every department without creating a specific data
department. The data scientists report to the department lead.
Company D splits its departments into operational units and Research and Development
(R&D). The data department is part of R&D.
Company E is an international company divided into a group organization and operating
companies (“Opco”s). Each Opco might have different regulations, goals, and structures.
Data scientists may face different situations depending on which organization they work
for.
In Company A, the CIO is the peer of the CDO. The CDO defines a data strategy based on the
CEO’s corporate strategy, and the CIO provides the infrastructure for data science. As such,
the CDO is a customer of the CIO, and the IT department, as an enabler, must build the
platform the CDO needs to accomplish the data strategy goals. The CDO can report directly
to the CEO if the CIO does not deliver what was promised, which increases the pressure on
the CIO.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
22.3 Norming 679
In Company B, as part of the IT department, the data team gets first involved in building a
technical platform for data science. Once the platform is ready, in the second step, the team
tries to find out which business challenges it should solve.
In Company C, departments work without a corporate-wide alignment. Data scientists may
work in parallel on overlapping problems without knowing it.
In Company D, the R&D department is too detached from operational units. Consequently,
the R&D employees sit in ivory towers with great ideas, but only a few of them address the
users’ pain points. In parallel, the operational units build their teams, which take care of
innovation. These newly formed teams compete internally with the R&D team.
In Company E, the situation is like that of Company D. However, with internationally dis
persed Opcos, the collaboration can become even more chaotic when a centralized group
unit tries to solve the problems of teams on the other side of the world, while the teams
themselves did not even know before that supposedly had a problem.
To avoid these problems, some companies hire a transformation manager to oversee the
migration of a corporate structure to become data-driven. As every transition involves
changing existing organizational structures, the transformation manager needs the strong
backing of the CEO and the board to be successful. Without leverage, a transformation man
ager is just a flag in the wind.
Integrating data science and artificial intelligence into a company impacts how a company
needs to operate. Some companies try to avoid restructuring by somehow fitting data agen
das into each department so that nobody loses authority. Ultimately, these companies will
operate similarly as before, but with greater expenses. Newly hired, intelligent people waste
a lot of time on tasks to avoid changes in a hierarchical structure, which would upset mainly
the team leaders. Often, these companies can only move forward by walking the line. For
every step, they need to weigh out possible consequences. One wrong step and some “land
mines” might detonate, which triggers explosions of anger in some departments, who feel
that their authority is questioned. Welcome to bureaucracy!
No book can advise on how to structure a company without knowing the company. However,
meek strategies, and fear of decision-making, have never made companies successful. To
enable the company to innovate in the future, executives must sometimes make hard deci
sions, which may be unpleasant for some individuals.
In the scenario above, Peter and Roza have a strong conflict of interest. A
decision maker might to try to look for a compromise. Peter stays the sole
sovereign over his domain, Roza can build her own small ecosystem outside of
Peter’s empire. To soften the blow for Peter, the decision maker might still
insist that Peter’s team is involved in the some of Roza’s activities, while also
providing them with veto rights. They may also see to it that no other data
science use cases are allowed on Roza’s platform, other than those from her
department. Peter might still be upset, as he now faces a small Gaul village
resisting his overall authority. Roza might resent the fact that although she has
now her platform, she still needs to incorporate Peter. She might fear that this
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
680 22 Creating High-Performing Teams
will slow her down, or that, without details, a veto right could mean anything.
Lastly, the decision maker might decide it’s best not to upset Peter further,
and might delay any decisions regarding the cloud platform the data scientists
are asking for. This decision maker perhaps asks the data scientists to find
some way to integrate their work into the databases that Peter provides for
them. If they could bring evidence that there is really no way to work with
Peter’s databases, the decision maker would be open to talk again with the
data scientists about alternatives to Peter’s platform.
Such a scenario would be a compromise that does not solve anybody’s prob-
lem. And too many such decisions might bring chaos to a company. Data
scientists still might try to circumvent Peter and get their work done through a
backdoor, or they might revert to go into a “service according to regulation”1
mode, in which progress inevitably gets sluggish.
Lack of courage, too many compromises, missing transparency about process-
es, complex relationships and strange behavior by managers are all ingredi-
ents for creating a toxic work culture. Trying to transform a toxic company to a
data-driven one without addressing the issue is impossible. To prevent toxicity,
you can use evidence-based approaches with data to measure the toxicity
level of your company. The minimum you can do is to send regular surveys to
employees and to ensure that HR evaluates them carefully and addresses red
flags. With time AI will provide us with more ways that are employee-friendly,
legal and that ensure zero-tolerance for toxicity.
As an employee, who has no leverage to influence this situation, it often helps
to read books from the Ancient Stoics who were faced with a lot of uncer
tainty during their lives, to learn how to deal with situation that are hard to
change. Seneca, Marcus Aurelius and co. might be able to give better advice
than any excited manager who tells you you’re working for the greatest
company in the world; especially when things simply don’t feel that way.
In the scenario above, employees assumed that they had specific rights and responsibilities,
maybe without evidence. The IT manager claimed ownership of all IT-related activities at
Halford and expected nobody to introduce a new data platform.
While Roza acknowledged Peter as the owner of the corporate IT, she assumed she had the
right to choose the platform she wanted for her department. She saw possible risks as well
managed, given that an external company was accepting responsibility for operating the
platform on its own infrastructure. In addition, Roza paid for this POC from her own depart
ment’s budget.
1
Called “Dienst nach Vorschrift” in German, this represents a state where employees do only what they are told to,
and do not show any self-initiative anymore.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
22.3 Norming 681
Some might pick Peter’s side. As he is responsible for IT, Roza should not decide on a new
vendor without knowing the efforts needed to integrate this platform into remaining corpo
rate IT. Others might stand with Roza. If her job is to innovate and IT does not provide her
with the necessary tools, she must find another way.
Alice can resolve this situation. She tells both parties she expects them to collaborate better.
She gives Roza the right to decide on platforms independently from IT, but also insists Roza
consult Peter before deciding on a platform.
As shown in Figure 22.1, a RACI matrix visualizes who is accountable, responsible, in
formed, or consulted.
In the first phase, Proof of Concept, the platform is being tested to determine whether it
meets the team’s needs. The vendor is responsible for operating the platform. Users who
face technical problems with the platform, such as performance or access issues, will
approach the vendor. As an accountable person, Roza pays for the platform through her
budget. Assuming Peter has security concerns regarding the platform, Roza decides to use
it, nonetheless. As an accountable person, if there’s a security breach, it will weigh harder
on her if the breach could have been prevented by following Peter’s advice. Depending on
the impact of the security breach, it could get very unpleasant for Roza.
In the second phase, implementing the data use case, data is being integrated from opera
tional systems that Peter owns. Peter is responsible for extracting the data from the source
system and providing it to the data scientists. The data scientists will sign the data inte
gration off and are therefore accountable. Some companies might adjust the acronyms to
their needs. They could, for instance, add an ‘E’ for escalation. Then, it would be clear whom
the data scientists can address in case Peter fails to do his job.
The data scientists analyze the data, and Roza signs off on this phase once she is happy with
the results. After that, the platform is put into production, making it available to a larger
audience. Here, the data scientists should also be consulted, as they might have some input
from their experience during the previous phases.
A complete RACI matrix might add additional phases and details. RACIs can be defined, for
example, to protect data privacy or user integration. Of course, multiple parties can also be
added to the ‘C’ and ‘I’ columns, as often multiple parties need to be informed or consulted.
A RACI matrix is the first step to creating clarity by defining who is responsible for what.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
682 22 Creating High-Performing Teams
22.3.3 SMART
The IT manager, Peter, promises to deliver a data platform that solves all problems soon. In
the meeting scenario we just witnessed, nobody was confident about his promises:
Peter was not specific about what his platform does.
Success was not measurable.
Many employees doubted that he would succeed in delivering a platform, given how he’d
continuously stalled in the past.
The team did not know if the platform would solve the team’s relevant problems.
And Peter did not give a deadline for when he would be done.
SMART is an acronym that defines five quality criteria for a well-formed goal. It is a com
mon practice in many project management systems, and can help clarify the intended out
come of tasks and projects.
Imagine you receive an email from Bill. How would you compare his claim to Peter’s?
“Hi!
My name is Bill. I took over the data platform topic and was asked to develop a new plan.
By the end of this quarter (TIME), we plan to launch a new data platform on Microsoft Azure3 that
allows all data professionals to store unstructured data in the cloud on blob storage. We will also
set up Azure Databricks (SPECIFIC). We understand this is the preferred platform for the data
scientists (RELEVANT). I have agreed with the data science team on acceptance criteria to validate
performance, robustness, and other metrics. Our team has engineers who did similar tasks at the
same time interval (ATTAINABLE).
Bill.“
2
Copyright: https://www.bitesizelearning.co.uk/resources/smart-goals-meaning-examples
3
In this example, we picked some concrete technologies to underline that this goal is specific. Bill might choose
other technologies as well.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
22.3 Norming 683
Even though the SMART goal is still just a goal and not yet a reality, it already helps the
teams to plan. They might now ask what they need to do to get data from the operative
source system onto the platform. Again, such a goal can be defined as a SMART goal. If the
data scientists receive such an email, they will be far more confident in accomplishing their
work than hearing Peter’s remark that his platform will be done ‘when it is done.’
Defining reasonable SMART goals improves with practice. Everyone can write down their
first idea and refine it until all criteria are met. SMART goals are also crucial in specifying
Scrum Epics4, larger and more complex features or fixes for a project, that need to be broken
down into more manageable user stories in agile processes. The better that Epics are for
mulated, the SMARTer we can define what we want, and the SMARTer we can find out how
we get there.
When people talk about agile processes, they often discuss methodologies that help teams
organize themselves, such as Scrum and Kanban. While Kanban gives more freedom,
Scrum gives you more of a framework with some principles. Three Scrum principles might
help Halford solve its problems:
Work in iterations: At the end of each iteration, deliver a working product (we are not
talking explicitly about a final product to be released to everyone; a working product can
be some intermediary results that are executable). The main goal of a working product is
to show progress, which continuously develops a product. In our scenario, working in
iterations would make transparent that the data teams have almost no progress at all, as
they are blocked internally.
Keep your data professionals focused: In pure software development teams, we would
call these “developers”, but in data teams, we have more job titles. This usually involves
having one person on the team, a Scrum Master, dedicated to fixing all non-technical
problems that are blocking the technical employees. In our scenario, the Scrum Master
would have coordinated the meeting about getting access to a data platform, freeing up
the data scientists to focus more on their core skills.
Integrate the Product Owner into a team: A Product Owner defines what features need
to be built into a product in order for it to be successful. By keeping the Product Owner
close, the team guarantees that the output resembles the company’s needs. In our sce
nario, Roza or Martin would be accountable for the product, and they would feel more
strongly the need to support the data team.
4
https://www.workamajig.com/blog/scrum-methodology-guide/scrum-epic#:~:text=A%20Scrum%20Epic%20is%20
a,when%20creating%20their%20sprint%20goals
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
684 22 Creating High-Performing Teams
In an episode of the sitcom The IT crowd, a fire breaks out in the IT depart-
ment. One of the characters, Maurice Moss, has an unorthodox way of deal-
ing with it: he sends an email to report the fire.
Everyone knows at least one tech person who might act quirkily in social
situations. Instead of talking directly to their counterparts, they might send
cryptic emails back and forth, or avoid getting anything done that requires a
bit of empathy. Sometimes, we feel relief when these individuals are kept
from tasks that would require interpersonal interaction and social skills.
Everyone knows also at least one tech person who handles every social
situation well. Still, why should a smart engineer, who was hired to solve
complex technical problems, be expected to spend all their time in meetings
about administrative topics? As humans are not made for multitasking, the
overall performance of data professionals who are bombarded with too
many trivialities will decline.
Part of a Scrum Master’s job is to remove the administrative overhead from
technical teams so that they can focus on their core tasks. As a natural
problem solver, they help to get the formalism done to get access to data;
they organize meetings, and perform other administrative tasks, that might
block data professionals from doing their work.
The ultimate goal of Scrum is to keep the team in a constant learning and improvement
cycle with every sprint. The best Scrum Master is a person who understands how to orga
nize ceremonies and rituals, such as Sprint Reviews and Retrospectives, so that the ways
of working get smarter and the team stays passionate to outperform.
Imagine that Halford hires Syuzi, a Scrum Master. Syuzi is outgoing and vibrant, and she
enjoys working with clever data professionals. With experience in many other IT compa
nies, she knows how to deal with bureaucracy and find pragmatic ways to get things going.
But let’s read her report after six months.
“Hi there!
This is Syuzi. My Scrum team now consists of two data scientists and two data engineers. Martin
has delegated the role of Product Owner to Ada, who reports directly to him. We also get support
from a DevOps team and a freelance user interface artist.
We work in two-week sprints. We start each sprint with a sprint planning meeting to align on goals.
At the end of our sprint, we work towards a running system and present the results to Ada. Ada
knows the users’ needs well and always gives us important feedback at the end of each demo. She
keeps collecting their wishes and requirements, so we always have a full backlog of work items,
with clear priorities.
In the Daily Standup every morning, each team member answers what they did yesterday, what
they will do today, and what might prevent them from succeeding. Part of my job is to remove such
roadblocks.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
22.3 Norming 685
At the end of each sprint, we have a Sprint Review and a Sprint Retrospective. In the review, we
collect our performance statistics and compare them with previous sprints. We try to improve
our performance with every sprint. To ensure that we can do that, we discuss in the retrospective
what went well and what can be improved.
We also have pizza night and play board games every second sprint. The social factor is important
to keep the team’s energy vibrating.
Syuzi”
Some experts believe that Kanban might be better suited to some data teams. Scrum is often
not the best choice for these teams because stories like “explore this new data set” or “try
out these different ML algorithms for this problem” are usually hard to estimate and hard to
fit into a sprint. Kanban’s pull-based system focuses more on visualizing and managing
flow, allowing continuous and on-demand task initiation. Data teams often work in very
similar flows. First, you need to get permission to access data, and then you extract and
transform it. Once the data is ready for analytics, models are generated and continuously
reviewed. In Kanban, such workflows are visualized in phases, and Kanban visualizes if
teams tend to get stuck in one phase.
The Halford meeting scene revealed many communication and organizational mistakes, so
it’s worth reflecting on them, before moving on. Read over the scene again. Can you identify
all the mistakes that were made?
Enabling Professionals
In the meeting scenario presented earlier, we highlighted that the data scien-
tists used their private notebooks to get their work done. Data professionals
have different requirements for notebooks than regular users. Subjecting
data professionals to bad tools and other constraints is like buying a sports-
car to win the Grand Prix, but going into the race with flat tires.
Strong and modern notebooks, two monitors, a fancy office space and many
other small benefits show employees that they are appreciated, and, that as
a company, you also expect them to perform well.
Have you made your list of possible mistakes in the meeting? Here is a list of things that
could be improved:
The meeting had no written agenda.
The meeting room was not booked in advance.
Colleagues might have made themselves unavailable for meetings by blocking their cal
endars.
Colleagues cancelled meetings at the last second.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
686 22 Creating High-Performing Teams
22.3.6 DataOps
When we understand a business unit’s goals, we can address issues and create successful
processes to ensure that other teams do not accidentally step on their toes. The results are
phases, as shown in Figure 22.3, that help guide engineers on what to do within the stages
of a data project.
5
https://customsitesmedia.usc.edu/wp-content/uploads/sites/370/2015/05/17225200/Basic-Guide-to-
Conducting-Effective-Meetings.pdf
6
https://hbr.org/1976/03/how-to-run-a-meeting
7
https://sixth.ucsd.edu/_files/_home/student-life/conducting-successful-meetings.pdf
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
22.3 Norming 687
Let’s get some perspectives from a business unit perspective, and define some mantras for
each:
IT: Never change a running system.
InfoSec: Tell us exactly how you plan to use data, and we will let you know if we are okay
with it.
Users of AI Products: Give us insights to be more successful in our day-to-day work.
Business: Reduce costs, enter new markets, and make faster decisions.
Data Professionals: Get the job done without bureaucracy.
In Chapter 3, we introduced the concept of software containers. Engineers deliver a com
plete deployment package, including a mini operating system and business logic, as con
tainer images. These images run as containers in host systems, such as Kubernetes, pro
vided by the IT administration. Imagine having Kubernetes as a secure sandbox that handles
how containers are executed so they can do no harm and fulfill their intended functionality.
New deployments do not require a change of a running system. Using these DevOps prin
ciples resolves the conflict of interest between IT and Data.9
In Chapter 1, we introduced the Head of Security, whose job is to ensure clients’ data pri
vacy and to protect the infrastructure against cyberattacks.10 In the scenario at the begin
ning of this chapter, the data scientists simply experimented with data without getting any
information security approval.
Like IT and Data, where one party may want to slow things down, and the other wants to
create fast results, there is a similar conflict between Data and InfoSec. Security audits can
delay data projects. Data scientists would love to get their hands dirty, but security opera
tives insist that various questions must be clarified before the data scientists can even get
near the data. During meetings to resolve privacy questions, additional questions may come
up that delay data projects even further.
8
Copyright: https://www.montecarlodata.com/blog-what-is-dataops/
9
Although containers are associated with DevOps, working with data includes development and analytical
practices. Therefore, DevOps can be seen as a subcategory of DataOps, too.
10
In some companies, these roles are separated into information security and platform security.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
688 22 Creating High-Performing Teams
DataOps can help set up an environment that satisfies both parties. Nobody, including secu
rity operatives, wants to be held up for weeks — or even months — discussing opening fire
wall ports or how data should be anonymized. Fortunately, there are common practices to
maintain a project’s performance and address these problems using known principles.
These principles can include:
Common Principles
Efficient Processes
Synthetic Data Labs
Let’s now look at each in more detail.
11
https://www.cmu.edu/iso/governance/guidelines/data-classification.html
12
https://security.berkeley.edu/data-classification-standard
13
https://www.angelo.edu/administrative-support/information-technology/it_policies/dataClassificationStandard.
php
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
22.3 Norming 689
An age category (e. g., between 45 and 60) instead of the birthdate,
An occupation category instead of specific job title.
If irrelevant attributes for use cases are unnecessary, they can also be removed. This can
also shrink the data size. Smaller data might also benefit data pipelines that continuously
transform PII data into anonymized data sets.
Statistical anonymity measures such as k-anonymity, l-diversity, and t-closeness to validate
a generalized data set from PII returns the chance that a person in an anonymized data set
can still be identified. The moment data is not classified as PII anymore, data scientists have
the freedom to work with it without fearing violating data protection laws.
An alternative to extracting and anonymizing data from source systems is to use synthetic
data, an artificial version of real data, so that data scientists can still train models without
concerns of violating privacy laws.
If a data transfer is intercepted while it is being transmitted, or if hardware such as note
books is stolen, encryption is one way to ensure that leaked data cannot be exploited. There
fore, DataOps can also contain best practices to ensure data encryption. While most IT de
partments will take care of the physical encryption of hardware (through drive encryption,
for example), data teams need to ensure that they use encryption when they send data
through channels that can be intercepted. The most prominent standard is Transport Layer
Security (TLS). This book introduces platforms such as Apache Kafka and others, which
can also be configured to encrypt data. A DataOps principle, therefore, could be to use TLS
whenever possible.
Every platform that contains sensitive data will provide access to users who can provide the
required credentials, such as passwords or tokens. While these platforms can be configured
to maximum security using two-factor authentication and similar methods, a platform
cannot force users to store their credentials safely. To ensure that passwords and other sen
sitive information are stored safely, every data team can use secret managers, often also
called vaults.
There is no golden recipe that explains in detail how teams need to integrate security into
data projects. However, as is usual in agile projects, the rule “individuals and interactions
over processes and tools” can help establish the ideal way to proceed.
■ 22.4 Performing
We started this chapter with a scenario in which employees at Halford got into a big fight.
Let’s envision another scenario, where the Halford teams have aligned on a growth mindset
and are now performing well.
We do not know the details of what happened between then and now. Maybe they were all
hypnotized by a Tony Robbins-like performance trainer. Maybe they all went deep into the
woods to jump naked around a fire—some less popular manager might have “accidentally”
almost landed in the fire—and firewalked over hot stones while reciting the wisdom of spir
itual teachers. Whatever it was, it required much effort, empathy, and patience. And most
likely, it took responsible people pushing themselves even further from their comfort zones
than they might have in the examples just listed.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
22.4 Performing 691
The Adventurers are the first Scrum team formed at Halford for the transition. They work in
two-week sprints, and now they are meeting for a Sprint Retrospective. Syuzi, the Scrum
Master, is moderating the ceremony. Maria, who is considering joining team Adventurers as
part of a team swap (a new development initiative HR has set up), has been invited as a
special guest.
“Hi Maria! Pratima told me you’re considering joining our team. That’s great. She also asked
me to share a bit about how we work.
We sometimes see ourselves as slightly crazy people and are proud of that. We are passion
ate about what we do, and this requires us to be off track once in a while. You know. ‘Work
hard, play hard’. We want to get things moving, see results, and be proud of our work. If
needed, we also challenge the organization occasionally, but that is okay because as long as
we deliver the results, no manager will object to us stirring up things.
Another motto is ‘come as you are’, which goes well with the ‘play hard’ thing. Do not try to
clone a stereotype; just be who you are. We do not care about skin color, gender, sexual ori
entation, or nationality. All you need to be is passionate about the craft. Although we are
forgiving if someone has the occasional bad day, everyone on the team should treat each
other with warmth and respect. If you meet our team members, you will immediately feel
they are personalities.
Our Sprint Retrospective best reflects our growth culture as it focuses on continuous im
provement. One of our team members is a sports fanatic. He keeps reminding us of a quote
by the ice hockey player Wayne Gretzky: ‘After every day, no matter how successful you are,
you can always aim to be even better the next day. If we believe in that and keep our routine
of continuous improvement, we will succeed.’ This growth mindset has become part of our
team’s DNA. Some teams have ‘taboo topics’, elephants in the room that everyone is afraid
to address because they are afraid to hurt anyone’s personal feelings. In our team, it is dif
ferent. We cultivate feedback, and we never criticize personal traits, only behavior. This
openness and respect for each other leads to trust, and once you have team members that
trust each other, you can move mountains.”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
692 22 Creating High-Performing Teams
Figure 22.4 shows the result of this Sprint Retrospective. Without context of the tasks the
team is working on, it is impossible to understand what the various Post-it notes mean in
detail. But don’t worry: it’s more important that you understand the concept, and, hopefully,
try it out in your own team!
The scenario in “50 Shades of Red” represented a company with a fixed mindset. The em
ployees cared a lot about their status, and paid little attention to other people’s challenges.
There was a culture of fear, as employees tried to intimidate each other. In real life situations
like this — which the author has had the misfortune to witness! — the dialogs are often much
more extreme, and inappropriate to reproduce here.
Suppose you do not focus on creating an open communication culture and cultivating a
growth mindset. How can you expect to solve complex problems with data science and AI if
teams cannot talk openly about challenges?
14
Own Drawing
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
22.4 Performing 693
The stone cutters all work at the same site and perform the same task, but each has a differ
ent perspective. Imagine yourself now in a data team where employees only focus on the
details of their work, or their income. You might hear statements like, “I do what I am told
without questioning anything,” or, “as long I get paid, I do not care about what I do.”
Some jobs require employees who follow orders. Data professionals solve complex problems
to automate these kinds of jobs. They create an environment that encourages creativity and
unconventional thinking. Every leadership team should want professionals who take pride
in their craftsmanship, their work, and what they create.
The team members of successful teams are open to each other. They share more open
thoughts, and occasionally, they address tricky topics. At the same time, their goal is to
grow and learn; they will embrace feedback as an opportunity, not as a threat. The idea of
focusing on continuous improvements is not new.15
Exercises
Find out how data is embedded in your company’s hierarchy. Do any of the
possible problems we outlined in the section “Change Management” apply
to your organization?
Can you think of a common process in your organization that you would
love to model using a RACI matrix? Try it out: how does the matrix look to
you? Which ideas did you gain along the way about how to improve the
process?
Next time you create a to-do list or define goals within your team, check if
those goals are SMART.
This chapter explored the sprint retrospective ceremony, and touched on
sprint planning, daily standup, and sprint review. Explore these Scrum
ceremonies yourself in more detail, and take a look at product backlog
grooming, one ceremony we did not cover here.
We mentioned that there are different opinions on whether teams should
use Scrum or Kanban. Explore the pros and cons of both: which method
ology matches better to your team?
15
https://jamesclear.com/atomic-habits-summary
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
694 22 Creating High-Performing Teams
Some engineers argue that Kanban might be better for data teams, since,
unlike software engineering teams that build large applications according
to a defined roadmap, data projects are often more dynamic and less
structured. If you already have some experiences with both scenarios,
what is your opinion on this?
Try to set up feedback sessions to discuss the efficiency of past meetings
and how they can be improved. You may like to research some advanced
feedback techniques beforehand, such as the Sandwich feedback tech-
nique, to help you prepare. Some experts propose a structured approach
for feedback sessions that requires everyone to follow specific behavior
and rules, such as that people must always focus on behavior and not on
personality. How much can a structured feedback process help to improve
the quality of feedback?
Imagine you want to give feedback to someone who outranks you. The
person did not ask for it, but you feel it is necessary, as their actions are
repeatedly frustrating a lot of people. How would you prepare and deliver
your feedback?
Create a data set that contains your personal information. Try to add be-
tween five and ten attributes of personal data to it. Then try to anonymize
it using the techniques presented here. Imagine you send your anonymized
data along with other anonymized data sets to people who know you well.
Think of ways these people might still be able to identify you, given this
data set. For example, you may have converted your address to a post-
code, your age to an age range, and your job to a job category. But what if
you’re the only 18-35 year old female person working in IT, in your small
town? You could still easily be recognized. What alternative data technique,
mentioned in this chapter, could be used as an improvement on ano-
nymization?
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
22.5 In a Nutshell 695
■ 22.5 In a Nutshell
16
James Clear
23
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
“From a business perspective, effective regulation and governance of AI are not only
an ethical obligation but also a strategic necessity for companies to build trust,
minimize risks, and ensure long-term success.”
Jeannette Gorzala
Important Note
This chapter does not constitute legal advice. It is an interpretation of the Arti-
ficial Intelligence Act (AI Act)2 applied to a certain set of artificial intelligence
technology as understood by the authors as at April 2024. Since this chapter
is intended as a high-level overview introduction, it is recommended to
seek legal advice in relation to specific use cases and AI systems that shall
be developed by providers or implemented by deployers.
1
Author of the part: AI and criminal liablity.
2
At the time of writing this chapter, the legislative text of the AI Act has already been adopted by the EU Parliament.
All references and this chapter are based on the text version adopted by the EU Parliament. However, linguistic
procedures are still ongoing and the legislative process in relation to the AI Act will be finalized only after the
endorsement of the EU Council. Consequently, there might be deviations in numbering, references or wording
compared to the finalized text version of the AI Act.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
698 23 Artificial Intelligence Act
■ 23.1 Introduction
The technologies summarized under the term ‘artificial intelligence’ (AI) are expected to
have a profound impact on our economies, societies, and environment. Given the rapid tech-
nological development providing for both, significant opportunities are emerging, but also
new risks. So the European Union (EU) initiated the process to lay down a harmonized
regulatory framework for certain AI use cases and applications.
As a first step, the European Commission appointed a group of experts to provide advice on
its AI strategy. On 8 April 2019, the High-Level Expert Group on AI (AI HLEG) presented its
Ethics Guidelines for Trustworthy Artificial Intelligence3 (Guidelines). Briefly summarized,
the Guidelines determined the following seven key principles for AI systems that should be
met in order for a system to be deemed trustworthy: (i) human agency and oversight, (ii)
technical robustness and safety, (iii) privacy and data governance, (iv) transparency, (v) di-
versity, non-discrimination and fairness, (vi) societal and environmental well-being, and
(vii) accountability. The Guidelines were of non-binding nature and could be implemented
on a voluntary basis.
Despite the existence of the Guidelines, failures of organizations using AI systems have led
to significant harm caused in the European Union and beyond. The database of AIAAIC4 (AI,
Algorithmic, and Automation Incidents and Controversies) tracks incidents related to the
unethical use of AI. According to the Artificial Intelligence Index Report of 2023 published
regularly by the Stanford Institute for Human-Centered Artificial Intelligence, the number
of AI incidents and controversies has increased by a multiple of 26 times since 2012 based
on the AIAAIC database.5
Responding to risks and challenges of AI, legislative action was needed to ensure a well-func-
tioning EU market for AI systems where both benefits and risks are adequately addressed.
On 21 April 2021, the European Commission put forward the worldwide first proposal for a
legally binding horizontal regulation of certain AI use cases to establish a common frame-
3
Ethics Guidelines for Trustworthy Artificial Intelligence, published by the High-Level Expert Group on AI on 8 April
2019. Available online under: https://ec.europa.eu/newsroom/dae/document.cfm?doc_id=60651 (31.03.2024).
4
AIAAIC (AI, Algorithmic, and Automation Incidents and Controversies) is an independent, non-partisan, public
interest initiative that examines and Executive Order on the Safe, Secure, and Trustworthy Development and Use
of Artificial Intelligence (31.03.2024).
5
Stanford Institute for Human-Centered Artificial Intelligence (2023) Artificial Intelligence Index Report 2023.
Available online under: https://aiindex.stanford.edu/wp-content/uploads/2023/04/HAI_AI-Index-Report_2023.
pdf (31.03.2024).
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
23.1 Introduction 699
work for the deployment and supply of AI systems—the Artificial Intelligence Act6 (AI Act).
A political agreement on the draft of the AI Act was reached on 8 December 2023 after his-
toric trialogue negotiations lasting for about 37 hours.7 In the EU Parliament’s plenary vote
on 13 March 2024 the final agreement was adopted with an overwhelming majority of 523
votes in favor, 46 votes against and 49 abstentions.8 Before the AI Act will enter into force,
the AI Act must still be endorsed by the European Council, which is expected for May or
June 2024. After publication in the Official Gazette, the provisions of the AI Act will enter
into force in a staggered regime, depending on the risk classification of the AI system, be-
tween 2024 and 2027.
The AI Act is designed as a European regulation. As such, and in accordance with the Treaty
on the Functioning of the European Union (TFEU) (Art 288 TFEU), the AI Act will be directly
applicable in all 27 EU Member States, meaning that it will come into force and be legally
binding without any additional implementing acts or action required on the part of Member
States.
The rise of AI, especially the developments driven by generative AI, have led to a global
conversation on AI regulation and AI governance. Selected significant global developments
in parallel to the AI Act in particular include the issuance of the Executive Order on the
Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence by US Presi-
dent Biden on 30 October 20239, the Bletchley Declaration10 of 1 November 2023 executed
by 27 countries and the EU upon attending the AI Safety Summit in the UK and the Hiro-
shima Process International Guiding Principles for Organizations Developing Advanced AI
system11 and the Hiroshima Process International Code of Conduct for Organizations Devel-
oping Advanced AI Systems12. The goal of the chapter is to provide an overview on the reg-
ulatory framework of the AI Act.
6
Proposal for a Regulation of the European Parliament and of the Council laying down harmonized rules on artificial
intelligence (Artificial Intelligence Act), COM(2021) 206 final 2021/0106 (COD). Available online under:
https://www.europarl.europa.eu/RegData/docs_autres_institutions/commission_europeenne/com/2021/
0206/COM_COM(2021)0206_EN.pdf (31.03.2024).
7
Press Statement by Statement by Commissioner Breton—The European AI Act is here! Available online under:
https://ec.europa.eu/commission/presscorner/detail/en/STATEMENT_23_6471 (31.03.2024).
8
European Parliament Press Release, Artificial Intelligence Act: MEPs adopt landmark law on 13 March 2024.
Available online under: https://www.europarl.europa.eu/news/en/press-room/20240308IPR19015/artificial-
intelligence-act-meps-adopt-landmark-law (31.03.2024).
9
Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence by US
President Biden dated 30 October 2023. Available online under: https://www.whitehouse.gov/briefing-room/
presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-
artificial-intelligence/ (31.03.2024).
10
The Bletchley Declaration by Countries Attending the AI Safety Summit, 1-2 November 2023. Available online
under: https://www.gov.uk/government/publications/ai-safety-summit-2023-the-bletchley-declaration/the-
bletchley-declaration-by-countries-attending-the-ai-safety-summit-1-2-november-2023 (31.03.2024).
11
Hiroshima Process International Guiding Principles for Organizations Developing Advanced AI system. Available
online under: https://www.mofa.go.jp/files/100573471.pdf (31.03.2024).
12
Hiroshima Process International Code of Conduct for Organizations Developing Advanced AI Systems. Available
online under: https://www.mofa.go.jp/files/100573473.pdf (31.03.2024).
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
700 23 Artificial Intelligence Act
13
Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection
of natural persons with regard to the processing of personal data and on the free movement of such data, and
repealing Directive 95/46/EC (General Data Protection Regulation, GDPR).
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
23.3 Scope and Purpose of the AI Act 701
specific objectives, and logic- and knowledge-based methods, which draw conclusions from
encoded knowledge or symbolic representations of tasks to be solved.
In an overall assessment, the legal definition of an AI system in the AI Act is very broad.
While the preamble suggests a focus on machine learning and logic- and knowledge-based
approaches, other techniques may also fall within its scope. As a result, each advanced
software or algorithm must be evaluated individually, potentially falling under the purview
of an AI system as per the AI Act.
Consequently, the obligations of the AI Act concern both providers of AI systems (e. g., a
developer of a CV-screening tool) and deployers of AI systems of high-risk AI systems (e. g.,
a bank buying this screening tool and implementing it in its organization). Also, the AI Act
takes into consideration the entire AI value chain, by also applying to importers of AI sys-
tems into the EU and distributors.
Importantly for academia and research, the AI Act does not apply to AI systems or AI mod-
els, including their output, specifically developed and put into service for the sole purpose
of scientific research and development. Additionally, the obligations of the AI Act do not
apply to any research, testing or development activity regarding AI systems or models prior
to their being placed on the market or put into service. However, despite being out of the
scope of the AI Act, such activities need to observe other applicable laws (e. g., data protec-
tion, intellectual property laws, etc.). As testing of AI systems in real world conditions is
concerned, the AI Act foresees a special regime and requirements and therefore covers this
specific aspect.
AI systems where and insofar they are placed on the market, put into service or used exclu-
sively for military and defense or national security purposes, regardless of the type of entity
carrying out those activities, are carved out of the scope of the AI Act.
The AI Act regulates AI systems in accordance with a risk-based approach, with four levels
of risk categories and corresponding obligations for AI systems as follows:
Prohibited AI practices: AI systems associated with unacceptable risk shall become
prohibited AI practices under the AI Act.
High-risk AI systems: AI systems associated with high risk shall be subject to extensive
compliance by AI system providers and deployers of such AI systems.
Medium risk AI systems: AI systems, associated with medium risk shall be subject to
certain transparency and disclosure requirements.
Low or no risk AI systems: AI systems associated with minimal or no risk shall not be
further regulated by the AI Act, however may follow voluntary codes of conduct and best
practices.
The European Commission depicts the framework and obligations of the AI Act as a pyra-
mid, with risk—and hence obligations—decreasing from top to bottom.
Additionally, the AI Act identifies risks specific to so-called general purpose AI models and
lays down requirements for providers of these models as well as certain transparency re-
quirements for their output. Similar to the definition of an AI system, the definition for
general purpose AI models is very broad, applying to models currently also referred to as
foundation models, generative AI and models with frontier capabilities (as outlined in more
detail in Section 23.4).
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
23.3 Scope and Purpose of the AI Act 703
A limited set of particularly harmful uses of AI that contravene European values due to vio-
lations fundamental rights shall be banned in the EU and shall be considered prohibited AI
practices (Art 5 AI Act).
Summarized, prohibited AI practices are the following:
a) Subliminal techniques: AI systems deploying subliminal techniques beyond a person’s
consciousness or purposefully manipulative or deceptive techniques, with the objective,
or the effect of, materially distorting the behavior of persons by appreciably impairing
their ability to make an informed decision, thereby causing a person to take a decision
that that person would not have otherwise taken in a manner that causes or is likely to
cause that person significant harm.
b) Exploitation of vulnerabilities: AI system exploiting vulnerabilities of persons due to
their age, disability, or a specific social or economic situation, with the objective, or the
effect, of materially distorting the behavior of persons in a manner causing or is reason-
ably likely causing persons’ significant harm.
c) Social scoring for public and private purposes: AI systems for the purpose of the
evaluation or classification of natural persons or groups of persons over a certain period
of time based on their social behavior or known, inferred or predicted personal or person-
ality characteristics, with the social score leading to either or both of the following:
14
Source: European Commission (https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai)
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
704 23 Artificial Intelligence Act
A limited number of AI systems defined in the AI Act, potentially creating an adverse impact
on people’s safety or their fundamental rights (as protected by the EU Charter of Fundamen-
tal Rights), are considered to be high-risk. The risk classification is based on the intended
purpose of the respective AI system, in line with the existing EU product safety legislation.
The classification of the risk depends on the specific function performed by the AI system
and on the specific purpose and modalities for which the system is used.
High-risk use cases defined in Annex III of the AI Act include:
a) Remote biometric identification systems, insofar not prohibited, biometric categoriza-
tion systems based on sensitive or protected attributes or characteristics and emotion
recognition systems.
15
AlgorithmWatch, Final EU negotiations: we need an AI Act that puts people first. Available online under:
https://algorithmwatch.org/en/final-eu-negotiations-on-ai-act/ (31.03.2024). AlgorithmWatch and 26 other civil
society organizations called the German government to stand up in the negotiations on the AI Act and advocate
for a strong ban on biometric surveillance as mirrored in its coalition treaty. The open letter is available under:
https://algorithmwatch.org/de/wp-content/uploads/2022/11/OffenerBrief_AlgorithmWatch_BReg_2022-11-
17_ENG.pdf (31.03.2024).
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
706 23 Artificial Intelligence Act
b) Certain critical infrastructures in the fields of road traffic and the supply of water, gas,
heating and electricity as well as critical digital infrastructure.
c) Education and vocational training, in particular used to determine access or admission
or to assign natural persons to educational and vocational training institutions at all
levels, for the evaluation of learning outcomes, for assessing the appropriate level of ed-
ucation that an individual will receive and used for monitoring and detecting prohibited
behavior of students during tests.
d) Employment, workers management and access to self-employment, in particular to
place targeted job advertisements, analyze and filter job applications, and to evaluate
candidates as well as intended to make decisions affecting the terms of work-related
relationships (e. g., promotion, termination).
e) Access to essential private and public services and benefits, in particular essential
public assistance benefits and services including healthcare and creditworthiness eval-
uation of natural persons as well as risk assessment and pricing in relation to life and
health insurance.
f) Evaluation and classification of emergency calls.
g) Certain systems used in the fields of law enforcement, in particular to assess natural
persons’ risks of becoming victims or criminal offenders, polygraphs and similar tools,
evidence evaluation, use cases of predictive policing and profiling.
h) Migration, asylum and border control, in particular polygraphs and similar tools, sys-
tems to assess risk, examine applications for asylum, visa or residence permits and for
detecting or recognizing national persons with the exception of verification of travel doc-
uments.
i) Administration of justice and democratic processes, in particular systems to be used
in researching and interpreting facts and the law and in applying the law to a concrete
set of facts.
j) Elections, in particular AI systems intended to be used for influencing the outcome of an
election or referendum or the voting behavior of natural persons in the exercise of their
vote in elections or referenda; however, not including AI systems to the output of which
natural persons are not directly exposed, such as tools used to organize, optimize or
structure political campaigns from an administrative or logistical point of view.
Systems on the high-risk list, that perform narrow procedural tasks, improve the result of
previous human activities, do not influence human decisions or do purely preparatory tasks
are not considered high-risk. However, an AI system shall always be considered high-risk if
it performs profiling of natural persons.
Additionally, to the high-risk areas defined in Annex III and summarized above, AI systems
and AI systems as safety components of products covered by sectorial EU legislation are
considered high risk (Annex II). They will always be considered high-risk when subject to
third-party conformity assessment under that sectorial legislation. The list of sectorial leg-
islations in Annex II includes 10 regulatory frameworks, in particular medical devices, in
vitro diagnostic medical devices, cableway installations, lift components, personal protec-
tive equipment and more.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
23.3 Scope and Purpose of the AI Act 707
High-risk AI systems will be subject to strict obligations before they can be put on the mar-
ket. Providers of high-risk AI systems in particular need to adhere to the following require-
ments in relation to their AI systems:
foreseeing adequate risk assessment and mitigation systems,
using high quality of the datasets to minimize risks and discriminatory outcomes,
providing for logging of activity to ensure traceability of results,
preparing detailed documentation providing all information necessary on the system and
its purpose for authorities to assess its compliance,
drawing up and providing clear and adequate information to the deployer,
foreseeing appropriate human oversight measures to minimize risk,
ensuring high level of robustness, security and accuracy.
Before placing a high-risk AI system on the EU market or otherwise putting it into service,
providers of high-risk AI systems must perform a conformity assessment. This will allow
them to demonstrate that their AI system complies with the mandatory requirements for
high-risk trustworthy AI under the AI Act. Regularly, with narrow exceptions, the confor-
mity assessment is completed by the provider. This assessment needs to be repeated if the
AI system or its purpose are substantially modified.
However, the AI Act does not only address AI system providers, but also professional users
of AI systems, ‘deployers’ in the terminology of the AI Act. Deployers need to adhere to the
implementation guidelines of the AI system by the provider and use and control the system
adequately.
High-risk AI systems that are deployed by public authorities or entities acting on their be-
half will have to be registered in a public EU database, unless those systems are used for law
enforcement and migration. The latter will have to be registered in a non-public part of the
database that will be only accessible to relevant supervisory authorities. Market surveil-
lance authorities will support post-market monitoring through audits and by offering pro-
viders the possibility to report on serious incidents or breaches of fundamental rights obli-
gations of which they have become aware.
For certain AI systems, specific transparency requirements are imposed in order to secure
that users are aware that they are interacting with machines (e. g. chatbots) or exposed to
synthetically generated content (Art 50 AI Act). The transparency obligations need to be
fulfilled by information provided in a clear and distinguishable manner at the latest at the
point in time of the first interaction or exposure.
In relation to AI systems that are intended to interact directly with natural persons, provid-
ers need to design and develop such AI systems in a way that natural persons are informed
that they are interacting with an AI system. This disclosure obligation does not apply in
cases it is obvious from the point of view of a natural person who is reasonably well-in-
formed, observant and circumspect, that it is interacting with a machine, taking into ac-
count the circumstances and the context of use. Additionally, this obligation does not apply
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
708 23 Artificial Intelligence Act
All other AI systems which do not fall within one of the three previous risk categories (pro-
hibited AI practice, high-risk AI, medium risk AI with transparency obligations) can be de-
veloped and used subject to the existing legislation without additional legal obligations.
Voluntarily, providers of those systems may choose to apply the requirements for trust-
worthy AI and adhere to voluntary codes of conduct.
The AI Act defines a general-purpose AI model (GPAIM) as “AI models, including where such
an AI model is trained with a large amount of data using self-supervision at scale, that displays
significant generality and is capable of competently performing a wide range of distinct tasks
regardless of the way the model is placed on the market and that can be integrated into a
variety of downstream systems or applications, except AI models that are used for research,
development or prototyping activities before they are released on the market” (Art 3 no 63 AI
Act).
GPAIM are regulated in the AI Act in a two-tier approach. The AI Act specifies
obligations for providers of GPAIM and
obligations for GPAIM with systemic risk.
The exemption for GPAIM released under open-source licenses does not apply to GPAIM
with systemic risks.
■ 23.6 Penalties
When AI systems are put on the market or in use that do not respect the requirements of
the Regulation, Member States will have to lay down effective, proportionate and dissuasive
penalties, including administrative fines, in relation to infringements and communicate
them to the European Commission.
The AI Act sets out the following thresholds that need to be taken into account (Art 99 AI
Act):
up to €35m or 7% of the total worldwide annual turnover of the preceding financial year
(whichever is higher) for infringements on prohibited practices or non-compliance related
to requirements on data;
up to €15m or 3% of the total worldwide annual turnover of the preceding financial year
for non-compliance with any of the other requirements or obligations of the Regulation,
including infringement of the rules on general-purpose AI models;
up to €7.5m or 1.5% of the total worldwide annual turnover of the preceding financial year
for the supply of incorrect, incomplete or misleading information to notified bodies and
national competent authorities in reply to a request.
For each category of infringement, the threshold would be the lower of the two amounts for
SMEs and the higher for other companies.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
712 23 Artificial Intelligence Act
16
Proposal for a Directive of the European Parliament and the Council on adapting non-contractual civil liability rules
to artificial intelligence (AI Liability Directive), COM/2022/496 final.
17
Proposal for a Directive of the European Parliament and of the Council on liability for defective products,
COM/2022/495 final.
18
See also: https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai (07.04.2024)
19
For example: No author mentioned, Cruise recalls all self-driving cars after grisly accident and California ban,
The Guardian, 8.11.2023, https://www.theguardian.com/technology/2023/nov/08/cruise-recall-self-driving-
cars-gm (as of 7.4.2024), Waters, Richard, Cruise’s driverless car accident underlines the risks of AI, Financial
Times, 9.11.2023, https://www.ft.com/content/345c1dfd-b08c-44b6-81b4-c0bf8a46a43f (07.04.2024)
20
For example: Atkinson, Emily, Man crushed to death by robot in South Korea, By BBC News, 8. 11.2023,
https://www.bbc.com/news/world-asia-67354709 (07.04.2024)
21
For example: Warnung vor Betrug mit KI, ORF, 18.11.2023 https://tirol.orf.at/stories/3232973/ (07.04.2024)
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
23.8 AI and Criminal Liability 713
This raises questions as to who is responsible by criminal law for the damage in such
cases,22 who is the offender and committed a crime for a negligent or intent malpractice,
which is causal for the damage, and has to be punished by a criminal court. But the most
important question is: how damages and thus criminal sanctions (as fines and imprison-
ment) can best be avoided.
22
Criminal liability must always be assessed in accordance with national criminal law. This article can only provide
principles and a general overview.
23
E. g., in Austria there is the Corporate Criminal Liability Act (Bundesgesetz über die Verantwortlichkeit von
Verbänden für Straftaten – Verbandsverantwortlichkeitsgesetz, VbVG).
24
See https://www.iso.org/standards.html (07.04.2024).
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
714 23 Artificial Intelligence Act
access to information and thus evidence in the proposal of the EU AI Liability Directive (see
Section 23.7 and FN 14), even if the latter only concerns non-contractual civil liability to
artificial intelligence, the evidence situation in criminal proceedings will also be signifi-
cantly improved. Objectively negligent, causal behavior will be specifically attributable to
individuals.
Only criminal responsible in case of negligent acting
negligent non-due-diligence
Usually, negligent misconduct is assumed in the development of AI. Intentionally causing
damage, i. e., at least considering damage to be possible and accepting it, is rarely the case.
Negligent means that the person acts from an objective perspective with non-due diligence.
This is the case if
a legal regulation (e. g. Data Protection Acts or other technical regulations Acts)
a standard of conduct, as ISO (see above FN 22) or DIN norms25, which is relevant for the
specific AI development or safety of the product, has not been complied with.
If there are no such regulations, which is often the case in innovative fields, then
a reasonable and prudent technician in this special field and his correct behavior must be
used as a standard of comparison.
Additionally, all predictable risks have to be avoided. Therefore, it is also contrary to due
diligence
if a risk/consequence assessment with regard to possible damage of the AI and the prod-
uct was not carried out accordingly.
Causality
In this context, the question arises as to how causality can be proven in such complex devel-
opment and product processes, particularly in the case of AIs with black-box components.
Only actions or omissions, which are (co-)causal for the damages can be sanctioned. As is
usually the case with innovative technologies with unknown causal processes, this is to be
determined according to a kind of natural or empirical law condition in criminal law. This
view is based on whether there is an explainable connection between an action and the re-
sult that occurs. In other words, if it is evident that a certain harmful result regularly occurs
when using an AI, then causation can be assumed.
sive documentation of all steps in the process of development and producing is utmost im-
portance.
In cases of (avoidable) damages during the development and tests or also through the prod-
uct launched on the market compliance offices have to elaborate strategies to avoid further
damages.
■ 23.9 In a Nutshell
AI in Different Industries
Stefan Papp, Mario Meir-Huber, Wolfgang Weidinger, Thomas Treml
“Consumer data will be the biggest differentiator in the next two to three years.
Whoever unlocks the reams of data and uses it strategically will win.”
Angela Ahrendts, Senior VP of Retail at Apple
In this chapter, we address how AI will permanently change the way we work and live. To
do this, it is necessary to first look at digital disruption, the stakeholders involved, and the
history.
Management consultants often speak of the “disruption of traditional business models” and
the “penetration of new markets.” What is meant by this is that AI will substantially change
the entire corporate culture in individual industries, including essential processes and how
we generate profits.
Many companies also view the digital transformation to a data-driven organization as a
company-wide paradigm shift. Instead of relying on personal experience and gut feeling,
decisions are made based on data evaluations and statistical models that enable predictions
of trends. In the final stage of expansion, companies automate processes within the frame-
work of prescriptive analytics. A key aspect in this field is that algorithms ‘make decisions’
themselves. A good reference example would be trading. Right now, most algorithms will
alert a broker about good deals, but it is still a human broker who makes the final decision
to buy or not to buy.
In the following, we will continuously refer to so-called ‘data programs.’ In a data program,
a program manager works with a team of data scientists, data architects, users, and all other
stakeholders to integrate the necessary infrastructure and processes into the company over
an extended time. Thus, it is not uncommon for this project to take several years.
Consulting firms help non-IT clients with the digital transformation and the migration to
becoming a data-driven enterprise. Typical clients can also be your bank, your favorite
grocery chain, or your energy provider. There is hardly an industry where there is no data
from which to extract value.
Sales strategies of consulting firms have heavily influenced data programs in these compa-
nies. Therefore, it is necessary to understand the approach of external consultants working
in these industries. The campaigns in which the external service providers outline how they
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
718 24 AI in Different Industries
intend to transform their clients into a data-driven organization in group-wide data pro-
grams will determine how the companies themselves perceive this measure.
Commercial software applications that generate value from data are almost as old as com-
puters themselves. Yet, the growth of large, international consulting firms and system inte-
grators over the past decades shows that this continues to be a profitable business model.
However, technicians often make the mistake of trying to sell technology at pre-sales meet-
ings. Not only are managers not interested in technical details, but discussions between
skeptics and proponents of new technologies also break out time and again in technology
meetings. Not infrequently, the goal is lost sight of altogether. Developers debate which
implementation details lead to the desired results, while other project participants are not
even consistently clear about the big picture.
Technology debates have always been held and will probably exist as long as
there are multiple ways to implement technical requirements. At first, skeptics
resisted any form of digitization. And once in a while, this resistance has to be
renewed.
“Computers? We have our beloved pen and paper method to make our
calculations; it’ll only crash if we make a paper airplane out of it. Computer!
That’ll never work.”
“Data warehouse? We don’t warehouse our data. We’ve built our own software
that does something with the data, too. Data warehouse! That’ll never work.”
“Data lake? In a lake, I go swimming, but I won’t drown my data there; I have
my warehouse. Well-proven since the 90s. Data lake! That will never work.”
“Cloud? Why should I put my data in a cloud? I have my lake, and anyway, my
data is more secure in a data center of my choice. Cloud! That’ll never work.”
“Data mesh? I may eat mashed potatoes, but I do not need a data mesh. That’ll
never work.”
To be fair, the skeptics are often the ones who have experienced the downside
of innovation. For example, if an overambitious IT manager orders a new
system, it can sometimes happen that the changeover causes more problems
than expected. Those who have to support the new system within the company,
and have to put up with frustrated users who are emotionally upset because
new problems keep occurring, also suffer as a result.
Accordingly, there is little technology in the PowerPoint slides of many consulting services
providers in the AI environment. Abstract terms such as ‘Big Data’ are made suitable for the
masses by salespeople repeating them like a prayer mill. It is essential to treat the terms
superficially and not delve too deeply into the technology. Distributed data processing via
open-source data platforms becomes ‘Big Data,’ applied statistics on data via deep learning
frameworks becomes ‘Smart Data’ or ‘Data Science,’ and streaming processing is touted as
‘Fast Data.’ Terms like ‘Big Data,’ ‘Advanced Analytics,’ ‘Machine Learning,’ and ‘AI’ are part
of sales campaigns. Since the terms are presented and interpreted differently depending on
the consulting firm, enterprises often have no uniform understanding of essential terms.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
24 AI in Different Industries 719
Ultimately, however, there are always arguments in favor of using new technologies. No one
can deny that paper cannot be retrieved centrally and that distributing copies of documents
to all employees by hand is not an efficient solution either. After the first wave of digitiza-
tion, manufacturers realized that they could sell targeted data-driven solutions that com-
pletely mapped a company’s data world. Non-IT companies gradually became dependent on
IT as they began to digitize their business processes.
Until 2010, business intelligence was considered a corporate area for analyzing data in an
organized manner and deriving decisions from it. The appropriate technology for this was
the enterprise data warehouse. Companies’ revenues increased with DWHs, but vendors’
revenues and prices also went up. Many customers complained about vendor lock-in, the
dependence on expensive system integrators, and large consulting companies with a repu-
tation for rebranding juniors as senior consultants by putting them in business suits.
It may have been the cost-cutting argument that initiated the shift to distributed open-
source platform solutions such as Apache Hadoop in 2010. The economic crisis of 2008 had
increased the pressure to reduce costs in many companies. In addition, technology advisors
praised commodity hardware and open-source as lower-cost alternatives to appliances. So
consultants and systems integrators created a new trend: They branded a comparatively
inexpensive open-source product called Apache Hadoop as the ultimate successor to expen-
sive enterprise data warehouses. An important argument for its use was the avoidance of
the vendor lock-in mentioned above.
But the promises of many consultants in business suits have not been entirely fulfilled.
Someone also wants to make money from open-source software, and, strictly speaking,
open-source offerings can often be identified as a sham. Many companies use the buzzword
‘open-source’ to sell hybrid products, that is, products that contain open-source components
and proprietary components, resulting in corresponding licensing costs.
Setting up a complete data platform consisting of hardware and software components is
complex, especially when policies require demanding service level agreements. Moreover,
in most cases, companies lack personnel who can ensure smooth operation. As a result,
functional (e.g., various services not running) and non-functional deficiencies (e.g., multiple
services having poor performance) occur frequently.
In addition, decision-makers should not underestimate the cost of permanently operating a
complex hardware and software solution. Someone on payroll must replace defective hard-
ware regularly, and maybe someone else must maintain and secure server rooms. Suppose
the distributors charge service level costs. In that case, questions arise about how much
money companies save when migrating from a DWH provided by a provider as a complete
package to a self-managed pseudo open source landscape.
Many strategy advisors perceive cloud-based solutions as the ideal way out of this dilemma.
With Amazon S3, for example, each gigabyte of storage cost two cents per month in 2018.
In addition, experts consider this storage to be almost permanently available, and the risk
of losing data is usually lower than hosting it in your own data center. Plus, you can roll out
all the popular data platforms, including Hadoop, in the cloud. And if a company no longer
needs various data services, it can cancel all cloud services at any time. In contrast, an
on-premise solution becomes obsolete; the purchased hardware will continue to appear on
the balance sheet.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
720 24 AI in Different Industries
After the DWH and the Hadoop hype, we now have the cloud hype. But this does not mean
that the other technologies are no longer relevant. First and foremost, the market is becom-
ing more diverse. Just because you can store data in object stores of cloud providers, which
is becoming cheaper and more accessible for companies, does not mean that other storage
technologies have become obsolete in principle.
Chapter 2 introduced polyglot data storage1, which describes having different technologies
for different data. However, some tasks are still best solved by traditional DWHs; for others,
a distributed file system fits better. Moreover, a company can host any data platform in the
cloud.
This trend is also evident in various industries. In some sectors, personal data protection is
paramount; in others, specially coded sensor data processing. The data solutions and re-
quirements of a bank differ significantly from companies that build machines.
This trend requires a change in thinking. In the past, system integrators often built an eco-
system around the system they offered. Then, data scientists and business analysts would
try to find use cases that would allow them to max out the provided hardware and software
solutions. This approach is being replaced by “use case first” thinking: At the beginning, the
project team defines a use case they want to implement to generate new business models
or optimize existing processes. Then, they select the technology that can best solve the
requirements defined by the use case.
In this chapter, we look at various industries with their typical analytical use cases to
understand how AI and data science can change business models in the typical representa-
tives of these industries. It is important to emphasize that we can only provide an overview
here and that there are many other, sometimes highly specialized, use cases besides those
described.
In the description of the use cases, we first provide an overview of the industry. In doing so,
we outline key challenges that a company can address using machine learning. The next
step is to envision how the industry may look in the future when it is thoroughly infused
with artificial intelligence. Finally, we also address what possible initial steps we can take
to approach this vision.
■ 24.1 Automotive
Modern cars are driving sensor data generators that produce terabytes of data during a
journey. Startups can monetize this data. Data enthusiasts even claim that vehicles will
generate more significant revenue from data than from the sale of vehicles themselves at
some point.
However, many car manufacturers still see themselves as mechanical engineers because
they could manufacture their products long before the first computers came onto the m
arket.
So while even the most die-hard car builders will admit that cars are no longer conceivable
without IT, mechanical engineering is still the focus for most. And since, in the end, it’s
1
https://en.wikipedia.org/wiki/Polyglot_(computing)
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
24.1 Automotive 721
a lways about money, IT and production become competitors when it comes to allocating
budgets.
Authors like Lawrence Burns describe how the automotive industry works2. They show how
suppliers work, how they work in production, and how research differs from other indus-
tries. Connoisseurs of the individual manufacturers claim that among the German carmak-
ers is the corporate culture so different that entire books could be filled with a comparison
of Volkswagen, BMW, Daimler, and Co.
In the automotive industry, plants often operate independently of headquarters. For exam-
ple, when the leaders plan a new project, the plant managers compete for the order. They act
as if they have to compete against other competitors, even though they belong to the same
company. This knowledge is essential for data experts because it explains why the cross-
plant data lake does not exist and why data availability also has a corporate policy dimen-
sion.
Some experts predict that the expertise and deep understanding of the underlying data
could reshuffle the cards and change market share in favor of those who have already
started to deal with Big Data earlier. We can also speculate that the use of electronic engines
will reduce complexity in cars. For example, an electric vehicle does not need a transmis-
sion, unlike cars with internal combustion engines. Suppose the cars become less complex
in their design. This change can strengthen IT teams, as the possibilities for qualitative
differentiation from the competition move from the “hardware” (the car) to the “software.”
Artificial intelligence and the collection and analysis of data will massively change the
automotive industry in the next 20 years. This transformation includes a lot of changes for
drivers. We will describe some of these changes in the following section.
24.1.1 Vision
You are sitting together with a good friend in the evening. Over a glass of wine, you dream
of Paris and the Champs-Élysées.
“Why don’t we have breakfast in Paris tomorrow?”, you ask.
“Yes,” your friend replies, “why not?”
With the spontaneous decision made, you get into your car of the future. In this one, there
is no more steering wheel and no more accelerator pedals. Instead, you have a fully self-
driving vehicle.
You don’t own the car; you rely on car sharing like everyone else. Additionally, this means
that cars are only parked when they are serviced or refueled. The goal of vehicle providers
is to generate an AI that guarantees a high utilization of cars.
Your car is arranged like a sleeper with a few specific selections that your AI has set up for
you. Finally, curtains closed, maybe a little bedtime movie based on your taste profile in the
car, and you fall asleep.
In the morning, Edith Piaf will wake you up.
2
Burns, L. D., Shulgan C.: Autonomy: The Quest to Build the Driverless Car – And How It Will Reshape Our World.
Ecco, 2018
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
722 24 AI in Different Industries
“Non, je ne regrette rien”. First, you enjoy the view of the Eiffel Tower. Then, the AI suggests
a bistro where you can have breakfast. But, before that, you stop at a new place that didn’t
exist until a few years ago, a kind of freshening station with a shower and bath.
While you shower, you look forward to the day. You can always get from A to B and C with
small self-driving autorickshaws. Since the AI knows your preferences, you can also let
yourself be surprised by where you are taken in Paris.
Lost in thought, you think of the past. How much everything has changed. You remember
that there used to be offenses like drunk driving and that many people died because of it.
That’s all history now. You know that your car collected data as you drove and passed it on
to various stations. Data scientists can use sensor data to determine the status of road sur-
faces, for example, and figure out when roadway resurfacing is necessary. But you don’t
hear about any of this. You just tell the car where to go, and it takes you there.
24.1.2 Data
Autonomous Driving
Self-driving cars are perhaps the best-known use case in a digital society. The idea that all
you have to do is get into the car and then be taken safely to your destination is on every-
one’s mind.
However, many “autonomous functionalities” are already part of today’s cars.
The fully autonomous driving car is the goal, and there are many milestones along the way.
Autonomous driving starts with parking aids, cruise control, and distance alarms. Then,
sensors detect hazards and react accordingly. Each new generation of cars will bring more
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
24.1 Automotive 723
autonomy, and the development up to the fully autonomous driving car will be a flowing
process.
Many of the industries described in this chapter will also be affected by autonomous d
riving,
significantly if it fundamentally changes the delivery of goods. We can argue that autono-
mous driving can have a similar impact on our society as the Internet and smartphones
before. Thus, it is not for nothing that giants like Google and Apple invest much money in
autonomous driving.
In this context, we recommend books like “How Autonomous Vehicles will Change the
World: Why self-driving car technology will usher in a new age of prosperity and disrup-
tion.”3
Car Quality
We can use masses of data for the production of cars. Load tests and, in particular, stress
tests are beneficial here. Stress means, for example, extreme weather conditions. Accord-
ingly, car testing often takes place in remote areas with extreme temperatures.
We can use sensor data to evaluate whether the test vehicles can cope with various extreme
situations and determine maximum load values. Production is thus accelerated, and the
quality of the goods produced is optimized.
Data Monetization
Your car is a moving all-rounder. The data you collect can be used, for example, to predict
traffic jams and measure changes in road surfaces. Sensor data from a journey can then be
sold to organizations responsible for running a transport network.
When we think of monetization, that is, “turning something into money,” the trend of car
sharing, in particular, becomes fascinating. Cars stand 90 % of the time. Understandably,
you can’t rent out your car while it’s parked at your workplace; how would you get home?
But what if we assume self-driving cars?
You could also rent out your car analogously to your apartment with an “Airbnb-like” service
through this use case. While sitting in the office, you earn money by having your autono-
mously driving car drive tourists. In the evening, of course, you are picked up by your car.
This use case also leads many market researchers to believe that car-sharing will increas-
ingly displace car ownership. But, strictly speaking, it doesn’t matter whether you are
driven by your car or a random car, as long as the vehicle is not a status symbol for you.
24.1.4 Challenges
The central challenges are societal. Self-driving cars will have a massive impact on our
society. Not only will they turn the entire transportation system upside down, but self-
driving cars can also upheave the entire delivery logistics industry, which can have a
massive impact on other industries. To understand this, you only need to ask yourself one
question: Would you have a purchased item delivered by taxi today? Unless it were urgent,
3
https://www.amazon.com/dp/1733328769
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
724 24 AI in Different Industries
you, like many other people, would probably be put off by the cost. However, autonomous
driving can reduce this cost factor to a fraction.
In the future, goods - unless drones deliver them - will simply be loaded into a car and sent
to a destination without a driver. As a result, freight forwarders will need fewer drivers. We
will discuss this in more detail in section 24.13. Likewise, the need for taxi drivers and
chauffeurs is likely to decrease significantly.
With the acceptance of self-driving cars, people are also more accepting than before to
surrendering sovereignty in survival-critical areas to a machine. Sure, we already trust
computers to run their applications correctly, and of course, there are computer systems
upon which lives depend, at least indirectly. But autonomous driving cars are a whole new
dimension here, as no one will deny that statistically, people often die in car traffic. By
accepting self-driving cars, we are in a sense declaring that we are more likely to trust a
machine with our lives than other people or ourselves. After this step, many people might
be more willing to trust machines with their lives in other areas as well. For example, what
would you say to a fully automated surgical procedure without a human specialist at the
scalpel? This list could go on and on with other instances where a computer could offer more
safety than a human.
From a technical point of view, autonomous driving is a challenge because producing data
costs energy, and the data transmission rate of mobile networks is not good enough every-
where to get the data to the destination in the required time. Autonomous driving is there-
fore strongly linked to energy and telecommunications. It must also be possible in more
remote areas, where the infrastructure is less developed than in a metropolitan area.
It is also essential that autonomous vehicles can cope with chaotic big city traffic. Presum-
ably, California’s highways are currently the busiest routes for autonomous driving cars, as
Tesla and Google have been testing their cars here for years. However, California road traffic
follows stricter rules than traffic in many other countries. For example, while you some-
times have no speed limits on German autobahns, you can expect restrictions of 65 miles
per hour in California, which is the equivalent of about 105 km/h. Also, general traffic
conditions differ in other countries. A car that can drive autonomously from Palo Alto to
Sunnyvale without any errors could run into problems in the traffic of Berlin, Delhi, or
Yerevan.
From a legal perspective, liability issues need to be clarified. For example, in a traffic
accident involving human drivers, there is usually a straightforward explanation of the
accident and who was at fault. The causes of accidents can be clarified even more quickly if
self-driving cars log their journeys in detail, but we also need a legal basis for proceeding in
the event of an accident.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
24.2 Aviation 725
■ 24.2 Aviation
We divide Aviation into the areas of airlines and airports. Of course, there are also overlaps,
as airlines are dependent on airports.
An airline’s job is to get passengers and cargo to their destination safely and as punctually
as possible. Punctual airlines have a better reputation, and this reputation can have a medi-
um-term impact on business success. Even if the price of tickets is a significant factor in
customers’ decisions to choose a flight, it can be costly for consumers if flights are delayed
and they miss a connecting flight as a result. Compensation does not always fully cover the
damage incurred. This fact means that lines that tend to be unpunctual also risk losing
price-conscious customers, even if they offer their flight at the cheapest conditions.
The expected flight experience can also be a decisive factor in choosing a flight, especially a
long-haul one. For example, the desired boarding service plays a significant role for many
passengers when selecting intercontinental flights. Many passengers also perceive substan-
tial differences in the services offered by airlines.
Airports play their part in ensuring that flights can meet their deadlines and that guests get
from A to B quickly. Many airports measure customer satisfaction and evaluate data to find
out how they can serve their passengers even better. A reference example is Vienna Airport,
where passengers can give feedback on cleanliness in the toilets and friendliness at the
security check. In this way, these airports develop a better and more consistent service
quality than airports that ignore such data.
In the following section, we will show how we can use data to improve Aviation services.
24.2.1 Vision
You remember the old days. You remember the moment when you boarded a flight with a
tight connection. While your travel companion wonders if the connection time isn’t too
close, you wave it off: “It’s okay!”, you return confidently, “they’re professionals.”
Then, shortly before boarding, the bad news: delay. No one can or wants to tell you what’s
going on. So you wait, restlessly, sitting as if on hot coals.
You finally get to board the plane after complaining long and hard, and you’re anxious for
every minute. However, you are annoyed by passengers who take their time taking their
seats, even on a delayed flight.
Until you reach the connecting flight gate, it remains unclear whether you will have to
negotiate an alternative flight at the transfer desk. A relaxed flight is out of the question,
especially if you miss necessary appointments due to delayed arrival.
You wipe that thought away and come back to the present in the future. You live in modern
times; you know that everything is regulated, everything runs correctly, and airlines have
become more flexible in getting their passengers to their destination. For example, com-
puter models generate an ideal flight load factor and ensure departure times are based on
actual demand. There is also an early warning system for all phenomena that can trigger
delays, and airlines react to these and thus improve their punctuality.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
726 24 AI in Different Industries
And then there’s another thing of the past: there used to be a time when the selection of
your neighbor depended on luck. Having to sit next to someone with whom you have no
chemistry for several hours can be frustrating. However, since the AI knows you and the
other guests, they will have no problem finding a seatmate that works for you.
24.2.2 Data
Turbines in aircraft generate terabytes of data per hour. Aircraft also collect position and
weather data supplied by satellites during the flight. These can be used in models to predict
expected delays.
Mathematical models also show how a machine must be fueled to bring the calculated total
load to its destination.
Data from social networks can also be insightful for airlines. The more the models know
about passengers’ preferences, the easier it will be to personalize the flight experience and
make it as pleasant as possible.
Personalization
You are on a long-haul flight. In the plane’s entertainment program, you’ll find movie sug-
gestions that match your preferences. You may even be able to continue movies you started
watching at home or an earlier flight, as the database of a streaming service for movies you
subscribe to is also accessible from the plane. An AI model tailored the meal plan to your
tastes. This choice was made possible by an AI model. This model has calculated which
menu best suits the guests. For example, if the AI detects a high proportion of vegans, the
menu takes this into account.
AI can optimize seat selection. With the appropriate data, AI can seat you next to people
with similar interests. For example, people who are interested in finance take seats next to
each other. A few rows behind, parents discuss child-rearing. Even for people who prefer not
to talk to anyone, the AI will find like-minded people. These choices also affect the satisfac-
tion of the flight personnel, as it can be very unpleasant for them to have to mediate disputes
between passengers.
Customer Satisfaction
At an airport, there are tons of systems that measure passenger satisfaction. This starts
with ratings of the restroom cleanliness and extends to the friendliness of the handling in
the security area. The evaluation of this data can increase the quality of the services offered.
When these satisfaction scores are aggregated and publicized, it attracts other guests.
Transit at an airport that is confusing and where staff is known to be unfriendly may deter
passengers and make them plan stopovers so that they only stop at airports that guarantee
minimum satisfaction. This step may have an even more significant impact when computers
suggest flight itineraries and passengers mark courteous treatment in flight as an essential
criterion.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
24.3 Energy 727
Delay prediction
Based on historical data and experience, it is often easy to predict which flights might be
delayed, but extensive data analysis helps refine such a model even further.
What are the weather conditions? What kind of passengers is boarding? The more data the
model has, the more accurately an expected delay can be determined, and corrective action
can be taken. And if not: the earlier that passengers are informed about delays, the less
hassle there is.
24.2.4 Challenges
■ 24.3 Energy
The liberalization of the electricity market led to high competitive pressure among energy
suppliers. Suddenly, companies that operated in a price-protected environment and lived by
producing electricity and selling it at regulated prices were confronted with the reality that
supply and demand can cause a variable price.
The energy sector consists of the four areas
Production,
Sale,
Storage and
Trade.
Data science is a common thread running through these four areas. For example, if we
produce too much electricity and do not find enough suitable customers, electricity must be
stored temporarily at a high cost or sold cheaply. Storage is expensive because storage power
plants have to be built and maintained, and electricity is also lost during storage.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
728 24 AI in Different Industries
So, suppose we can predict prices and demand and calculate the potential to reduce the
price of electricity production. In that case, we can minimize the storage of electricity and
thus be as efficient as possible.
24.3.1 Vision
Your self-driving car will take you past meadows and forests during your trip to Paris. The
air is pure. A few years ago, there were still vast factories of electricity producers here.
Today there is only green energy, and only what is needed is produced.
You once talked about this with a friend who explained that we could predict the energy
demand extremely accurately thanks to AI. The model analyzes historical data and can
consider other factors that affect electricity demand, such as weather and events. As a
result, producers align renewable energy production with meteorological forecasts. Since –
thanks to our AI – sun, wind, and water are predictable, we can match our entire production
with consumption.
Another friend works in the construction industry and proudly tells you how AI has
impacted there, too. When construction companies erect new buildings, they pay attention
to maximum energy efficiency. The data scientists at the architecture firms have now
optimized their models and leverage historical data to determine how buildings at different
locations are sustainably energy-efficient.
New passive buildings even give off energy under ideal weather conditions. A friend who
often travels for work raves that she not only earns money via Airbnb: The surplus energy
that the house generates via the photovoltaic system also brings her money. That’s why she
plans to buy a few old houses, modernize their energy budget and rent them out.
Then your car reports that a battery change is due. So your car drives to a battery change
station, which used to be a filling station when people still relied on fossil fuels. Robotic
arms remove the used battery from under your car and insert a fully charged battery within
seconds. You don’t even have to leave the car to do it.
Read “Bill Gates: How to avoid a climate Disaster”4 and ask yourself how to use
data science to understand Bill Gates’ vision.
24.3.2 Data
Energy supply companies want to know how high the expected consumption is. One source
for this is historical data collected from the corporations themselves.
Energy consumption is also strongly linked to the weather. A cold and early winter means
different consumption patterns. People have to heat more and spend more time in their
4
https://www.amazon.com/How-Avoid-Climate-Disaster-Breakthroughs-ebook/dp/B07YRY461Y/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
24.3 Energy 729
homes. On the other hand, a good season with lots of snowfall for winter sports regions
implies that hotels are fully booked and ski resorts are fully operational. Accordingly, the
energy demand in the area is higher.
Analytical models can use historical data to predict a region’s likely energy needs when fed
with current weather data. But of course, you also want to know how much energy is gener-
ated from photovoltaics or wind power and whether energy production should possibly be
boosted elsewhere.
We can derive the energy demand in a region also from other factors. For example, if there
are events in a city that attracts visitors, more energy is consumed.
Dynamic Pricing
Data models calculate the expected electricity price. Accordingly, production is throttled or
cranked up to produce the ideal amount for sale. The reason is that those who make sur-
pluses must temporarily store the energy produced. When energy is stored, energy is lost.
Moreover, the process is not cost-effective.
So the more accurately an electricity provider can predict how much energy will be needed
based on human behavior, weather data, historical data, and other influences, the more we
can optimize production and pricing.
Predictive Maintenance
Predictive maintenance means identifying which maintenance is required at any point in
time to keep equipment in good working order. Typical measurable characteristics are signs
of wear and tear. For example, wind turbines or photovoltaic systems that are out of opera-
tion because they are defective are things that a utility provider wants to prevent.
Suppose we can use sensor data to predict which components of an energy producer could
fail imminently. In that case, we can take preventive action and avoid failures.
We can model predictive maintenance using linear regression. For example, we can collect
measurement data over time and label each record as to whether or not it was associated
with a failure. If you then match live data with a regression model, you can identify problem
cases.
Weather Forecast
Weather forecasts are more important for energy providers. Weather defines the output of
renewable energy. The more sun, the more production of photovoltaics. The more wind, the
more production of wind energy. Weather forecasts also help us to predict the energy
demand.
An energy provider who can predict the weather well can also throttle energy production
through fossil fuel. This measure is one step towards carbon reduction as defined in the
Paris Agreement.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
730 24 AI in Different Industries
24.3.4 Challenges
The energy market is under pressure because private consumers can choose their suppliers
after deregulation of the electricity market. Unfortunately, many producers were not pre-
pared for this development.
The energy industry is one of the industries in which little IT innovation was necessary in
the past. For example, in a non-liberalized market, a supplier’s employees could manually
read out metering data in households and later record the evaluations in an application.
Accordingly, there was little pressure to replace the old electricity meters with smart
meters. However, new competitors who are not afraid of AI are working with data science
models that can help them produce electricity more efficiently.
Of course, we must not ignore data protection. For example, is an electricity provider a llowed
to determine a consumer’s regular consumption, or does it violate the individual’s privacy?
The fact that it is possible to determine whether people have been away from home for a
long time is also particularly explosive. This information would be an invitation for b
urglars.
AI and Energy are connected with sustainability goals to reduce carbon emissions. However,
power stations are expensive to build, it takes years to get a return on investment, and they
can be operative for decades. So while we optimize and balance the output of renewable
energy sources, some countries might not want to take their coal-fired power stations off the
grid. The book mentioned above of Bill Gates addresses this and more of the problems in
more detail.
■ 24.4 Finance
Many financial institutions, banks, stock exchanges, and insurance companies pump vast
investment money into their data platforms. In principle, the banks’ business is to assess
risks in order to be able to offer lucrative interest rates. Additionally, they also make money
from ATMs, currency exchanges, and card fees. Data scientists can use the information ex-
tracted from data to optimize these revenue streams.
The business model of financial institutions differs in many cases from other industries.
Authorities and financial institutions impose regulatory requirements on banks. Unfortu-
nately, this means a lot of bureaucracy, which can lead an innovation-enthusiastic IT expert
to the brink of despair.
24.4.1 Vision
At home, you look at relics of the past, all arranged in their display case: A few books are
there, an old telephone with a dial, and a wallet.
Contactless technologies mean you no longer have to put cash on the table. Instead, we can
use NFC chips in many stores to register who has taken which product from the shelves and
then deduct the amount as soon as the customer leaves.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
24.4 Finance 731
So the bank of the future only offers online services; branches with counters have long since
ceased to exist. Instead of going to the bank, you meet with your advisor in a coffee house if
you want a personal conversation. But many also conduct banking transactions virtually.
24.4.2 Data
Banks usually have a lot of structured data, as data is often generated via transactions,
which are always structured. A transaction in a technical sense means that operations are
carried out entirely or not at all. If a bank customer withdraws money from an ATM, a data-
base entry stores all information about this withdrawal.
Unstructured data is sometimes found in risk assessments, for example, when data analysis
uses social media data.
In principle, banks know about their customers by analyzing financial transaction data and
matching it with demographic data.
Fraud Detection
Fraud detection, the recognition of fraudulent financial activities, can best be explained
using credit card transactions. This use case is probably the prime example of how linear
regression is applied in a bank, as fraud attempts repeatedly happen, especially when credit
or ATM card data has been stolen.
Imagine that a customer’s behavior, let’s take an 80-year-old retiree as a reference example,
suddenly deviates from his usual actions. Instead of going to the grocery store around the
corner, as usual, he suddenly uses his card abroad to buy exotic items and send them to
addresses that do not correspond to his home address. How likely would you find it that
someone here would misuse the retiree’s card? And please also consider that an elderly
gentleman might not immediately notice this misuse.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
732 24 AI in Different Industries
Credit card fraud detection works via so-called balanced scorecards, in which each attribute
of a transaction is evaluated individually for deviations. Actions that correspond to normal
behavior are risk-free. Various characteristics of each transaction, such as a new IP address,
new delivery address, an unusual product category, or payment deficiency, are all assessed
for risk. Similarly, algorithms cross-check historical data to include historical fraud in the
risk assessment. For example, transactions made from countries with more fraud attempts
in the past will be factored into the model.
Adding all factors together, we receive a fraud probability. If we exceed a threshold value,
the card is blocked. A customer service representative contacts the customer to clarify the
situation.
Customer Classification
We addressed classifying customers in the loan request prediction section. However, in
addition to “housebuilders,” other groups of people can also be classified. One factor could
be risk tolerance. Some people use securities to speculate; others are incredibly risk-averse.
We may identify some clients as potential entrepreneurs or career-minded individuals.
As result, these risk-averse clients may be assigned to a different customer service repre-
sentative than modest individuals who prefer a stable lifestyle.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
24.5 Health 733
24.4.4 Challenges
Fintechs are increasingly competing with traditional financial institutions. They don’t have
branches, and they don’t have to manage legacy IT architectures that still run applications
programmed in Cobol. Fintechs are correspondingly agile and better able to leverage data
science, bringing better service quality and ultimately leading to new and robust competi-
tion for traditional banks. There is a consensus that there is no such thing as “too big to fail.”
Even the big ones are gambling with their existence if they ignore AI and Data Science.
However, other financial institutions don’t see it that way, and some predict that market
shares could change in the next few years.
In the banking environment, it is essential to comply with the regulations imposed by the
legislator. These requirements make it necessary for every step in a project to be approved
by several departments. Thus, a bank is often forced to follow the waterfall model, which
requires analysts to specify software functionalities in advance. Only when security experts
and regulatory experts have approved the specification can the software be put into opera-
tion. Furthermore, since an audit may occur at any time, the documentation must always be
kept in a financial auditor’s representable status.
The regulatory requirements are often cited as a reason why working as an IT expert in a
bank is long and tedious. Instead of developing and expanding software in an agile manner,
an IT expert here is often severely restricted in her scope of action.
■ 24.5 Health
When we think of health (in the sense of the term “health industry”), several areas are
relevant. These include the pharmaceutical industry, medical research, medical profession-
als, hospitals, and other organizations to keep people and animals healthy. Veterinary med-
icine should not go unmentioned either, as statistically, the number of pet owners is increas-
ing, and they spend a lot of money on the well-being of their charges.
There is a lot of money to be made in the health industry. In Maslow’s pyramid of needs,
maintaining health is a fundamental building block. Accordingly, we spend a lot of money
on health.
Opinions differ about the use of IT in medicine. It is clear to everyone that working with
medical devices that are not connected to computers is no longer possible. Still, many ask
themselves how the stored data is handled and whether it is good to let data scientists loose
on medical data.
Data Science can improve efficiency in all healthcare institutions. In the following sections,
we will show that there are numerous application examples. We will also explain why Data
Science in medicine can trigger a snowball effect, from which perhaps one of the largest
application areas of AI and data-driven innovation can emerge.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
734 24 AI in Different Industries
24.5.1 Vision
You are reading about Prometheus5, the world’s largest healthcare industry data project. For
years, zettabytes of information were collected during voluntary long-term test series, med-
ical examinations, and many other channels and stored in the cloud. The dimension of these
investigations is unimaginable.
Scientists have studied all the details and gained comprehensive knowledge about which
factors influence a person’s life. Finally, there are statistically provable facts about which
foods are healthy and how a person must live to stay fit and vital.
Exercise
Play through a complete process of a hospital visit and try to split them
into atomic operations from which architects can then design a microservice
architecture. Can you estimate how many different services we would need
to implement?
Health care is individual. We can measure genetic prerequisites and derive health plans
from the results. You collect all data about yourself via various devices connected to your
body. These devices recognize your ideal biometric values and corresponding deviations
from them. You can also instruct your AI assistant to prepare nutritional suggestions that
are tailored to your individual needs, taking into account, for example, a deficiency of par-
ticular vitamins or minerals.
There is an early warning system that alerts people that they are risking harm via their
lifestyle. And who doesn’t take action when a system signals the risk of a heart attack? In
addition, diseases such as diabetes and other ailments that can accompany a person for the
rest of their lives are also prevented or delayed in this way.
The visit to the doctor is also more efficient. There is now a prescreening, and the order of
patients is better managed. In addition, it is possible to make doctor consultations in a dvance
on the web. In addition, you can be examined with the help of virtual reality technology, and
only in exceptional cases do you have to be physically present.
While out for a walk, you are overtaken by a jogger you have known from childhood. He had
both legs amputated after an accident. But thanks to AI, new prostheses have been created
that many see as a symbiosis between humans and technology. No healthy person would
even be able to come close to keeping up with a runner using prosthetics. Some people are
already planning to replace their biological limbs with prosthetic limbs, and the perfor-
mance of artificial organs is also beginning to exceed that of natural organs. In the (fictional)
“Bio-Tech Fusion Handbook,” you can read about other trends that sound incredible.
5
This is a fictitious project.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
24.5 Health 735
24.5.2 Data
Patient data is probably the most protective and valuable information there is. If all data on
all patients and their lifestyles were available, we could make extreme progress in health
research.
The human body provides vast amounts of analyzable data: Factors such as pulse, blood
values, body fat values, blood sugar levels, and the quantities of various hormones are only
part of a collection of many other values only known to medical experts. If this data is con-
stantly measured and evaluated, it could trigger medical revolutions.
Complaint Reduction
We can measure hospital satisfaction. For example, patients can provide feedback about the
food, the cleanliness of the toilets, and the staff’s friendliness. Patient satisfaction managers
explore the input and determine where action is needed. For example, customer happiness
departments can develop programs to improve performance for staff who receive poor
ratings.
Competitive Sports
Many athletes already measure their biometric indicators, such as pulse, and try to optimize
them. Coaches can use the data to determine which values their client needs to optimize to
achieve top performance and determine which factors impact those values. From this infor-
mation, trainers create new diets and training plans.
We can use AI to detect injuries and signs of wear and tear at an early stage. There must
presumably be an extremely high willingness to evaluate data in competitive sports since
performance improvement is often linked to financial incentives.
24.5.4 Challenges
As is often the case, where there is the most benefit, there is also the most risk. Concerns
about data confidentiality are perhaps nowhere more vital than in the healthcare sector.
Imagine you are laid off. You find out after some time that your employer has gained access
to your health records. It turns out that your data revealed a susceptibility to contracting a
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
736 24 AI in Different Industries
genetic disease. Other potential employers also get access to the data. For many, this is a
horror scenario.
The question of the protection of patient data is an important issue. We should analyze data
only if people give their explicit consent. However, the question then arises as to whether
we are getting enough data that will help us to gain new insights.
Whichever way you view it, one thing remains true: Nothing is more precious to people than
their own lives. So if we can use data science to find out how to improve the quality of life,
humanity will benefit.
When writing this book, the majority may see dangers rather than benefits in exploring
health data. The Automotive passage outlined a possible paradigm change once people start
trusting machines more than other humans. In the Middle Ages and Renaissance, alche-
mists were looking for ways to reverse aging and extend life. Once the amount of people who
believe in the value of data exploration reaches a certain threshold, health data processing
could become the topic of the biggest ‘data rush’ in history.
■ 24.6 Government
Many people immediately think of Big Brother when they think of data and government.
And when you look at the structures that intelligence agencies have built up, it’s not en-
tirely absurd. If you were to rank the users of Big Data solutions, intelligence agencies
would probably be somewhere near the top.
Elections are also sometimes influenced by Big Data analyses. This issue came to promi-
nence through the British ‘Brexit’ referendum and the election of Donald Trump in 2016.
Election campaigners who have access to user profiles via social media such as Facebook
and Co can analyze their preferences and target these people with the appropriate topics.
24.6.1 Vision
24.6.2 Data
A lot of data in the public sector is structured and sensitive. Your tax file and your entry in
the central register of residents, for example, are areas that you do not want unauthorized
persons to see.
Public institutions sometimes make their data open-source. San Francisco is a good exam-
ple, whose data is a popular source for data scientists. For example, users can find data on
registered businesses’ locations, licensed food stands, and criminal incidents.
The biggest Big Data user is probably the intelligence service. Here, the data sources are
almost inexhaustible. Every form of monitored communication can also be evaluated and
analyzed. Social networks such as Facebook and telephone data that must be searched using
deep package inspection present new “volume challenges” for data evaluation due to data
increase every year.
Tax Fraud
One issue in the public sector is tax fraud. This scenario includes many well-known fraud
scenarios such as the missing trader fraud.6 Data analytics can help finance ministries track
down companies and citizens who evade taxes.
6
https://en.wikipedia.org/wiki/Missing_trader_fraud
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
738 24 AI in Different Industries
perpetrators live (geographic profiling). This method is not new in forensic science, but it
is greatly improved by the possibility of including topographical data and street layouts
in modern Geographic Information Systems (GIS). The processing of the results in such
systems also increases the usefulness for law enforcement authorities. Chapter 27 covers
ethical questions in detail.
In the fight against organized crime, modern data technologies also help to improve existing
methods. For example, monitoring these organizations’ communications and financial flows
allows conclusions about their social structure and economic network. However, in contrast
to the intelligence mentioned above, monitoring communications are explicitly targeted at
suspicious organizations or individuals. Above all, the linking of data across national
borders is of particular importance for this area.
A specific feature is the recording of security cameras in public places. Although their
actual function is to prosecute crimes that have already been committed, they are installed
in practice mainly for crime prevention at certain hotspots.
Social media platforms are both an opportunity and a challenge for security authorities. The
analysis of user and usage data provides new opportunities for law enforcement. But, on the
other hand, these virtual spaces are themselves the scene of real crimes, which are chal-
lenging to get to grips with using traditional criminalistic methods and tie up personnel,
who are in turn lacking in public spaces.
Traffic Management
Every driver knows the scenario: you are driving comfortably in the city and suddenly see
a red light. As a conscientious driver, naturally, you stop. You look to the left, to the right,
and to the front. The road is clear, and you waste valuable time waiting there.
Now imagine a system in which traffic light control is automated on-demand. You are cruis-
ing towards the intersection. Since you are the only one at the intersection, the signal is not
red, and you can proceed unimpeded.
Smart Cities
In the Smart Cities use case, many things that have already been mentioned or will be
mentioned in this chapter become visible. Autonomous vehicles will change city streets in
the coming years, just as cityscapes are already changing due to the transformation of
commerce. There will have to be new ways of meeting the energy needs of smart cities and
distributing energy efficiently. Art, culture, and media, while becoming increasingly digital,
are still being created in communities that need a space to implement new ideas.
We can divide smart city concepts into two opposing planning approaches. Prominently
represented in the media are the mainly technology-centered top-down initiatives driven by
infrastructure and product providers. The promise here is to optimize the overall city sys-
tem by collecting and processing information about its current state and making it available
to decision-makers in real-time in so-called city dashboards. However, citizens often also
have access to it to adapt their behavior to changes (e.g., in public transport). Singapore is a
prime example of this approach.
In contrast, the second approach aims to connect citizens through ICT (Information and
Communications Technology), enable participation in urban policy-making, and foster local
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
24.6 Government 739
communities. The focus of these initiatives is usually not on the whole city but limited to
interventions in smaller sub-areas. Currently, Vienna is a good and internationally re-
spected example of such a strategy.
For urban planning and development, the application of ICT and data technologies results
in a paradigm shift. Decisions that were previously based on expert opinion can now finally
be evidence-based. Of course, this does not mean that we can replace the mayor of a city
with artificial intelligence. But planning and, above all, control tasks at a low level will be
based on data and automated.
Mobility is probably the topic with the most significant potential for effective smart city
initiatives. Of course, the aforementioned autonomous driving will fundamentally change
individual transport. But public transport will also be intelligently controlled and adapt
more than before to the temporal rhythm of the citizens of a smart city.
Other core areas of the development towards a technical smart city are energy and water
supply. Again, the introduction of data-driven planning and intelligent control will lead to
more careful use of resources.
The backbone for these new developments in the city will, as always, be the infrastructure
provided. On the one hand, we must modernize the existing infrastructure in transport,
electricity and water. This modernization means making its components more measurable
and thus controllable. On the other hand, however, we must create an urban data infrastruc-
ture in parallel, which combines all the smart city-data.
Hazard Identification
Even if official statistics say that violent crime is decreasing, there will probably never be a
non-violent society. Therefore, many cities monitor public places via CCTV (Closed Circuit
Television). For example, in the UK there are cameras in public transport.
How much surveillance we need is a matter of debate. Still, presumably, there is a consen-
sus that we can use video to record crimes and that video recording can deter people from
committing crimes. But what if CCTV films someone pulling a gun? With many video
streams, we can use the material only for “post-mortem” analysis. In other words, authori-
ties analyze the data after the crime has already been committed, as there is not enough
staff to monitor the videos continuously.
There are solutions already that utilize image recognition to detect weapons. However, in
more advanced scenarios, threat detection solutions might include body language and
spoken words to assess a dangerous situation.
National Defence/Military
We have already described in threat detection how the state can benefit from homeland
security. Object and threat detection are, of course, also crucial in the military. For example,
if airspace surveillance identifies an object as dangerous, it can be intercepted.
It is tough when there are “false positives” here. In the 1980s, there was a missile warning
in the Soviet Union, but it later turned out to be a false alarm. We have one person to thank
for the fact that no cruise missiles were fired.
Each military sector, from airspace security to the navy, has its use cases, the detailed
analysis of which is beyond the scope of this book.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
740 24 AI in Different Industries
Civil Protection
Professional data evaluation during disasters can save lives. For example, let’s assume it has
been raining or snowing for days. Knowing where to expect avalanches or mudslides means
being able to cordon off regions and save lives.
We can find hazard zones using historical data or by matching satellite imagery. Unfortu-
nately, in the course of climate change, there will always be new requirements in this area.
24.6.4 Challenges
We can divide analytics for governments into two areas: intelligence and citizen services.
Intelligence is the area that makes data processing a danger for many because the bottom
line is that it is nothing more than surveillance. But, of course, the buzzwords ‘counter-
terrorism’ and ‘security’ can be used to gloss over many things. Still, it is also legitimate to
ask what impact citizen surveillance has on those citizens’ lives.
Perhaps the most significant challenge is transparency. It’s no secret that Amazon has its
gov cloud, available only to governments. Nor is it a secret that the NSA has vast data cen-
ters of its own. Yet, whistleblowers like Edward Snowden or Chelsea Manning have shown
that governments deal with PII data without restrictions. Of course, one can argue that
there is existing law and that it also applies to states. But there are enough regulations that
undermine established law in the name of security.
Digitalization can facilitate direct democracy, but we should question the maturity of citi-
zens themselves. Does it make sense for people to vote on a whim when they don’t know
what’s at stake? We described data science as a method to reduce ourselves to facts and
include less emotion in the decision-making process. But what if the research is about
which topics stir up emotions most?
How much does the promise of empowering the population also open the door to their
manipulation? If we know about the people we want to influence, we also know what we
have to say to make them like us. So then, we should not be surprised when political parties
specifically adapt the messages to their voters and populism in every form spreads in
politics. There are still many open questions here about how politics can be responsibly
digitized.
■ 24.7 Art
Art is perhaps one of the areas where many readers would least expect to find applications
of artificial intelligence. After all, art is considered by many to be the domain of human
creativity. Thus, it is hard to imagine a computer creating work as expressive as a Mona Lisa
or an epic like Lord of the Rings.
However, art offers some application areas for artificial intelligence, such as creation and
trading. For example, patrons make a living selling art and promoting artists. Thus, we can
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
24.7 Art 741
use AI applications to estimate prices of artworks via regression models and possibly also
to discover new talents whose works match current art trends.
Artists are also using AI themselves to create art. And Microsoft has published Drawing-IT,
a system that links art to AI. Thus, the connection between AI and art is not as far-fetched
as it first appears.
24.7.1 Vision
You are redecorating your apartment. You notice that a picture is missing somewhere:
“Alexa,” you say, “I’d like to have a picture there that brings a little more life into the room.”
Alexa, or, more accurately, the digital assistant called Alexa, asks you a few questions. Then,
it shows you a few examples of different artists and gathers information on your preferences
little by little. Finally, an order goes to the 3D printer, and you are surprised to see a result
that seems to match your taste. Ultimately, you express yourself through the pictures you
hang or the music playing when a guest enters your home. All of this is part of how you
show yourself to the outside world.
You are talking to a friend who has written a book. Artistic creation has also changed in this
area. Your friend has outlined the book and had parts of it written by an AI, enabling him to
make his book multi-dimensional. The interactive plot means different plot alternatives,
which the writing AI adapts to the reader’s preferences. Without AI, this step would not
have been possible on a large scale.
24.7.2 Data
What data do we need to evaluate artworks or have artworks produced by an AI? Raw data
for this is images, videos, and music. The more art data you feed into an AI, the more you
can apply regression and classification algorithms.
We can analyze people’s genuine reactions when they see a work of art. This data can come
from video streams, for example. Algorithms determine people’s moods based on the facial
expressions that are recognizable in the image data.
It is also helpful to have profiles of art buyers. For example, if we know which artworks
people have bought in the past, we can also use this data to determine which paintings the
customer might like in the future using recommendation engines.
Price Calculation
We want to determine what price a work of art can fetch by analyzing data. As input values,
we have artworks, demographic data, and historical data from auctions. Then, via machine
learning algorithms, we calculate the possible price the artwork can achieve.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
742 24 AI in Different Industries
Artificial Muse
Many artists are experimenting with linking AI and their work. One example of such a
group of artists is the project ‘artificial muse.7
An artist paints pictures, and an AI engine analyzes these pictures via deep learning
algorithms and, based on the evaluations, tries to inspire the artist.
Although it will be some time before artificial intelligence can write complete novels on its
own, an AI engine is already good at recommending style elements via linear regression
that can underscore a plotline. For example, an author is working on a scary novel and
wants to describe a house of horrors. An analytical model can then suggest what details the
author should refer to in the process. It can also identify which stylistic elements are
currently popular in literature and would resonate well.
Reaction Measurements
Art is often also about triggering and sometimes provoking reactions in the viewer. Some-
times, it is difficult to predict the effect of artists’ work in public space, for example. How-
ever, the responses to art in public space can be measured explicitly by analyzing the
viewers’ behavior via video streams.
24.7.4 Challenges
Presenting art in the same breath as data science may sound provocative, even for a field
that sometimes draws attention to itself through provocation. After all, art is considered the
last bastion of human creativity. For many, it may be inconceivable that AI should replace
geniuses such as Picasso, Goethe or Mozart in the future.
However, technology and art have long been growing together in particular areas. Many
artists use technology to incorporate it into their work.
■ 24.8 Manufacturing
When we talk about Manufacturing or Production, we talk about the industries that produce
mass products in factories. So we use the term ‘manufacturing’ to talk about the manufac-
turing industry itself, and with the term ‘production,’ we reference the mass production
process.
Automation has always been an essential part of production. Production managers want to
ensure that they reduce manual operations, increasing quality and saving costs.
We can use data in manufacturing to improve production processes. Of course, companies
will continue to exist for the mass production of products. But if we look at the innovation
thrusts in 3D printing, for example, we see that this technology will also be suitable for
7
www.artificialmuse.ai
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
24.8 Manufacturing 743
widespread, individual usage at some point. Private individuals will then be able to print at
least simple objects themselves.
24.8.1 Vision
You are on your way back from Paris when you realize you have forgotten to buy a souvenir.
You had intended to buy a coffee cup set, so you let your AI assistant know. He takes care of
it.
Once home, a surprise awaits you. You live in a time when part of the production becomes
local again. The producer of the coffee cup set no longer sells the product but releases the
data for individual products in exchange for payment. A 3D printer at your home has printed
the coffee cup set for you in excellent quality. That we will be able to print various household
items at some point is beyond doubt. The question that arises is what level of complexity 3D
printers will master and by when.
In mass production, on the other hand, automation has progressed so far that hardly any
people perform manual work steps anymore. The world of data has supported automation
here. Machine learning and co make it possible to configure the bills of materials so
that there are no more inventories. This efficiency has a beneficial impact on resource
consumption.
24.8.2 Data
Sensor data from machines often generate several terabytes of data per hour. Frequently,
this production data is specially coded and differs significantly from human-readable data.
We cannot move all real-time generated data to the cloud in a reasonable time. Often the
throughput rate is too low. Imagine a factory, perhaps not necessarily located next to a back-
bone to a cloud provider, producing terabytes of data. That data needs to be able to be read.
Hadoop may have lost its popularity in recent years, but it is still suitable for local bulk data
stores where data needs to be written quickly.
Manufacturing companies are also heavily dependent on commodity prices. What data do
we need for calculation models on price developments? We need data on the situation in
countries that supply these raw materials. If we can calculate the impact of political changes
on commodity prices, we can decide whether or not to buy.
If our suppliers operate in countries with a tense political situation, we also want to know
about the likelihood of political turbulence. For this purpose, personal data on influential
people in a country can sometimes be interesting. For example, suppose a dictator rules a
country. In that case, it can make sense to apply Natural Language Processing to text data
crawled from news and social media to determine whether the situation has changed.
At the same time, we can see what sales we had in the past from our historical data. We can
build models through which we compare the data from the past with the factors of the pres-
ent and thus determine the probable sales through regression and adjust our production
accordingly.
Quality Improvement
Reading and evaluating sensor data on production lines is a typical machine learning use
case. We can measure every step from raw material emulsion to heating, forming, punching,
and quality control with sensors.
We can link sensor data to production waste via regression. If, for example, a certain num-
ber of increased sensor values can be connected to a higher proportion of defective goods,
we can initiate corrective measures.
You can thus minimize a scrap of your production, and sometimes you can also determine
how individual phases in production affect others. This quality optimization can help com-
panies in countries with higher wage levels to remain competitive.
Exercise
Have an excursion to a production mill and take a tour. While they explain the
whole production pipeline to produce goods, try to imagine how you could
leverage the data collected by sensors. How could you, for instance, try to
detect rejects in a pipeline?
24.8.4 Challenges
The quality of the data coming from sensors can sometimes be poor. Many machine builders
have not optimized their data production for data science applications. In addition, there are
sometimes misalignments between IT departments in manufacturing and the production
managers. Sometimes production doesn’t want to be told by IT how to do its job. We cannot
implement analytical use cases if production has a “we don’t need it anyway” veto. At the
same time, Western producers, particularly, are dependent on quality to justify higher labor
costs. Therefore, if the quality in countries with high labor costs is not noticeably better
than in low-wage countries, management will outsource production in the long term.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
24.9 Oil and Gas 745
24.9.1 Vision
An oil and gas executive from startup AI-Oil has acquired the concession for an old oil field,
as well as the old drilling and production data. The data is a big part of the cost, but the
startup’s owner knows it will be worth it for the company.
The company’s self-developed AI program reads through the data. Soon, a green dot lights
up on the map that models the field. Then, another. And another. Slowly, individual clusters
form. After the program has worked its way through, AI-Oil’s lead data scientist runs an-
other function, and the possible production quantities appear on the screen. The result
looks good. Once again, it has been proven that it is still possible to “squeeze” some oil out
of fields that were already considered exhausted with more precise calculation models.
A service company immediately provides the estimated costs for the infrastructure, calcu-
lated in real-time with a view to the current workload.
A few more clicks and the program is connected to the data of the global financial markets
via add-on. The forecast of supply and demand is refined by automatically reading and eval-
uating press releases from all companies. Here, a factory for ultra-light plastics for aircraft
construction is opening; an old chemical plant is closing. The impact of these news items is
quantified and taken into account for the next few years. And if you know the demand and
the historical data, you can also determine how prices will develop.
In an interview with the press, the founders of AI-Oil express confidence. Soon, they are
sure that not only oil but also money will be flowing again.
24.9.2 Data
Few are aware of the petroleum industry’s pioneering role in data collection and processing.
Accurate data has always been necessary, from geology to chemistry to mechanical engi-
neering to proprietary petroleum engineering. As a result, vast amounts of data sets of all
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
746 24 AI in Different Industries
sizes and shapes are generated, from the first wells with a few dozen data points to area-wide
4D seismic images taken every millisecond.
And that’s just the static data; operational data from a global infrastructure– from thou-
sands of mobile devices and personnel to constant sensor monitoring of pumps and the
pipeline network – is added all the time.
Hardly any other industry is so dependent on the global economic situation. That’s why
quite a few oil companies invest a lot of time and energy in evaluating economic data and
news. How, for example, will the political climate in Kuwait or Ecuador affect production?
Could developments in the US or China curb or increase demand?
The industry has been instrumental in developing and implementing the latest technology
throughout the data cycle. As a result, new sensors have been developed for data acquisition
that can operate in the most adverse conditions, under hundreds of bars of pressure, in toxic
and corrosive environments, and in extreme temperatures, both inside the Earth and on the
surface in the Arctic, with unprecedented accuracy.
Microchips were developed for processing, which later found use in virtually every home,
whether inside Intel computers in early applications of GPS or in connecting to the Internet
in remote areas. To process the terabytes of seismic data these microchips accumulate, the
world’s largest supercomputers include petroleum companies. Moreover, the data is
presented in a visually appealing way in 3D projection spaces.
Data protection is, of course, essential for petroleum companies. As one of the most profit-
able industries, business data theft can bring devastating damage. Chip cards for identifica-
tion, now found at every ATM, originated in the oil industry.
Field Discovery
Until a few years ago, seismic data was mainly interpreted by hand. Computers were used
to process and display them, but experienced geologists did reservoir recognition. Field
discovery is about visual pattern recognition, which is now gradually being automated by
machine learning.
What used to take months is now done in days. We achieve a decisive improvement by eval-
uating existing data with new methods so that new measurements are either not necessary
at all or only in a targeted manner.
Another aspect is the rise of fracking, also called hydraulic fracturing. We fracture bedrock
formations by a pressurized liquid and therefore have an alternative way to extract fossil
fuels. There are many data-driven explorations to be done to find the best sites for fracking.
Scientific Data
On the one hand, new machines and sensors are being developed in laboratories. But, on the
other hand, exploitation of oil reserves is only possible through precise geological and
chemical knowledge of the earth’s interior. Therefore, a lot of deterministic research is be-
ing done on the processes. But since the rocks hidden inside the earth are difficult to grasp
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
24.9 Oil and Gas 747
and we cannot recreate everything in the laboratory, many findings are based on stochastic
and statistical methods.
Thus, the work of a data scientist can become a “home game.” While in other industries,
they still have to justify their way of working because many people are not aware of the
background and don’t have the technical knowledge to understand it, they can assume that
experienced employees in the oil industry will work better with them because they know
their requirements.
Predictive Maintenance
Oil and gas is a heavy industry, perhaps the “heaviest” there is. Machines as big as build-
ings, ships, vehicles, pipelines and refineries are as much a part of it as uninterruptible
processes that run for days or weeks. And all of this is usually on the high seas, in the
Arctic or the middle of the jungle, with an arrival time to the production facility of two to
three weeks. If a device fails or individual components fail, costs rapidly grow into the
millions as production comes to a standstill in the worst-case scenario.
Predictive maintenance is just as important here as knowing which spare parts you will
need. For this reason, the industry has one of the largest redundancy inventories of any. So
every possible optimization here means savings in the millions.
Predictive maintenance can help to detect wear and tear on equipment by evaluating sensor
data. At the same time, it helps to determine, based on historical data, which spare parts
need to be on-site and in what quantity in order to enable smooth operation.
Process Management
After we find an oil well, setting up and maintaining a production facility is a huge under-
taking. Thousands of specialists need to be coordinated, hundreds of transport vehicles
acquire working materials and haul away raw materials. This process involves numerous
different companies that also need to be coordinated. Whole floors of planners are employed
to ensure that everything runs smoothly.
Here, we gain a lot of efficiency through machine learning and process optimization soft-
ware. A concrete example is the modeling of demand. As soon as a model is generated that
predicts the need for various components, the planners’ work is made easier.
Geo-Engineering
The Paris Agreement demands zero carbon emissions until 2050. This is terrible news for
an industry that depends on fossil fuels. However, there are research projects on removing
carbon from the air with new technologies or methods, called geo-engineering. If there is a
breakthrough, the removal of combustion engines might be deprioritized.
No matter if it is about devices to bind CO2, mirrors reflecting sunlight, breeding plankton
farms, a lot of data is involved. Therefore, it can be an exciting field for data professionals.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
748 24 AI in Different Industries
24.9.4 Challenges
Despite applying new, disruptive technologies, the oil and gas industry, dominated by a few
substantial companies, is very conservative. The challenge here is not in the central busi-
ness as the latest and best methods are always used in exploration and production, giving a
clear advantage over the competition.
But it’s hard to bring corporate governance into the digital age. These are huge companies
with hundreds of thousands of employees working in dangerous conditions in hundreds of
countries. Millions and billions of euros are at stake here.
All companies are committed to digital transformation. The challenge is to bring dozens of
different departments and functions up to the same standard. The organizational hurdle
here is many times higher than the technical one.
However, those who manage to digitize all processes, from personnel management to mar-
keting, and from maintenance to operational planning, and who can connect all their data
effectively will realize substantial efficiency gains.
■ 24.10 Retail
Many people don’t think of a supermarket in terms of data analysis. For them, a supermar-
ket is still a domain where people deliver goods, people sort goods, and at the checkout,
there are usually still people who receive a customer. In other words, an industry in which
data analysis is not relevant.
But the reality is different. Product placement is an all-important philosophy that can
become critical to the existence of suppliers. Store managers have to place products with a
high probability of purchase optimally. It is also essential that neighboring products should
give the buyer ideas about what else to purchase. Ideally, the customer is presented with the
products logically in a structured order from the beginning of their journey through the
supermarket, right up to the checkout.
24.10.1 Vision
Your self-driving car drives you home from the office. You know full well that your personal
AI assistant at home will organize the refill of the fridge and freezer with fresh goods in
your absence. This automation means that, in theory, you don’t have to worry about any-
thing. If something runs out in the household, an AI will reorder it. Based on your past
personal preferences, Alexa, Siri and Co. know what should not be missing. And if you ever
get a craving for something new, all you have to do is say, “Alexa, tonight will be a schnitzel
night.” The only question Alexa might ask you is whether you want to prepare the schnitzels
yourself or have them delivered ready-made. But as a rule, Alexa will be able to calculate
this request herself.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
24.10 Retail 749
Shopping in a supermarket has changed. Sure they still exist, you can find them on every
corner, but shopping looks very different now. Instead of putting goods into an actual shop-
ping cart, shopping consists of trying and looking. With all the automated services available
now, manual shopping is all about the experience. Supermarkets are paying more attention
than ever to the shopping experience to give customers an incentive they wouldn’t have in
an online store. In addition, the payment process is handled in the background.
24.10.2 Data
Customer experience managers can attach sensors to shopping carts to track customer jour-
neys. For example, where standing times occur, it is essential to analyze whether customers
cannot find the products they would like to buy.
Video cameras detect when people are entering and leaving. But, of course, it would violate
data protection rules to identify individuals. Still, it is at least possible to decide how many
people are in the store and, if necessary, also determine a few characteristics about them.
For example, what is the approximate gender distribution, estimated average age, and so
on?
Another topic is sensor data on products or shelves. Here, too, it is possible to deduce
customer behavior. For example, an AI can determine which shelves need to be restocked
and when. Models from this data can help to optimize stock levels.
AI-assisted Purchasing
In the vision, we had Alexa, Siri, Cortana, and co place orders themselves to make sure you
have everything you need. In this use case, we assume that you want to go shopping your-
self.
Electronic AI assistants may send you different products while you’re in the supermarket.
“Alexa, where can I find the spices in this supermarket?”
“Two shelves over, you’ll find them. Do you want a recommendation?”
“Sure. What do you recommend with moussaka?”
By scanning the product codes, the AI could also find out whether there are any substances
in a product that you or someone in your family is allergic to.
Product Placement
As a market operator, you want to know which products you need and which ones you don’t.
Which ones sell well, which ones don’t? Which product group B will a customer buy if it is
next to product group A?
Product placement is one of the central topics for optimizing a supermarket in terms of
purchasing efficiency. Products that logically belong together are grouped, and an optimized
assortment is used to try and prevent slow-moving goods.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
750 24 AI in Different Industries
AI can generate models that perfectly optimize each supermarket, adapting placement to
the environment and typical customer demographics. It can also identify buying trends at
other locations and derive marketing campaigns for individual products.
24.10.4 Challenges
Retail has traditionally been a conservative market. Some providers do not even offer their
customers so-called loyalty cards. Management often tries to keep IT costs to a minimum,
and POS systems usually work with old software.
Covid has created a lot of pressure on retail. Many customers have become acquainted with
ordering online and e-commerce software offerings have increased. Companies selling prod-
ucts only online also have fewer costs and therefore can offer their goods cheaper.
Until now, Amazon has only been considered a competitor for suppliers of non-perishable
goods. Now, the company is also putting pressure on well-established supermarket chains.
Amazon invests in brick-and-mortar stores and delivers groceries to customers’ homes via
its subsidiary, Amazon Fresh.
The business may not be profitable yet. Older people, in particular, are used to going shop-
ping and don’t see the appeal in an online grocery store. Digital natives, however, who are
familiar with ordering online from a young age, will also order groceries online. The l ocation
of supermarkets further influences the acceptance of virtual grocery shopping. Suppose
there is no supermarket near the home. In that case, people will be more inclined to try
alternative virtual shopping methods.
Another aspect driving the virtualization of all purchasing processes is self-driving cars and
the resulting optimization of delivery logistics through AI, which will significantly reduce
shipping costs.
would be worth nothing without internet access. All services, such as YouTube, are pro-
vided through telcos. It is similar to Netflix: without high bandwidth, the use of this service
is not possible.
Telcos have access to a lot of data, and this data offers plenty of opportunities for new busi-
ness models. We can identify three critical areas that have future potential:
improvement of customer relations,
internal optimization and
the possibility to build up new business areas with data.
24.11.1 Vision
Imagine you are walking on a shopping street. You have shared your profile with your inter-
net provider for data analysis and receive complimentary internet in return.
While you are shopping, you receive some special personalized offers based on your loca-
tion. After finishing shopping, your phone recommends an event in your area that matches
your preferences. A suitable companion is also quickly found, as there is someone nearby
you know who shares your interests and with whom you can attend the event together.
Before that, you go shopping in a supermarket. Since this supermarket has also analyzed its
customers using telco data, you immediately find what you want: Your mobile phone guides
you to the goods.
24.11.2 Data
Churn/Customer Loyalty
An existing contract hardly causes any costs for a provider. Whether a customer makes
more or fewer calls makes no difference from a cost perspective. Either way, the infrastruc-
ture must be available to offer the services. Consequently, all telcos need to retain their
customers as much as possible.
Analytics can help build models that assess a client’s loyalty and predict the likelihood of
them abandoning the company, which is referred to as the ‘churn rate.’
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
752 24 AI in Different Industries
Location-based Offers
Customers moving along shopping streets can be offered matching deals. The telco recog-
nizes that they are in a particular position. Suppose a match to the customer preferences is
found via AI. In that case, the customers are informed about special offers only if they have
consented to use their data for these purposes.
There are certain opinions that telcos can earn more via location-based offers than via
contracts. In this way, at least some contracts could also be financed via advertising in the
future.
Mobility Solutions
Telcos have large amounts of data about the movement behavior of their customers in their
mobile network. Based on this movement data, a telco can precisely tell how many visitors
have attended an event or have seen an advertisement. Based on mobility data, this is, of
course, particularly interesting for outdoor advertising. However, these solutions are even
more interesting for retail: A telco can tell a mall very precisely how many users and what
kind of users are there (based on income bracket, socio-demographic characteristics, and
the like). Another advantage here is seeing which customers are going to the competing
supermarket around the corner.
Network Planning
The planning of the 5G network is primarily done based on data. For this purpose, telcos
analyze their customers’ location in order to determine who would switch to a 5G tariff
early. Since the 5G rollout will be very cost-intensive, the telcos will implement it in the long
term and first and foremost in areas where customers are willing to pay more for a 5G
add-on package.
Another critical point here is the improvement of the existing network. For example, CDRs
can be used to identify where calls are repeatedly dropped. Algorithms can then decide
where to invest. This is important because customers who experience poor network quality
are more likely to switch providers.
Marketing Automation
A topic that affects very many industries is the automation of marketing measures in the
company. Especially in saturated markets like the telecommunications industry, there is
hardly any growth potential left. Therefore, it is imperative to be able to address customers
directly and relevantly. This only works if you have a comprehensive customer data platform
(Customer Subscriber Data Platform). This then decides which customers to address in
which channels and with which message. The offers are highly personalized. This means,
for example, that customers who have been using an iPhone for years also receive corre-
sponding offers since there is probably hardly any interest in another smartphone.
Call Centers
Telcos often have huge call centers in which they serve their customers. There is a multi-
tude of optimization possibilities here. One major challenge is the management of “caller
peaks.” Here, a telco tries to calculate the peak load and design the call center based on that.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
24.12 Transport 753
With intelligent algorithms, this load can be better distributed, for example, by putting
customers with a high priority in the front of the phone queue and customers with a low
priority behind.
Likewise, telcos may inform customers about problems in advance. For example, when a
network outage or network problems occur, the number of calls to call centers goes way up.
Therefore, proactive information can significantly improve the service.
Another future scenario is voice recognition. You will call the call center in a few years, and
you will first speak only with computers. These are already so intelligently designed that
humans will hardly recognize this. Only when the algorithm no longer knows any answers
will you be connected to a human.
24.11.4 Challenges
Customers will understandably not always agree to their data being evaluated. Accordingly,
it is crucial to adhere to data protection rules so that only those customers receive data-
based offers that expressly agree.
Also, the data quality delivered over mobile networks is not always satisfactory. This affects
position data, for example, as different terrain affects the quality of the radio signals. Meth-
ods such as triangulation can improve location determination, but these methods also have
limitations.
■ 24.12 Transport
This section will focus on freight transport, which freight forwarding companies carry out
by road and rail. The core expectation is that goods arrive safely and on time. Therefore,
resource planning, pricing, and cost reduction are essential to companies.
There is another aspect of AI, especially regarding reaching sustainability goals: Optimizing
logistics to reduce the number of freights or kilometers on the road. For example, algo-
rithms can find more efficient routes or combine deliveries.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
754 24 AI in Different Industries
24.12.1 Vision
You have purchased a product through an online shop, which is an original that must be
shipped. We emphasize this because, in the society of the future, many things can also be
produced in-house by 3D printers.
An application will show you how long it will take to get this product to you immediately.
In the days before the information society, there were lots of steps along the way. First,
someone had to package the product, take it to the post office, where it was sorted, f orwarded
to distribution points, etc. Then when the package was with you in a few days, you had to be
home, because otherwise the postman would just leave you a notice and you’d have to go
back to the post office. But that wasn’t the end of it. After you had unpacked the goods, you
also had to dispose of the packaging material.
But we live in a digital society, where everything is automated. So after the purchase, the
journey starts immediately. Robots pack your goods in a standard reusable container, which
you can return when you receive the goods.
You can always track your product during delivery and when it will be with you via the
transport network. From a central warehouse, robots put your goods on an autonomous
driving truck. On various waypoints, some goods might be unloaded just to be loaded on
different trucks by robots. The whole transport network from the store to your home is a
huge graph in which sophisticated algorithms have calculated the most efficient path to
your home. Some also talk about an “IRL TCP/IP system,” meaning a TCP/IP system in real
life. If the package is small enough, fast delivery is also possible through drones.
Delivery services can align the delivery of packages with your presence at home, or you can
grant a delivery service temporary and monitored access to your home. In addition, if a
drone delivers a parcel, it can also be left on a terrace or balcony.
24.12.2 Data
Imagine a digital twin based on a vast graph system that maps all routes and means of
transportation which are currently transporting products from A to B. You can turn the
entire transportation system into a TCP/IP network. For example, vehicle A brings goods 1,
2, and 3 to Deposit A. Goods 1 comes to vehicle B, which travels to another node, where
these goods are in turn shipped to vehicle C.
The entire route calculation is optimized so that each product reaches the end customer
with as little effort as possible.
Predictive Maintenance
In all transport companies, defects are expensive. Imagine a train is canceled or delayed. In
some freight, penalties are incurred. A delay can mean that other trains are also delayed
with the railways, especially if they have to wait for a train to depart.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
24.13 Teaching and Training 755
Predictive maintenance provides the information that maintenance staff needs to detect
possible defects more quickly. You can use sensor data and measurements to determine
whether defects are probable shortly and whether it makes sense to replace worn compo-
nents beforehand.
Resource Management
Imagine you want to optimize the loading of a truck or a freight car. How you arrange your
goods and load individual containers and cargos to maximize resource utilization matters
significantly.
Data Science can optimize the load distribution on your transporters, which helps you save
costs.
24.12.4 Challenges
From a social perspective, many view self-driving cars in transport ambivalently. For exam-
ple, the working conditions of haulage drivers are often strenuous. Even with only a few
breaks, they hardly manage to meet ambitious delivery deadlines. So, drivers would benefit
from autonomous trucks if they could “share the route” and rest in the truck while the
autopilot is active.
But what if a truck can go the road seamlessly? Many drives might get worried about their
jobs, although it might take a while until they are fully replaceable. We may see Uber as an
alternative form of taxi service. Still, existing taxi companies in many countries tried to
fight Uber through legislation and lobbying. If there is already a resistance against Uber,
which still employs human drivers, we can expect more resistance when services can
replace (truck) drivers.
There had always been outliers in our societies. Sometimes they are mavericks
who challenge the status quo and bring us forward. Sometimes, however,
these people would be called “village idiots” or strangelings in earlier times. It
is a different debate if the outcasts just missed the right moment to become
mavericks or if fate was just cruel to them.
As they do not fit in, they often try to get attention and do or say weird things.
Mothers teach their children to ignore village idiots, and commonly, the rest
knows from experience as there had been too many incidents that wisdom is
not one of village idiots’ traits.
In social media, those outcasts who would be otherwise “village idiots” may
gain an audience as no one right away understands that they are quirky. So in
the future, if you have a heated discussion with someone, try to imagine that
you might be talking to the kind of person that your mother would have
warned you about and stop wasting your precious time.
On the other hand, if you cannot remember who the “village idiot” was in the
past and remember that people already had heated discussions with you or
ignored you, you should start asking different questions.
24.13.1 Vision
The more AI that is available in schools, the more individual attention students can receive.
But, unfortunately, many of us still grew up in a school system where teachers primarily
gave frontal instruction.
The modern school begins with a placement assessment to determine what the child brings
with them regarding resources and abilities when they enter school. The ultimate goal is to
nurture talents, not force the child into a system that dictates what is right or wrong, regard-
less of their abilities.
Performance assessment uses the full range of data science methods. In addition, we can
analyze the language of students to assess how they approach problem-solving. The evalua-
tion also explores how students take in information, visual or auditory, for example.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
24.13 Teaching and Training 757
The content transfer takes place digitally to a large extent and contains constant quality
improvements. In modern learning centers, teachers change from frontal teaching to men-
toring. The interaction between individuals is essential as we can verify immediately
whether the student has understood the content or not.
Playfully, the children learn new skills and acquire knowledge. Curricula no longer exist;
learning objectives are adapted individually. AI also helps to understand which teaching
methods work better with which students. According to the principle of “more of the good
and less of the bad,” we can adapt didactics individually.
Mentors prepare children for a changing time. Students, therefore, learn only the most
essential facts because they can look up details from the Internet. But, above all, they learn
behavioral patterns to identify objective information and separate them from untruths
successfully.
New data-driven learning platforms also teach behavior patterns, such as solving problems
efficiently and dealing with unexpected challenges. In the past, some students had learned
how to prevent mistakes to get good grades with mediocre but less error-prone works. In the
future, students are encouraged to try new things and take risks as their mentors expect
them to make mistakes and to learn from constructive feedback.
24.13.2 Data
Data for education can be, for example, any form of testing such as placement tests, but also
teacher evaluations and the way teaching materials are used.
Data from test results needn’t necessarily reflect right or wrong. Natural Language Process-
ing can also be used, for example, to draw conclusions about expressive ability and whether
students may be acting in destructive patterns that prevent them from learning success-
fully.
If students use social media, supervisors can determine in which direction they are devel-
oping. We can generate models to understand how they learn and think.
MOOC
MOOCs (Massive Open Online Courses) have become an integral part of the educational
experience. Udemy, Udacity, Pluralsight, EDX, Coursera, and many other online learning
offerings are flooding the market.
In the race for the best quality, one question is how to measure the courses’ quality and
detect room for improvement. Today, quality assurance systems get feedback from users per
lesson and calculate dropout rates or how often students repeat segments of a video.
AI will enable us to assess the efficiency of content in new ways. For example, once we can
determine when a significant number of students lose their attention span in a video, we
might adjust the content on a segment level. In addition, MOOCs might become far more
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
758 24 AI in Different Industries
interactive, and this also means that we can deduct the efficiency of exercises by looking at
students’ behavior.
Personalized Learning
All students have individual strengths and weaknesses. Training institutes can build pro-
files that record how someone thinks and learns. Supervisors can compare personal profiles
and assemble learning groups that could fit together based on an evaluation.
Curriculums can offer individuals courses in the form of a recommendation engine. The
more feedback students give, the more accurately an AI-based teaching assistant recom-
mends courses to students that match their skills and potential.
24.13.4 Challenges
Critics claim that the existing school system is not flexible enough to adapt to the demands
of an economy with dynamic priority changes. Instead, they see educational systems as
rigid organizations where teachers have acquired a fixed routine in transmitting knowl-
edge. As a result, those who have been used to this routine have difficulties adapting to a
new teaching style.
We outlined in the use cases that the more we study students’ behavior during classes, the
more we understand how they learn, the more we can improve their learning experience.
However, these assessments might go so far that we measure their attention level through
facial expressions. Therefore, we can see a risk that the detailed analysis of how students
learn is an invasion of privacy and problematic for data protection.
innovation replaces teachers, taxi drivers, and supermarket employees with machines just
as gradually.
Science Fiction literature shows options for utopian and dystopian outcomes. Some see
digitalization as progress, others as a threat. So the essential question we have to ask our-
selves is: What is the role of humans in a fully automated world?
Wikipedia lists some sources discussing whether governments should tax machines instead
of labor8 in scenarios where a machine replaces human labor to “make AI compatible with
social systems as well.” There is also a discourse that suggests that we need to question our
economic system if we want to live differently in the future. A financial system that thrives
on income from human work may no longer fit the purpose in the age of automation. We will
also have to ask ourselves whether everyone needs to work to sustain society. According to
Maslow, once we have fulfilled our basic needs, the purpose will become more important for
humans. Therefore, human sciences may experience a renaissance via digitalization.
For skeptics, the automated society is an illusion. They stress that enthusiasts vastly over-
estimate technologies, and citizens underestimate them. They fear a surveillance society
fostered by Big Data. And they want to fight against it.
You can counter that the generation that lived at the beginning of the 20th century could
not have imagined a mobile phone either. If you had told someone 50 years ago that we
could receive global knowledge through the Internet, many would not have believed you.
Innovations come in spurts, and each new trend triggers a chain of events. For example, in
2005, YouTube came on the market; today, people make a living by marketing themselves
on YouTube and driving up their viewership numbers. People also make their living as pro-
fessional computer gamers. And others advise people who have burnout caused by digitali-
zation.
One should not leave dystopian visions unmentioned here. Many also see Big Data as a
precursor to worlds like those depicted in ‘1984,’ ‘Terminator,’ ‘The Matrix,’ or ‘Minority
Report,’ in which intelligent machines take control and enslave humanity. Fears range from
the total annihilation of humankind because computers classify it as a virus to the fear of
total surveillance. But even less apocalyptic scenarios strike many as frightening. A nega-
tive example of a data-driven company could, for example, also conclude that it can do
without various employees and terminate them for this reason. Such a scenario is not par-
ticularly optimistic either. Identity theft may happen as such in ‘The Net.’
We also have to be aware that militaries will fight future wars with drones, and we have seen
that these harbingers of destruction with the help of AI can acquire targets on their own.9
At the same time, we might find a counterargument in another example: In ‘2001: A Space
Odyssey,’ HAL 9000 is often seen as an evil computer. But in another way, we can interpret
his actions as quite egomaniacal, a human trait. He believed that he simply could not err.
This phenomenon has been seen a lot with narcissistic leaders. How many leaders were so
convinced about themselves that they accepted total destruction? We fear computers might
control us at some point, but who is scared about humans who govern us with mass hyste-
ria, as often seen in fascist systems?
8
Wikipedia: Machine tax, 2018
9
https://www.independent.co.uk/life-style/gadgets-and-tech/drone-fully-automated-military-kill-b1856815.html
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
760 24 AI in Different Industries
The digitalized society brings challenges as well. It will always be possible to get out, go into
nature, and live far away from the modern world. Yes, maybe it will be more accessible in
the future because such a society can allow dropouts to live such a life, as many can live well
without a regular job thanks to automation. We believe that the realization of the digitalized
society will lead to the individual being in the center. Everyone will have the opportunity to
realize their individual lives and live their true will.
■ 24.15 In a Nutshell
How can data professionals use their skills to explore analytical ways to
fight climate change?
How can we find ways to reduce existing carbon emissions through data
analysis?
How can we support the process of removing carbon emissions from the
atmosphere using our data science skills?
How can we benefit from modeling the effects of climate change on our
environment using digital twins?
How can AI help to reduce the risk of more radical approaches, such as
geoengineering, to support climate transition?
■ 25.1 Introduction
Newcomers to data science use cases might wrongfully assume that data scientists focus
exclusively on maximizing profit using data: replacing humans with chatbots to reduce
costs, for example, or using AI to create personalized customer offers to maximize sales.
These and other use cases indeed highlight how companies often strive to use data science
and artificial intelligence to maximize their financial success. Yet, at the same time, many
data scientists are dedicating their work to “the greater good,” such as addressing the UN’s
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
762 25 Climate Change and AI
Governments regulate how much CO2 companies are allowed to emit. The
maximum of CO2 a company is supposed to emit is called a cap. Companies
who exceed their cap can purchase carbon credits from companies who are
below their cap. In other words, this system, also called cap and trade,
enables companies who optimize their carbon emissions to create new reve-
nue streams by selling their unused quota for emissions and leads to new
cost centers for those who ignore their carbon emissions.
Such regulations create incentives for companies to decrease their carbon
emissions, but such regulations can also be misused. We also want to
address greenwashing later in this chapter to show how businesses may
cheat to gain a greener image.
1
https://sdgs.un.org/goals
2
https://ai4good.org/
3
Herzog H. (2018): Carbon Capture, MIT Press, September 11, 2018, ISBN: 978-0262535755
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
25.2 AI – a Climate Saver? 763
Anthropogenic emissions – that is, those resulting from human activity – cause climate
change. To fight climate change, we must, therefore, cut this type of emissions. But what
exactly do we want to reduce? The following chart in Figure 25.1, reproduced from “How to
Avoid a Climate Disaster?” by Bill Gates, quantifies GHGs by origins.
Analyzing the origins of GHGs gives us room to speculate: Is reaching the net zero goal in
the aviation industry more difficult than in retail? Would we be carbon-neutral if we all
drove electric vehicles? How much can we reduce climate emissions by building more inno-
vative and energy-efficient offices?
We can assess an organization‘s emissions through carbon accounting. As with any other
accounting process, carbon accounting uses a structured process to measure and monitor
how many GHGs a company emits.
4
Bill Gates: How to avoid Climate Change
5
https://www.ipcc.ch/sr15/chapter/spm/
6
https://www.nytimes.com/2023/05/01/technology/ai-google-chatbot-engineer-quits-hinton.html
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
764 25 Climate Change and AI
Figure 25.1 How much greenhouse gas is emitted by the things we do?7
7
Copyright: Bill Gates, How to Avoid a Climate Disaster?
8
Copyright: GHG Protocol, https://ghgprotocol.org/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
25.3 Measuring and Reducing Emissions 765
Figure 25.2 depicts the GHG Protocol, one of the oldest and most widely used GHG account-
ing standards for calculating and reporting carbon footprints.9 In this protocol, GHG emis-
sions are categorized into three scopes, as follows:
Scope 1 emissions are direct emissions from sources controlled or owned by an organi-
zation. These emissions can arise, for example, from running machinery to make prod-
ucts, driving vehicles, heating buildings, or powering computers.
Scope 2 emissions are indirect emissions created by the production of the energy that an
organization buys. Depending on the physical location of the facilities, supplies may orig-
inate from fossil fuel sources. Installing solar panels or sourcing renewable energy would
reduce a company’s Scope 2 emissions.
Scope 3 emissions are indirect emissions but differ from Scope 2. They include the emis-
sions produced by suppliers making the company’s products or by customers using the
company’s products. Scope 3 emissions are usually the hardest to measure and tackle.
Unfortunately, they often account for more than 70% of an organization’s carbon footprint.
Companies will need to cut emissions across all three scopes to meet internationally agreed
goals on global warming. Yet these scopes give us room to speculate again. Should we in-
clude in Scope 3 the emissions from employees driving to work? How much would it affect
the Scope 3 balance if they worked from home instead? And if we agree to mark drives from
home to work as Scope 3 emissions, how do we want to collect and audit this information?
Scope 3 emissions might get even more complex if we analyze scenarios related to supply
chains. Imagine a vendor providing a rare intermediary product to a company that gets
integrated into the company’s final product. What happens if new government regulations
force the company to report carbon emissions of all intermediary products used and the
vendor fails to report the emissions? If the vendor can supply carbon emission data, how
can we ensure this information is correct, and who is liable if the information turns out
wrong? And as carbon emissions of products may change over time, how do we organize a
change management process?
Let’s look at how we can use data to measure and reduce GHG emissions.
9
https://www.onetrust.com/resources/ghg-scope-1-2-3-emissions-infographic/
10
https://www.sciencedirect.com/science/article/abs/pii/S2352710223017783
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
766 25 Climate Change and AI
■ 25.4 Sequestration
The most vital metric for carbon emissions is parts per million (ppm), referring to the
amount of carbon dioxide particles in the atmosphere. Scientists have been tracking ppm
for decades, and at the current global average concentration of over 400 ppm, it is far higher
than before the Industrial Revolution, as shown in Figure 25.3.
11
https://www.sciencedirect.com/science/article/abs/pii/B9780323995030000016
12
https://www.sciencedirect.com/science/article/abs/pii/S0196890424000724
13
https://www.sciencedirect.com/science/article/abs/pii/S0950705121002756
14
https://www.sciencedirect.com/science/article/abs/pii/S221313882300543X
15
https://www.arinto.io/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
25.4 Sequestration 767
In the previous section, we highlighted that we can reduce or eradicate anthropogenic emis-
sions by becoming aware of how we create GHGs and changing our behavior accordingly. If
companies produce their goods in a less environmentally harmful way, or if we find ways to
avoid activities that lead to high carbon emissions, the global ppm may continue to rise, but
slower. Unfortunately, even if our behavior were immaculate and we could cease to emit
GHGs altogether, we would not be able to remove existing GHGs from the atmosphere. In
fact, existing GHG emissions can remain in the atmosphere for a long time, from months to
millennia.17 This means we need to think about how we can address a problem that has
already been created over the last decades: We need to explore methodologies to undo exces-
sive emissions from the past.
Carbon sequestration, the process of capturing and storing atmospheric carbon dioxide, is
one possible solution. It can help to reduce the amount of carbon dioxide in the atmosphere
and, in doing so, impede global climate change.18 Carbon sequestration usually involves
capturing and storing carbon that has the immediate potential to become carbon dioxide
gas. This can occur naturally, such as in the absorption of carbon by plants and via human
intervention by capturing gases before they are emitted into the atmosphere or by removing
existing gases already in the atmosphere.
There are two main types of carbon sequestration: biologic(also called biosequestration) and
geologic. Biological sequestration refers to the capture and storage of CO2 through natural
processes. Plants and fungi absorb carbon dioxide from the air and bind it into biomass.
Besides terrestrial sequestration, biosequestration also includes blue carbon sequestra-
tion, which refers to capturing carbon emissions through marine organisms, comprising 55%
of the total biologically captured carbon 19. Geologic sequestration involves capturing carbon
dioxide from industrial sources, such as power plants, and storing it in underground geo-
logic formations.
16
Copyright: https://www.climate.gov/news-features/understanding-climate/climate-change-atmospheric-carbon-
dioxide
17
https://www.theguardian.com/environment/2012/jan/16/greenhouse-gases-remain-air
18
https://www.usgs.gov/faqs/what-carbon-sequestration
19
https://www.sciencedirect.com/science/article/pii/S2772656822000367
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
768 25 Climate Change and AI
Photosynthesis is the process by which plants use sunlight, water, and carbon dioxide to
create oxygen and energy in the form of sugar. For many people, this process represents life
itself. And yet, the global rate of deforestation is alarming. Every year, an area of Portugal’s
size is cut down.20 This harms the environment in two ways: First, we destroy a carbon sink,
thus releasing tons of trapped carbon; Second, we use former forest lands for activities that
create additional GHGs. In addition, this leads to a vicious cycle: The warmer it gets, the
more CO2 is likely emitted through wildfires, which will heat Earth even further. Have you
ever explored how hot asphalt can get on a sunny day? Albedo is the fraction of light that a
surface reflects. Bright light has the highest albedo, leading to another vicious cycle in
which the less ground is covered by ice, the less efficiently the sun is reflected, and the
warmer it gets.21
So, how can we use data science and AI to support biological sequestration? Let’s look at
some use cases.
Wildfire Prevention
The media regularly reports about wildfires in the summer. Scientists agree that climate
change is causing the higher risks of wildfires. Computer vision can detect high-risk areas
and take preventative actions. Using data, we can also get refined weather forecasts. Simu-
lating wildfires in digital twins can also prepare firefighters for catastrophes. They would
then know where they must fight fires to minimize their spread. If there is supervision of
an area, this also reduces the risk of arson, as arsonists will be more careful if there is sur-
veillance.
20
https://ourworldindata.org/deforestation
21
https://climate.nasa.gov/explore/ask-nasa-climate/2594/sun-and-sky-snow-and-ice/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
25.4 Sequestration 769
Since the Industrial Revolution, we have extracted and burned enormous quantities of fossil
fuels from oil and gas fields to gain energy and emitted tons of carbon emissions into our
atmosphere. In its simplest form, geological sequestration involves capturing carbon emis-
sions and injecting them into depleted oil and gas fields or other rock formations.
We can capture emissions while they are being produced; imagine burning fossil fuels and
venting the CO2 into containers, such as vast depleted gas fields. Related to that but still
different, Direct Air Capture (DAC) is a process using mechanical devices to remove GHGs
that are already in the atmosphere. Imagine a DAC as an array of vast filters; air passes
through them, and liquid solvents or solid sorbents react with the CO2 in the air, conse-
quently removing the GHG particles. Some may call this a paradox: We invest energy to run
DACs to undo what was previously done in pursuit of gaining energy. We must acknowledge,
however, that we have a high number of ppm in the atmosphere due to past negligence, and
now we need to think about how to reduce that number again. If we are not able to remove
these emissions, we will not be able to meet the net-zero goal. Removing past emissions is
especially important to achieving net zero, given that removing all GHG emissions from
current industrial and human activities will be almost impossible. Researching leading
companies in that field, such as Climeworks and Carbon Engineering, will provide inter-
ested readers with more information.
The most essential metric for the carbon sequestration business is the cost of removing
one ton of GHG from the air. Various sources put the current figure at between 250 and
600 USD.22 If we can reduce this cost to below 100 USD per ton, the business case becomes
viable.23 Some see this number as the holy grail of fighting climate change, as it would gen-
erate a surge in commercial interest in building sequestration facilities since they would
offer guaranteed profits for years. Many researchers24, however, highlight that reducing
costs is tricky. One challenge identified in papers is that it is hard to cut energy require-
ments and that the sorbents used to capture the GHGs are expensive.25 Let’s look at some
use cases again.
Optimizing costs
We’ve just seen that lowering the cost of reducing airborne GHGs is the most critical metric
for generating viable carbon sequestration businesses. Data scientists may help achieve this
goal by optimizing the supply chains that build and operate carbon sequestration installa-
tions. For example, if the prices of raw materials are reduced, the overall costs will be re-
duced in tandem.
Most of the costs, however, derive from the significant energy required to run carbon se-
questration facilities. We can optimize these costs by operating only during times of low
energy prices, such as when solar energy is provided during peak sun times or when the
general energy demand in a region is low. Data scientists can build models to forecast
energy prices, including factors such as weather or energy demand.
22
https://www.wri.org/insights/direct-air-capture-resource-considerations-and-costs-carbon-removal
23
https://www.protocol.com/climate/carbon-dioxide-removal-cost-target
24
https://www.iisd.org/articles/deep-dive/why-carbon-capture-storage-cost-remains-high
25
https://www.sciencedirect.com/science/article/pii/S2590332223003007
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
770 25 Climate Change and AI
26
https://climate.mit.edu/ask-mit/what-risk-co2-stored-underground-after-carbon-capture-will-escape-again
27
https://www.ft.com/content/6f6e808d-fe51-4038-8963-cd0497ca7f57
28
https://www.sciencedirect.com/science/article/abs/pii/S0048969724002195
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
25.6 Geoengineering 771
losses for winemakers. More importantly, failures of food crops will cause poverty and hun-
ger, which may in turn lead to war and mass migration. Finally, the loss of plant life can
cause an uptick in CO2, which leads to ocean acidification and yet further loss of biodiver-
sity. In short, the consequences of climate change can be seen as a vicious cycle, where one
disaster leads to another. Let’s look at possible data science use cases.
■ 25.6 Geoengineering
Over 60 million years ago, a comet hit the semi-island of Yucatan. The impact threw so much
ash into the atmosphere that it blocked out a large amount of sunlight, which caused the ice
age that killed the dinosaurs.
This scenario might inspire entrepreneurs who want to fight climate change. What if we
could reverse the heating of our planet by putting molecules in our atmosphere that act like
“sunlight blockers”? Although many might doubt this can be achieved, some entrepreneurs
are exploring it as a real possibility.31 And this type of activity has a name: geoengineering.
29
https://www.sciencedirect.com/science/article/abs/pii/S0022169423002093
30
https://www.sciencedirect.com/science/article/pii/S221209632200078X
31
https://makesunsets.com/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
772 25 Climate Change and AI
Geoengineering can be seen as one of many ideas that do not follow traditional approaches
and where entrepreneurs instead get to think out of the box. Many geoengineering tech-
niques sound not only wild but downright dangerous. The curious reader can visit Geoen
gineering Monitor33 to explore various techniques and learn the extent to which they are
purely theoretical versus ready to be implemented. From microbubbles to cirrus cloud thin-
ning, everyone can decide whether these perhaps crazy-sounding ideas can succeed.
32
https://www.xprize.org/about/mission
33
https://www.geoengineeringmonitor.org/technologies/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
25.7 Greenwashing 773
■ 25.7 Greenwashing
These days, there is much talk about how Millenials appreciate ESG (environmental, social
and corporate governance) values.34 As the name implies, these values go beyond climate
concerns and include other concerns, such as inequality.35 Many people—especially young
workers— are neither willing to work for nor purchase from companies that do not present
a liberal democratic agenda.
Whether the executive management believes in liberal democratic values or not, a compa-
ny’s claim to have strong ESG values can be good for business. Some companies might be
tempted to convey false or misleading information about how their products and practices
are environmentally sound. Imagine a company claiming its products cause fewer emis-
sions than they do. The Volkswagen emissions scandal is one of the most well-known cases
of greenwashing. Greenwashing goes beyond manufacturing products. Imagine a company
branding itself as a protector of a national park while there is, in reality, no harm to that
park. That company might want to claim carbon credits from the government for their
alleged protection of that park. Another greenwashing fraud would be if such a company
claims more carbon credits from this park than it absorbs CO2 or if it manages to sell this
carbon credit more than once.
34
https://www.shrm.org/hr-today/news/hr-news/pages/survey-esg-strategies-rank-high-with-gen-z-millennials.aspx
35
https://corporatefinanceinstitute.com/resources/esg/esg-environmental-social-governance/
36
https://www.visualcapitalist.com/life-cycle-emissions-evs-vs-combustion-engine-vehicles/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
774 25 Climate Change and AI
■ 25.8 Outlook
Fighting climate change with data has surprisingly many opportunities for data scientists
and AI experts, from using computer vision to prevent wildfires to using digital twins to
analyze the possible impact of geoengineering projects. Figure 25.4 presents an extensive
mapping of data disciplines to practical applications, and organizations like climatechange.
ai38 also support the development of machine learning cases to fight climate change. In
short, data scientists looking for a deeper purpose in life, who want to use their skills for
good, might find in climate change action a field that gives them enough challenges to one
day say, “I have done something.”
In this chapter, we have introduced many approaches that might lack economic viability,
such as DACs and graphene production. Decades ago, engineers claimed that photovoltaic
and wind energy were not economically viable and that governments would forever have to
subsidize them. Today, modern solar parks generate electricity with low operating costs
that seemed to be unimaginable years ago. Maybe the key question for data scientists and
climate innovators is not so much what else we can do but rather how we can make some of
the existing approaches economically viable. Here again, we see the applicability of a lot of
data science methods for industrial use cases: Predictive maintenance reduces the number
of plant downtimes, and the earlier we are able to detect potential problems and prevent
them from happening, the more we can reduce costs. Optimizing supply chains and trans-
port logistics optimizations may reduce the costs even further.
If an intelligent alien lifeform visited our planet, it might conclude that humankind is grad-
ually terminating itself. In the end, we have collected enough evidence that we are heading
toward a climate disaster, and we are still too passive to trigger enough actions to prevent
this catastrophe. It might still sound strange for some, but maybe we need machines at some
point to save us from ourselves.
37
https://lune.co/blog/what-is-double-counting-in-carbon-offsetting-and-why-is-it-important/
38
https://www.climatechange.ai/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
25.8 Outlook 775
39
Copyright: https://dl.acm.org/doi/fullHtml/10.1145/3485128
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
776 25 Climate Change and AI
■ 25.9 In a Nutshell
In this book, we have classified a ‘Data Scientist’ as a person that creates new knowledge
through data analysis. In addition, a data scientist often acts as an intrapreneur who influ-
ences strategic decisions as an ‘entrepreneur within the company.’
Data science experts and data platforms are only two of the building blocks required for
successful data science projects. The corporate culture and its ability to change are primar-
ily responsible for the success or failure of these ventures to gain new insights. Without a
willingness to rethink existing processes and structures, nobody can implement the knowl-
edge from this book satisfactorily.
PC Welt magazine has published many quotes from influential figures who didn’t believe in
change in their day.1 One crowning example is Steve Ballmer when he laughs about the
iPhone and emphasizes that business users “would never buy a phone without a keyboard.”
One recipe for “making the impossible, possible” is to believe in yourself, question what
exists, and not let the establishment dissuade you. In the biography of Steve Jobs by Walter
Isaacson, the author claims that the Apple founder simply ignored objections from engi-
neers that his wishes were technologically unfeasible. Isaacson called this trait of Jobs his
“reality distortion field.” The consequence was that this pressure sometimes inspired em-
ployees to go above and beyond. 2
In this book, we describe the result of the transformation of companies towards data science
as a digital society. The previous chapters explained how this could change how we live and
how targeted data analysis can turn entire industries upside down. Amara’s Law is also
essential here:
“We tend to overestimate the effect of a technology in the short run and underestimate the
effect in the long run,” – Roy Amara.
In times of rapid change, many citizens are also afraid of moving away from traditions. The
possible misuse of data is only one of multiple concerns. Many people fear no longer having
a place in the digital society. Digital natives who have grown up with innovations are more
adaptable than people who at some point trusted that they would retire with their first job.
It would be unrealistic to speculate that, in a short time, data science will digitally transform
the world. But, at the same time, we use technology standards today that were not imagin-
able in science fiction films of the past.
Sometimes, in this context, data scientists also have the task of breaking down old thought
patterns in companies. They have to convey change as an opportunity for the company to
put itself in a position to generate new knowledge from data.
Paradoxically, refusing to change because of security concerns leads to precisely the oppo-
site. Those who do not move with the times risk their very existence. There is no such thing
as “too big to fail.” Studebaker once had the world’s largest car manufacturing plant; Today,
you’ll find Studebakers in museums. The list goes on and on: Horten, Saba, Commodore, and
many more. Courses, such as Critical Business Skills for Success3, use case studies like
Netflix vs. Blockbuster to illustrate that leading companies with the wrong strategy can lose
their dominance. Some analysts even predict that the German automotive industry is at risk
of losing its position on the market as they fall behind in the race to develop autonomous
cars. 4
Who will perish or who might persist is not the subject of this book. Our task in this context
is to point to the notion of Science in Data Science. Science creates new knowledge, and new
knowledge leads to change. Companies that ignore change will not remain competitive.
1
www.pcwelt.de/ratgeber/Die_spektakulaersten_Fehlprognosen_der_IT-Geschichte-6948150.html
2
Isaacson, W.: Steve Jobs. Simon & Schuster, 2011
3
Conner, M.C.: Critical Business Skills for Success. Teaching Company, 2013
4
https://derletztefuehrerscheinneuling.com/2019/02/28/die-erschreckende-ambitionslosigkeit-deutscher-
hersteller-beim-autonomen-fahren/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
26.1 Data-Driven Mindset 779
One question that came up in recent years was whether Data Science tech-
niques might increase or decrease discrimination. The claim was that if you
provide unbalanced training data to AI models, the model will perform better for
that group. For example, suppose your training data consists of people with one
skin color in a facial recognition program. In that case, the trained model may
fail to recognize people of another ethnicity. Not only can this have detrimental
effects on the well-being of the underrepresented group, but it can also cause
severe consequences for the AI-program creators.5
Friends of data science often highlight that data scientists have the means
and the responsibility to remove bias because, in the end, sexual orientation,
religious beliefs, skin color, age, nationality, or gender are just parameters in a
formula. This approach to eliminate these parameters from analysis would align
with current political ‘hashtag movements’ goals.
Even if we could prove our algorithms were unbiased, unwanted side effects
could arise. What if, for example, we learned through data that genetic
differences have more significance than we thought? Could scientific theories
then inspire political ideas based on race or other genetic traits?
Data Science has the potential to help people make completely unbiased
decisions. For example, say we use algorithms to help automate candidate
selection during hiring processes: we could then remove all non-relevant
parameters from the decision-making process and that factors such as gender
or ethnicity cannot affect hiring decisions anymore. However, it might still take
a long time to replace a recruiter as machines cannot interpret interpersonal
relationships and various human traits, especially for personnel decisions.
Communication challenges and bureaucracy in companies grow with the number of em-
ployees. The associated inertia makes it difficult for corporations to react to new develop-
ments. Numerous articles report an increasing number of people who are dissatisfied with
their jobs in corporations. Many feel like a small wheel in the system.6
In an oppressive environment, a digital transformation driven by data science seems like a
possible cure-all for any form of corporate disenchantment. After all, strategy consultants
promise a different way of working. Data science is supposed to break up existing structures
and reshape them. So let’s have a look into some possible pitfalls during a transformation.
5
https://www.forbes.com/sites/mzhang/2015/07/01/google-photos-tags-two-african-americans-as-gorillas-
through-facial-recognition-software/
6
https://www.zeitgeistich.de/zufriedenheit-im-grosskonzern/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
780 26 Mindset and Community
For the implementation of a digital transformation, companies often seek the support of
external IT service providers. However, large consulting firms have the reputation of selling
a lot of management overhead in their services and not always acting sustainably in the
customer’s interests.
“Many of our technicians are not happy with management’s decision about the
contract with consulting firm X. Many of the representatives from the consulting
firm are still relatively young, and in addition, their high daily rates seem over-
priced.
One of our managers has worked for consulting firm X in the past. If the project
is in trouble, he and his pals from the consulting firm X will do everything to get
back on track, even at the risk that X will have no profit in that project. If, in the
end, the project with consulting firm X still fails, many stakeholders will high-
light the project’s complexity and that even consulting firm X cannot do magic.
However, if the decision-maker had hired a start-up, who might be a bit more
agile and innovative than consultant firm X, and if the project had failed, then
our stakeholders would have criticized the decision to hire a start-up for a large
enterprise project.”
For start-ups, it is a challenge to manage large projects which demand many different com-
petencies. For example, a complete data project often requires hardware, computer network,
operating system, security, and data experts. And each of these experts must also have
specific, sometimes specialized technical knowledge.
Big consulting firms are often not popular with engineers, who often claim that such firms
use their market power to get into large companies. People feel more sympathetic toward the
underdog, particularly the innovative start-up of ambitious young entrepreneurs without
the management overhead. However, we also have to look at the downsides of those start-
ups.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
26.2 Data Science Culture 781
“Start-up Y has promised to solve all our problems. They call themselves a ‘Bay
Area innovator’ and highlight their years of experience as a leading AI company.
However, during our engagement, it turns out that hardly any of them have
worked on a major AI project yet, and we found some of the profiles of their key
employees advertised as professional front-end developers for other clients.
Even the San Francisco address is just a mailbox address; most employees
work from Eastern Europe. So, though they try very hard and their rates are
slightly cheaper than a larger consulting firm, we end up being beta testers.”
It would be wrong to claim that start-ups generally overestimate themselves. Nor would it
be right to condemn all large consulting firms as overly bureaucratic organizations that only
present colorful PowerPoint slides. If there is one thing to conclude then, maybe it is that
each organization can develop a particular form of arrogance which leads them away from
every company’s natural main goal to bring value to their clients. The Dunning-Kruger
Effect mentioned later will go more into detail on this.
Many large organizations have reorganized and created organizational units which have
been decoupled from corporate politics and given the name ‘labs.’ These labs provide a
culture of innovation, which means working with universities and establishing a so-called
‘Tribe’ culture, as popularized by Spotify.7 Equally important is the workplace itself. Instead
of boring offices, there are workspaces specifically designed to facilitate a friendly, creative
working atmosphere.
If start-ups want to carry out a large project that they cannot manage independently, they
can join forces with partners. As a ‘bidding consortium,’ they can agree to distribute respon-
sibilities. However, partnerships also entail risks. Bidding consortiums fail if one of the
companies wants to increase their workload during the project and begins to “fish in other
people’s waters.” The question of trust also arises: A company can control its services and
quality standards, but how do they ensure that the partner company also treats the project
with the same professionalism?
Keiretsu is a cooperative model that helped Japan rebuild its economy after World War II.
Each firm in a conglomerate specializes in one particular product or service, and, in order
to ensure a collaborative partnership, each firm also holds shares in the other participating
7
https://www.atlassian.com/agile/agile-at-scale/spotify
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
782 26 Mindset and Community
firms. Thus, all companies will increasingly act in each other’s interests and positively in-
fluence their partners’ quality and processes.
Large corporations have the means to invest a lot of resources to win contracts with clients.
However, especially if their goal is to increase revenue, they sometimes accept making no
profit initially and are very generous with free initial prototypes to bind the client. This
aggressive behavior makes it difficult for small companies to compete.
The “SME Keiretsu” model outlined above can be used in the B2B sector to oppose the
corporate “Goliath” as an “agile David” as smaller companies share the risks and would also
be able to risk more as if they were alone.
Many data experts are turning their backs on the corporate world to become
independent experts. In doing so, many ask themselves whether they, as data
experts, should try to sell their knowledge as a service to clients (consulting)
or whether they should develop a product.
In consulting, you benefit from an immediate cash flow when you win a project.
The downside of consulting is that consultants often end up doing tedious
work. Many consultants exchange long-term job security against higher short-
term pay as day rates result in more money than a monthly salary. However,
those who enter product development must first deal with the venture capital
market since creating a product takes longer and must be financed. Product
companies also require more budget for product branding and sales.
If you want to understand how to do a start-up with an IT product, read the
book “Disciplined Entrepreneurship”8 and research venture capital. Be also
aware that it is hard to build a product without the support of others, whereas
as a consultant, you can work alone if needed.
8
https://www.amazon.com/Disciplined-Entrepreneurship-Steps-Successful-start-up-ebook/dp/B00DQ97TWO
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
26.2 Data Science Culture 783
What influence does the culture and mentality of a country have on its working environ-
ment? For a start, a country’s legislation and corporate culture provide the framework
within which companies can operate.
One often hears that start-up culture is lacking in some EU countries. Highly qualified
talents prefer the security of a salaried existence instead of realizing themselves as entre-
preneurs with new ideas. Many also complain that bureaucracy and lobbying are more
important than progress and that some countries are stagnant. Often, young people receive
too little support to pursue a future-proof career by studying STEM subjects. Critics accuse
the EU countries of promoting a culture of fear, which leads many people to view any kind
of progress negatively.
Critics of digitalization often speak of predatory capitalism and criticize that American
companies earn vast sums by snooping on people and do not even want to pay appropriate
taxes for this.
9
https://agilemanifesto.org
10
https://www.amazon.jobs/en/principles
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
784 26 Mindset and Community
Others emphasize that a Western-centric view of the world overlooks that China and other
Asian countries are on the verge of taking over the leading technical role in the world. Books
like “AI Superpowers: China, Silicon Valley, and the New World Order” by Kai-Fu Lee, men-
tion Europe only at the margin.11 That alone should be enough to make people stop and
think.
Example: Armenia
The example of Armenia shows how politics can exert influence. Some call it the
“Silicon Valley of the former Soviet Union.”12 Others refer to the capital Yerevan
as “Silicon Mountain,” which is listed by the Enterprise Times as “the best
non-American hub for an IT career after San Francisco, Austin, and New York.”13
Armenia was also named Country of the Year 2018 by the Economist.14
So how does a country previously perceived as a former Soviet republic
sandwiched between two hostile states manage to emerge as a potential new
AI supernation with the most significant venture capital outside the US?
There are three reasons for this:
Diaspora: Numerous Armenians have emigrated worldwide due to the 1915
genocide, many to the United States. California companies run by diaspora
Armenians are setting up branches in Armenia.
Education: Armenia invests in its youth. The showcase model here is Tumo15,
a training center for teenagers. Eighteen-year-old serial entrepreneurs are not
uncommon in Armenia.
Geography: Without mineral resources and access to the sea, the country is
mainly dependent on the service sector.
Armenia is not the only country to create a thriving start-up culture. With low
taxation, easy and fast migration for qualified engineers (via Digital Nomad
Programs), and other incentives, many countries make innovation hubs where
new unicorn companies can be born. Furthermore, economists even claim that
some countries sabotage themselves with restrictive migration policies for
highly qualified engineers.
Policymakers can invest in training, reduce bureaucracy and, most importantly, create
framework conditions that make it easier for people to start up.
Companies themselves often follow a pets or cattle strategy. The word “cattle” means re-
placeable animals. A farmer may lament the death of a cow as a loss of income, but he
doesn’t care about the cow itself. Pets, on the other hand, have value to the owner. So when
Buddy, the family dog, dies, it can be a family tragedy, and family members will miss the old
dog, even if they get at some point a new dog.
11
Lee, K.-F.: AI Superpowers: China, Silicon Valley, and the New World Order. Houghton Mifflin Harcourt, 2018
12
https://thenextweb.com/asia/2017/03/17/armenias-rising-tech-scene-new-silicon-valley-former-soviet-union/
13
https://www.enterprisetimes.co.uk/2018/10/25/10-best-cities-for-tech-career-opportunities/
14
https://www.economist.com/leaders/2018/12/22/the-economists-country-of-the-year-2018
15
www.tumo.org
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
26.2 Data Science Culture 785
In a “corporate cattle strategy,” employees mean turnover. The management keeps costs low
and sees employees just as a resource to make a profit. In extreme cases, cattle-strategy
companies deduct bathroom or cigarette breaks from employees’ work hours. If an efficient
employee quits, you lament the loss of revenue but not the person’s departure.
In contrast, the pets culture means that employees are the focus. A company invests in
them, treats them well, and makes sure they stay motivated. Of course, every profit-oriented
company expects to make money with the employee. Still, the company understands that
only a happy employee is the most profitable employee in the long run.
But some companies are still in the Stone Age when it comes to employee management. It
is precisely these companies that wonder why they cannot fill any data scientist positions.
What we can learn from Google and Co. is to promote high potentials by all means. Above all
else, it’s about recruiting people with the right mindset.
“People don’t buy WHAT you do; they buy WHY you do it.” – Simon Sinek
The “Work Hard, Play Hard” culture means that people are willing to do what it takes to
achieve a shared vision because they are passionate about the purpose and culture of the
company. It is about employees believing both in an idea and also in the “why.”16 In other
words, the most passionate employees are the most profitable.
If you want to dig deeper into the topic of company culture, look at how Spotify, Google, and
others have built their company culture and consider what actions could move your com-
pany forward.
It is not always easy for European companies to jump over their shadows. Too
many security experts impose regulations that, strictly speaking, overshoot the
mark, even if data protection itself is welcome. Where there is light, there is
shadow. Particularly in ambitious countries, corporate entities often ignore
ethics. Europeans are a lot more concerned about civil rights than in other
countries, which has certainly worked in our favor in other areas in the past
(take the history of environmental protection as an example).
Another concern of many is a culture of “mediocracy” in some European
countries. For example, many Europeans expect to work from 9 to 5 without
any overtime and still be competitive, while in Asia, some ambitious employees
work 80 hours and more.
What remains is that Europeans have to face the facts. AI is here, and Europe
can stand on the sidelines as a spectator and leave the development to others
because it is scared to death. But Europeans can also actively shape the future;
they can face the pros and cons of digitization and data science and drive the
development of data science and the data landscape. Here, it makes sense to
think a few dimensions bigger in Europe as well.
16
Sinek, S.: Start with Why. Portfolio, 2009
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
786 26 Mindset and Community
■ 26.3 Antipatterns
Antipatterns are behavioral patterns that lead to the opposite of the desired state; in other
words, they are ways of thinking that can cause data projects to fail. However, they all have
one thing in common: leaders can change them by establishing an alternative corporate
culture.
Some companies consider Data Science to be exclusively a technical topic. The management
expects that if you put data and algorithms in, you can just harvest results.
Let us assume a company plans to become data-driven, having learned from strategy consul-
tants that this is the only way they could survive in the long run. The strategy consultants
also envision a bright future for the company as a data-driven enterprise. Management,
therefore, hires highly paid data scientists and expects miracles from them. Management
calls them the company’s future; they get top-notch equipment, a workplace in a new build-
ing block, and access to all the data they need.
Imagine that among the company employees is the archetypical long-term employee: a sub-
ject matter expert (SME), who is sometimes described by colleagues as assertive and maybe
even annoying, but who has been working for the company for decades and knows all pro-
cesses in and out. During the assessment by the strategy consultants, this employee had
pointed out various details that contradict the vision of a bright future and fat bonuses,
which the strategy consultant had promised the management board. It’s no coincidence,
then, that this employee’s workplace is in the old building, far away from the newly formed
all-star data scientist team. Management considers the employee’s skepticism a danger and
believes their pessimistic attitude could unsettle the remaining project members. After all,
the point is to create something new.
“It was imperative to me to keep all the naysayers away from the project team. We finally
want to do something new without all those people telling us it will never work, just because
they do not believe in it.” – Department head of a financial company that later struggled
with its digital transformation.
The all-star data science team presents the first use cases they want to incorporate into the
company’s DNA. The slide deck is perfect, and the board loves it. The team also receives
much support from the strategy consultants who envisioned the transition to a data-driven
enterprise in the first place. Management calls the plan a fantastic approach that exceeds
their initial expectations. They even accept that they need to invest a bit more than initially
planned to create the company’s new future.
Months later, everyone is devastated. The data scientists’ ideas do not prove to be practica-
ble. Finally, in retrospective sessions, the domain expert can present their views; they can
explain, even without mathematics, why the plans of the all-star team have no practical
value. It becomes apparent that, had they had the SME’s input all along, they might have
explored more profitable options. It is a weak consolidation that the consultants’ newly
proposed strategy now contains some of the SME’s ideas, rebranded in their terminology.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
26.3 Antipatterns 787
Every company should always involve SMEs. Excluding those who know the details of the
operative business is like flying blind over unknown terrain. The danger of crashing is great
because even the most stable aircraft will not stand up to an unmapped rock.
This book already mentioned that business departments should own data science initiatives
and integrate IT as enablers. It is helpful to detail why the IT department should not take
care of data science.
IT must ensure that employees can perform their tasks to the best of their ability using
electronic processing. System uptime is an essential part of IT performance reports. If too
many users experience troubles using their day-to-day business applications, an IT depart-
ment may receive a negative rating. The best way for IT departments to avoid problems is to
limit access and available functionality. If the dumbest assumable users (DAU) have fewer
options to make mistakes, IT will have to fix fewer issues. The safest method is not just to
restrict access to applications but also their visibility. In the end, according to the Dun-
ning-Kruger effect, DAUs might be quite self-confident and ask for permission to use avail-
able applications if they see them in a software catalog.
Proof of Concepts (PoCs) as part of a data science project can cause conflicts. Data Scientists
need access to as much data as possible for their research. However, the strategy of IT
departments is to limit access to systems and data as much as possible to avoid potential
problems. In the worst case, the data science team may not even realize that the data they
need exists within the company.
One way around this conflict is the so-called ‘bi-modal IT.’ In this system, there is one IT
team whose task is to maintain operations, and there is another IT team that gives users as
much freedom as possible in a corporate lab. In addition, one way of providing legal protec-
tion is to use NDAs (Non-Disclosure Agreements), in which employees who work with
sensitive data agree to maintain confidentiality. In the banking environment, for example,
this has been the standard for a long time.
Not all employees always benefit from changes. Employees often see managers as people
who “only” manage resources and are interested in profits. So if they sell the idea to move a
company into a new era, employees fear that working in that new era might not include
them. Older employees, in particular, fear being put on the losing side by transformations.
Moreover, if business leaders communicate possible innovations poorly, then employees’
interest groups may run up a storm against the innovations. In some extreme cases, employ-
ees might even secretly sabotage changes in a company. There are many examples in history
where corporate transitions failed because the employees did not trust their bosses.
In the end, everything depends on the leadership skills of those who introduce change. It is
about creating the awareness that a company is not viable in the long run without change.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
788 26 Mindset and Community
Change is part of business, and fear of change is deadly. There are many books on what
makes leaders great, and we cannot cover every value. However, one value that most people
agree with is that authenticity is essential for every leader as we tend to trust people who
honestly address any topic.
If a leader presents a new corporate strategy before their employees and doubts its success,
how should they persuade the skeptics, who are afraid of losing their jobs? Many have seen
this scenario before: A consultant presents a brand new strategy to the company. The con-
sultant is fully confident, saying that everything needs to change and the future will be
great. However, when a business leader tries to present that vision to an audience, they
suddenly feel less confident once they realize they are also accountable for its success. It is
always easy to promise great things unless you risk losing your job in the event of failure.
In most organizations, change also does not happen overnight. So instead of trying to enforce
change, leaders should provide a strategy for a gradual transition that matches the corpo-
rate DNA.
In many companies, there is the stereotype of the “all-knowing expert.” They have already
seen everything; you don’t need to tell them anything. Everything that springs from their
thoughts must be reasonable and will surely solve all problems. Other ideas are considered
nonsense to this ‘expert’; they do not allow different opinions.
“Know-it-alls” can completely ruin data projects. The data world is complex, and technolo-
gies change quickly. No one can understand everything. The real gold for a company is
young, high-potentials who are learning new technologies. If guided by the right leaders,
these people will spread their wings; if managed by a know-it-all, they might be afraid to
speak up, as know-it-alls often cultivate a “fear culture.”
There are many situations in which money is wasted by designing platforms that originate
from a know-it-all. Worse, companies often throw money down the drain to keep a faulty
architecture from such an employee alive. Often it is too hard to admit that everyone trusted
a self-righteous person who had convinced them they were right. Unfortunately, history is
full of horrible examples of charismatic but flawed leaders who did not doubt for a second
that they were right and who took everything with them in their eventual downfall. In the
end, we tend to believe those who seem to be convinced. Unfortunately, according to those
who study the Dunning-Kruger effect, it is not always the intelligent people who have the
most substantial confidence.17
“It doesn't make sense to hire smart people and then tell them what to do; we hire smart
people, so they can tell us what to do.” – Steve Jobs
You can establish a values-driven culture to look after talented, introverted employees if you
run a company. Once you empower the right people, your company will succeed.
17
https://www.youtube.com/watch?v=wvVPdyYeaQU
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
26.3 Antipatterns 789
This antipattern overlaps with what we presented under “Resistance to change.” However,
there, we focussed more on internal resistance. In this antipattern, we talk about those
external consultants that make money by making you afraid.
Many consultants make a living out of making systems secure. And those who make a living
from it do well to be able to point out dangers.
People who see doom everywhere are as dangerous as over-enthusiasts who ignore every
risk. Big Data is a naysayer’s paradise. There are countless risks that a pessimist can bring
up, such as
“Chinese hackers will steal your data.”
“Employees will sabotage every change.”
“Customers do not want any change at all.”
“It’s expensive and costs will explode.”
The tricky thing is that it is always easier to follow the recommendations of doomsayers
than optimists. If you do what alarmists say, it seems you cannot fail. If you ignore warnings
and a proposal for change fails, you might have to explain your past decisions. There is less
immediate risk involved in most cases if one listens to those who warn of dangers.
Although we should not become reckless, we should always be aware that the most signifi-
cant risk is stagnation. And stagnation is, in the end, what the naysayers propose.
26.3.6 Penny-pinching
Many IT managers want the most modern data platform. Because it’s open-source, of course,
it has to be the latest version to stay ahead. The teams are even encouraged to install beta
versions to remain at the forefront of innovation.
At the same time, however, cost-awareness is practiced like a religion. No money is spent on
external consultants because it seems cheaper if the company’s employees familiarize
themselves with the new system. HR disregards highly qualified applicants who do not
want to sell themselves short. The few underpaid company’s employees can watch training
videos if they run out of work, but there is no budget to send them to a conference where
they could meet other skilled professionals. There is also no investment in hardware, even
though the developers complain that the existing servers have too little RAM and the pro-
cesses break down.
“We don't need anything new as long as the old works somehow.” – The famous adage of a
post-war generation.
Unfortunately, the truth is that trying to cut corners can be expensive. A backend developer
with essential database experience doesn’t become a distributed computing expert juggling
petabytes of data overnight. Without incentives, many employees will work from 9 to 5, but
not go the extra mile. In the end, a culture of mediocrity is the death of innovation.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
790 26 Mindset and Community
We often get to know a culture of fear already in school. For example, teachers try to find
mistakes instead of empowering students to try out new things. In a culture where students
learn to go the safe path in order to avoid getting bad grades, we create an environment
where people believe it is always better to play safe. When they start to work as employees,
they are intimidated as their superiors could behave like teachers and grade them by their
number of mistakes. Therefore, they remain passive and just follow instructions instead of
making decisions on their own.
This mentality is poison for any creative environment. Companies that live in this culture
will not recruit the necessary talent to shift to a Data-Driven Enterprise. So-called high po-
tentials can usually pick and choose their employers, and correspondingly, few stray into a
“stone age” work culture. Companies with a culture of fear also harm themselves, as good
employees leave again. It is only a matter of time before these “fear companies” run into
serious problems.
In a blog post in BigData Insider, someone asked whether digitization would fail due to a
lack of human resources.18 Companies seem to struggle to acquire the necessary profession-
als to carry out their projects. Universities have reacted and are offering data science
courses, but these can only alleviate the scarcities; they cannot eliminate them.
For example, company X has made the strategic decision to digitize, and it is looking for
employees to support them in implementing the digitization strategy. However, the head-
quarters of company X is in an area where hardly any professionals are available, so they
are also open to looking for people outside of their region.
Many of us remember the famous 5/4/3 mode (five days of work, four days on-site, and
three nights sleeping away) and know that it doesn’t solve the problem (although this is
beginning to change with Covid-19). An initial on-site presence is helpful to build relation-
ships from both a professional and social point of view. However, once you are committed to
each other, the rest of the collaboration is a leadership issue. Well-managed, ambitious
people also take their job seriously remotely and will not abuse the trust placed in them.
The fascinating thing about this is that remote work allows you to involve experts distrib-
uted around the world who would otherwise not be available. In addition, experts from some
countries are also more cost-effective.
There are many examples where a flexible working model leads to success. However, it has
always been leadership and the team spirit, which has developed over time, that has spurred
people on to high performance: it was never managerial control.
18
www.bigdata-insider.de/scheitert-die-digitalisierung-am-big-data-fachkraeftemangel-a-540015/
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
26.3 Antipatterns 791
Many professionals dream of becoming Data Scientists or AI engineers. They see the poten-
tial to receive huge salaries, change the world, learn about secrets, and have a job with a
high reputation.
Even those who do not have the education or skills to work as data scientists got excited. In
blogs, consultants explain the fast track to becoming a Data Scientist. First, an aspirant has
to learn to use the correct expressions, have the right contacts on LinkedIn, and then stick
to the principle of “fake it until you make it” in the first job.
Your HR has finally recruited the new Data Scientist to bring the company into
the 21st century. Perhaps this new employee fits the perfect image of a nerd;
they know all the science fiction series, wear thick glasses, have a somewhat
daring hairstyle, and, of course, wear a T-shirt with geek slogans every day
(the extreme stereotype says this T-shirt might not even change as often as
hygienically recommended!). When talking to you, the new employee raves
about Machine Learning, AI and proclaims what they have already done . . .
You may not always understand it, but it seems like they know their stuff. Even
if the HR manager finds fault with the new recruit’s people skills, you quickly
agree that nobody recruited them for their charm.
Fast-forward to a few weeks later. Your super data scientist, unfortunately, does
not meet the expectations placed upon them. Although they explain to other
employees all the things that need to be changed, the HR manager realizes that
the recruited “rock star” is unfortunately not a “completer.”
One of your programmers has reported significant deficiencies in the Data
Scientist’s Python knowledge, and an inability to solve even simple problems
with SSH. Unfortunately, your new mastermind also seems to lack basic
statistical knowledge.
You start doing a little research. Then, after getting feedback from former
employers, you pray for the sake of peace in the team that the new recruit
didn’t brag about their high salary to colleagues, as this would most definitely
completely disrupt the group.
We advise all hiring managers to prepare job interviews carefully. Ask questions that go
into depth. Don’t blindly trust that your candidates know everything, even if they appear
confident. Explaining what products are on the market is easy; describing what they do in
detail is another matter.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
792 26 Mindset and Community
We all know that excellent resources are scarce. For some companies, it takes multiple
months to fill various technical roles. Above, we described the “Know it all” as a self-
righteous person who believes they can solve all problems. The “Swiss Army Knife” is
different; they are a multi-talented individual who can solve many things.
A typical Swiss Army Knife does not want to spend too much time in meetings. But, as a
solution-oriented person, she gets things done. She fits into various roles, and whatever task
you give her, she will solve it. She is also naturally curious to explore things she has never
done before.
Swiss Army Knives often have no problem overworking. Some of them define themselves
through work, and they work 60 to 80 hours a week voluntarily. Moreover, many leaders
have hit the jackpot with a swiss army knife that is not aware of their value. Rumors say
there are even swiss army knives that earn clearly below the average salary.
Even if the swiss army knife might seem obsessed with creating a perfect solution and if she
seems to be perfectly loyal, like every employee, there is always a risk of losing her. For
example, imagine you have to build a data platform. This swiss army knife is a data archi-
tect, security engineer, data manager, and systems architect in one person. Now imagine
losing her.
Be also aware that those engineers around the swiss army knife might have slowed down as
there was always someone who, in the end, solved all problems.
26.3.11 Over-Engineering
Sometimes a local relational database is enough to explore a company’s data. But, unfortu-
nately, many companies get carried away and invest in technologies that take them way over
the top. If you buy a Big Data solution such as Hadoop for large, unstructured data sets and
then try to shovel, say, manageable data sets into it, you’re acting counterproductively. It is
more important to know the exact requirements, understand what you need and want, and
invest in the appropriate solution.
It’s always good to learn from others. Talk to experts who have already worked
on digital data science projects and understand what went well and what went
badly.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
26.4 In a Nutshell 793
■ 26.4 In a Nutshell
Everything is changing.
Digitalization and data science are reshuffling the cards in companies. AI is
changing the way we work. Refusing to embrace this trend would have the
same consequences as if a company had denied the Internet in the 1990s. For
a successful data strategy, the mindset of the company must be right. Without
a vision and the courage to change, initiatives fail even with the best tools and
the most brilliant data scientists.
Corporate culture is essential.
Without a suitable corporate culture, every company will fail to introduce data
science. Good leadership is necessary to prepare the company for a digital
transformation.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
27
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
Trustworthy AI
Rania Wazir
What is the current hard-law and soft-law framework for trustworthy AI,
especially in the EU?
Who are the possible AI stakeholders?
What is fairness in AI, and how is bias defined?
What are different metrics for measuring the fairness impacts of algorithms?
What are possible techniques for mitigating unwanted bias?
How can data and models be documented to improve transparency, usability,
and trust?
What are current methods for explaining model decisions?
The broad class of technologies that fall under the umbrella of AI – from expert systems
to machine learning driven solutions and data science applications – are revolutionizing
industry, pervading most sectors of the economy and beyond, and have the potential to
benefit the economy, society, and the environment. However, as has come to light in recent
years, these technologies also come with risks1,2,3. Public skepticism has been rising, as
examples of stereotyping and discrimination, concerns over worker’s rights, and detrimen-
tal impact on democratic principles and the environment have been exposed. In order for AI
technologies to continue enjoying rapidly growing adoption and realize their beneficial
potential, there will be increasing demand for AI-based systems that can be trusted. For AI
system providers, this trust translates into increased uptake of products where it is present,
and to legal and reputational harms where this trust is breached. In the chapter that follows,
we will explore in practice what trust in AI systems means, in particular in the context of
machine learning and data science solutions; who are the stakeholders that need to be con-
sidered; and some practical implementation steps that can guide the development process.
1
O’Neil, C., Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, Broadway
Books, 2017.
2
Kate Crawford, AI Now Report 2019
3
Fundamental Rights Agency of the EU (FRA), Getting the Future Right
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
796 27 Trustworthy AI
Our task will be to try to weave the many disparate requirements together, to create a coher-
ent picture that can accompany the AI system development process from start to finish. We
start with the legal and soft-law framework, looking at prominent ethics guidelines, and
existing and upcoming regulations and standards. Trust will mean different things to differ-
ent AI stakeholders – and it is important to identify the various stakeholders involved with
an AI system in order to ensure its trustworthiness; we therefore take a brief detour into AI
stakeholder identification, before focussing on the issues of fairness in AI, and explainabil-
ity. This chapter can make no claim to completeness, but aims rather to deliver some guid-
ance to AI system providers and/or users who wish to create/deploy products that can be
trusted.
4
Jobin, Anna, Marcello Ienca, and Effy Vayena. “The global landscape of AI ethics guidelines.” Nature Machine
Intelligence 1.9 (2019): 389–399.
5
Jobin, Anna, Marcello Ienca, and Effy Vayena. “The global landscape of AI ethics guidelines.” Nature Machine
Intelligence 1.9 (2019): 389–399.
6
https://www.oecd.ai/ai-principles
7
https://unesdoc.unesco.org/ark:/48223/pf0000373434
8
https://www.unicef.org/globalinsight/reports/policy-guidance-ai-children
9
https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai
10
https://ec.europa.eu/info/publications/white-paper-artificial-intelligence-european-approach-excellence-and-
trust_en
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
27.1 Legal and Soft-Law Framework 797
ance for fostering the development of Trustworthy AI. They propose the following 5 main
principles11:
1. Inclusive growth, sustainable development and well-being. Poses a general require-
ment for Trustworthy AI to be beneficial: enhancing human capabilities, reducing
inequalities, and protecting the environment.
2. Human-centred values and fairness. A Trustworthy AI needs to respect rule of law and
human rights, including the right to freedom, the right to dignity and autonomy, the
right to privacy and data protection, and the right to non-discrimination.
3. Transparency and explainability. Requires responsible disclosure of information
about the AI system, in order to foster general understanding of such systems; make
stakeholders aware of their interactions with an AI system; and allow those affected by
an AI system to understand and contest its outputs.
4. Robustness, security and safety. Entails traceability for datasets, processes and deci-
sions; as well as appropriate risk managment measures to address risks such as safety,
IT security, privacy, and bias, during each phase of the AI system lifecycle.
5. Accountability. All actors involved in developing, deploying or operating AI systems, in
accordance with their role, should be held accountable for the proper functioning of the
AI systems, including ensuring that the above requirements are met.
The EU High Level Expert Group on AI has an even more extensive list of requirements for
a Trustworthy AI, this one addressed to AI system developers, providers, and users.12 A
Trustworthy AI needs to be legal, ethical, and robust, and should satisfy the following re-
quirements:
1. Human agency and oversight. Including fundamental rights, human agency and
human oversight.
2. Technical robustness and safety. Including resilience to attack and security, fall back
plan and general safety, accuracy, reliability and reproducibility.
3. Privacy and data governance. Including respect for privacy, quality and integrity of
data, and access to data.
4. Transparency. Including traceability, explainability and communication.
5. Diversity, non-discrimination and fairness. Including the avoidance of unfair bias,
accessibility and universal design, and stakeholder participation.
6. Societal and environmental wellbeing. Including sustainability and environmental
friendliness, social impact, society and democracy.
7. Accountability. Including auditability, minimisation and reporting of negative impact,
trade-offs and redress.
The HLEG Guidance is perhaps one of the most practical set of guidelines available so far. It
provides a clear understanding of the reasoning behind the requirements, and information
on how to implement them in practice. Based on the guidelines, the group also developed
11
https://www.oecd.ai/ai-principles
12
High Level Expert Group on Artificial Intelligence set up by the European Commission, “Ethics Guidelines for
Trustworthy AI”, April 2019, p.14. Accessed from https://digital-strategy.ec.europa.eu/en/library/ethics-
guidelines-trustworthy-ai
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
798 27 Trustworthy AI
the Assessment List for Trustworthy AI (ALTAI)13, a tool to help AI system providers, devel-
opers, and users assess the extent to which their AI system satisfies the seven requirements
for a trustworthy AI.
27.1.1 Standards
The path from guidelines to practical implementation is long, and regulation and interna-
tional standards are necessary stepping stones. Several international standards organiza-
tions are actively involved in creating the necessary standards for ensuring Trustworthy AI:
IEEE Ethically Aligned Design: https://ethicsinaction.ieee.org/#series. The IEEE has its
own set of ethical guidelines, covering almost 300 pages14. This is supplemented by the
7000 Series of Standards, specifying specific aspects of ethical AI. The first two to be
published cover general principles of ethical design, and specifications for measuring the
human well-being impacts of autonomous and intelligent systems.
ISO/IEC Standards on AI and Trustworthy AI: https://www.iso.org/committee/6794475.
html. ISO and IEC have established a joint committee to address artificial intelligence.
Several standards and technical reports have already been published, and many more are
in the pipeline. In particular, the recently published ISO/IEC TR 24028: Overview of trust-
worthiness in artificial intelligence15 provides an overview of reqirements and pitfalls in
developing and deploying a trustworthy AI system, and can be seen as a roadmap for
upcoming standards specifications.
NIST Standards for Trustworthy and Responsible AI: https://www.nist.gov/programs-
projects/trustworthy-and-responsible-ai. NIST’s project includes standards for several key
aspects of Trustworthy AI, including most recently a draft publication on mitigating
harmful bias16, as well as previously published standards on explainability and security.
CEN-CENELEC Committee on Artificial Intelligence: https://www.cencenelec.eu/areas-
of-work/cen-cenelec-topics/artificial-intelligence/. CEN and CENELEC have established the
new joint committee in response to the EC White Paper on AI and the German Standard-
ization Roadmap for Artificial Intelligence17.
27.1.2 Regulations
In particular in the EU, there has been a push to develop a digital strategy that goes beyond
guidelines, and imposes some regulation on the AI industry. The first piece of legislation in
this direction came with the General Data Protection Regulation (GDPR), which came into
force in 2018. Other regulations are in the pipeline – for example, the Digital Services Act
(DSA) and the Digital Markets Act (DMA), whose goal is to reduce the “Gatekeeper” effect of
13
https://altai.insight-centre.org/
14
https://ethicsinaction.ieee.org/#ead1e
15
https://www.iso.org/standard/77608.html
16
https://doi.org/10.6028/NIST.SP.1270-draft
17
https://www.din.de/en/innovation-and-research/artificial-intelligence
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
27.1 Legal and Soft-Law Framework 799
very large online platforms, and give users and consumers more transparency and choice
vis a vis these platforms (DSA), and enable smaller players to enter and compete within the
platform economy (DMA). However, while these regulations have elements with direct im-
plications for data collection and AI system transparency, the core regulation addressed to
AI is the EU AI Act, which came out in draft form in April 2021.
EU Digital Strategy: https://ec.europa.eu/info/strategy/priorities-2019-2024/europe-fit-
digital-age_en
GDPR: https://ec.europa.eu/info/law/law-topic/data-protection/data-protection-eu_en
DSA: https://ec.europa.eu/info/strategy/priorities-2019-2024/europe-fit-digital-age/digital-
services-act-ensuring-safe-and-accountable-online-environment_en
DMA: https://ec.europa.eu/info/strategy/priorities-2019-2024/europe-fit-digital-age/digital-
markets-act-ensuring-fair-and-open-digital-markets_en
EU Draft AI Act: https://digital-strategy.ec.europa.eu/en/policies/european-approach-
artificial-intelligence
The draft AI Act addresses any AI systems being placed on the market, or put into use,
within the EU. It takes a risk-based approach to regulating AI, where risk does not just
entail physical or psychological harms, but also risks to fundamental rights. For the scope
of the regulation, the draft AI Act makes an intentionally broad definition of AI, and in-
cludes many algorithms whose inclusion as “AI” has triggered hot debate: not just machine
learning algorithms, but also logic-based methods and expert systems, statistical and Bayes-
ian techniques, optimization and search. The full listing is available in Annex I of the draft
AI Act.
The AI Act identifies four types of application which are prohibited, involving subliminal
manipulation, social scoring, and facial recognition:
AI systems that manipulate people and can lead them to behave in ways that are physi-
cally or psychologically damaging to themselves or to others.
AI systems that take advantage of vulnerabilities of particular groups, due to their age or
a mental or physical handicap, and can lead to behaviour that is physically or psycholog-
ically harmful to themselves or to others.
Social scoring by public authorities
The use of real-time remote biometric identification systems in publicly accessible spaces
for law enforcement purposes (however, this prohibition comes with several exceptions).
The main substance of the proposed regulation is, however, intended for high-risk applica-
tions. These are identified in Annex II – which includes a list of applications already subject
to sectoral regulation, and where the act imposes additional obligations – and Annex III,
which indicates eight new areas of application, with specific use cases within each area
identified as being high risk. Annex II includes, among others, AI systems used in toys,
machinery, medical devices, aviation, motor vehicles, and other forms of transport. The
areas of application listed in Annex III are:
1. Biometric identification and categorisation of natural persons
2. Management and operation of critical infrastructure
3. Education and vocational training
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
800 27 Trustworthy AI
■ 27.2 AI Stakeholders
AI Systems are embedded in complex ecosystems involving a broad range of actors. Under-
standing risks of bias, and how to mitigate them, involves getting a grasp on the various
stakeholders, their roles, and their needs. The following list can serve as a guide, but is by
no means exhaustive.
Data provider: organization/person that collects, processes, and delivers the data used
by the AI provider.
AI provider: organization/person that develops AI systems. Within the organization,
specific additional roles can be identified.
Management and Board
Legal department/Corporate responsibility department
Data Protection Officer
System Architects, Data Engineers
Developers, Machine Learning Engineers, Data Scientists
Quality Assurance
AI user: organization/person that deploys an AI system. Within the organization, specific
additional roles can be identified.
Management and Board
Legal department/Corporate responsibility department
Quality Assurance
Data Protection Officer
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
27.3 Fairness in AI 801
■ 27.3 Fairness in AI
What is a fair algorithm? According to the Oxford English Dictionary:
Fairness: Impartial and just treatment or behaviour without favouritism or discrimination.
This definition is not yet actionable -- in order to determine if an AI system is fair, the con-
cept needs to somehow be quantified. However, fairness is a social construct, and is depen-
dent on context and cultural/societal norms. This has led to the creation of many different
definitions of fairness (21 and counting18), each with its own mathematical formulation
(fairness metric) – as will be described below. To add to the confusion, the terms unfair
algorithm and biased algorithm are often used interchangeably.
Bias (Oxford English Dictionary): Inclination or prejudice for or against one person or
group, especially in a way considered to be unfair.
1.1 A concentration on or interest in one particular area or subject.
1.2 A systematic distortion of a statistical result due to a factor not allowed for in its deriva-
tion.
This conflation between unfair and biased may seem natural when considering the main
definition of bias. Nonetheless, it is important to consider that any classification model must
have bias in order to work. Take, for example, a classifier that has to differentiate between
pictures of mammals and of birds. It needs to have a bias towards labelling pictures of ani-
mals with wings as birds. Instead, if it were completely free of bias, it would not be able to
make any distinction at all, and would place all objects in the same category. So the first
clarification is necessary: algorithms need to avoid unwanted bias – bias which is based on
18
Verma, S. and Rubin, J., (2018), “Fairness Definitions Explained”, Proceedings of the International Workshop on
Software Fairness (FairWare), pp. 1–7.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
802 27 Trustworthy AI
some protected characteristic or some spurious correlation, and which is not relevant to the
task at hand.
Furthermore, within the engineering and statistics communities, a certain kind of un-
wanted bias already exists: bias according to Definition 1.2 (statistical bias). This often leads
to confusion and misunderstanding when discussing bias in machine learning: simply put,
an algorithm that is “fair” might still have statistical bias, while at the same time, a system
that is free of statistical bias might still be unfair.
The crux of the issue lies within the definition: “a systematic distortion of a statistical re-
sult” implies that a “ground truth” (or “true value”) is known so that a systematic distortion
can be detected by comparison. But what is this “ground truth”? If, as has traditionally been
the case, this is the current population parameter value, then it should come as no surprise
that, for example, a hiring algorithm for an engineering position trained on historical em-
ployment data, would disfavor women precisely because it accurately reflected the status
quo (and hence, had no statistical bias). This is not just a mere hypothesis -- consider the
case of Amazon’s ditched machine learning driven recruiting tool19. Conversely, in trying to
achieve greater gender equity and be “fair”, it could be deemed necessary to introduce
statistical bias into the algorithm. Of course, this contradiction between statistical bias and
fairness might not arise if “ground truth” were taken to be some idealized goal (i.e. the ideal
gender distribution of engineering employees). However, this is a controversial issue; and
changing the terminology would still leave unresolved the fundamental problem of what the
ideal distribution should be. For this reason, many current fairness metrics avoid the use of
a “ground truth” as a reference parameter.
In order to avoid confusion, in this chapter, we will use bias to describe inputs to, or proper-
ties of, a machine learning model (or more generally, an AI system). Fairness, on the other
hand, will be used to describe the impact of model-based outputs or predictions on various
protected demographics. This is also consistent with a growing body of literature, which
tries to identify and mitigate sources of bias in AI systems, and uses fairness metrics to
evaluate model effects.
27.3.1 Bias
Bias can come in many forms, and can enter the machine learning and data science life
cycles at various stages. To identify the four main stages:
1. The bias may be in the training or test data. Having large amounts of data does not auto-
matically absolve data collectors from the traditional statistical data errors. Sampling
bias, selection bias, and non-response bias are just some of the main traps that data holds
for the unaware. However, as the above example of training a hiring algorithm by using
historical data shows – even if the procedure for procuring the data was correct statisti-
cally, the data could still be biased because of embedded human biases. The hiring data
used to train the algorithm might accurately reflect the status quo – and thus encode and
perpetuate the current societal bias against women in engineering. Word embeddings
19
J. Dastin, (2018), ‘Amazon scraps secret AI recruiting tool that showed bias against women’, Reuters, 11 October
2018.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
27.3 Fairness in AI 803
and language models are another example of such kinds of bias – the text used to train
these models is full of societal biases, so that the word embeddings reflect not just gen-
eral semantic patterns, but also gender20 and ethnic21 stereotypes and prejudices.
2. Bias can also enter the system when designing the algorithm – for example, a classifica-
tion system could be biased because of the categories it is designed to select (black/
white, male/female22); biases could arise in feature engineering (some features might be
more predictive for some groups than for others, and selecting features based on overall
accuracy could cause the model to perform worse for some groups), or in the choice of
algorithm to use (for example, algorithms that are too simple can underfit the data, and
lead to bias in the models). A particularly insidious form of bias can enter the algorithm
design when attempting to model a concept that is not fully quantifiable – for example,
in a university admissions setting, using records of previously admitted students to train
a model for detecting successful candidates to a Ph.D. program23 (in fact, this simply
models previous admissions committees preferences and biases); or in a hospital care
management setting, using health care costs as a proxy for severity of the illness to be
treated24.
3. Biases can also enter the system post-hoc, for example, in the interpretation of the model
results. Alternatively, decisions based on model predictions could affect data that is then
fed back into an online learning algorithm, causing the formation of runaway feedback
loops25, and amplifying existing biases in the data or the model.
4. Finally, deployment is also prone to bias: from temporal drift, to inappropriate use (in a
context different from the intended one), and from adversarial attacks (consider, for ex-
ample, Microsft’s infamous Chatbot Tay26), to selective deployment (for example, using
predictive models to determine grades for children in larger classes, but using human
evaluation to determine grades for children in smaller classes27).
While it is not possible to list all possible kinds of bias that can become implicated in a
machine learning model, we briefly describe below some of the more common forms of
bias28.
20
Bolukbasi, T., Chang, K.-W., Zou, J., Saligramma, V., Kalai, A., (2016), ‘Man is to computer programmer as woman
is to homemaker? debiasing word embeddings’, Proceedings of the 30th International Conference on Neural
Information Processing Systems, NIPS 2016, pp. 4356–4364.
21
Manzini, T., Yao Chong, L., Black, A. W., Tsvetkov, Y., (2019), ‘Black is to Criminal as Caucasian is to Police: Detecting
and Removing Multiclass Bias in Word Embeddings’, Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, pp. 615–621.
22
Leufer, D., (2021), ‘Computers are binary, people are not: how AI systems undermine LGBTQ identity’, Access
Now, April 2021.
23
Burke, L., (2020), U of Texas will stop using controversial algorithm to evaluate Ph.D. applicants, Inside Higher Ed,
14 December 2020.
24
Obermeyer, Z., Powers, B., Vogeli, C., Mullainathan, S., (2019), ‘Dissecting racial bias in an algorithm used to
manage the health of populations’, Science, Vol. 366, pp. 447–453.
25
Ensign, D., Friedler, S. A., Neville, S., Scheidegger, C., and Venkatasubramanian, S. (2018), ’Runaway feedback
loops in predictive policing’, Proceedings of the 1st Conference on Fairness, Accountability and Transparency,
PMLR, Vol. 81, pp. 160–171.
26
The Guardian (2016), ‘Microsoft 'deeply sorry’ for racist and sexist tweets by AI chatbot’, 26 March 2016.
27
Elbanna, A., Engesmo, J., (2020), ‘A-level results: why algorithms get things so wrong – and what we can do to fix
them’, The Conversation, August 19, 2020.
28
Suresh, H., and Guttag, J., (2021), ‘A Framework for Understanding Sources of Harm throughout the Machine
Learning Life Cycle’, arXiv preprint, https://arxiv.org/pdf/1901.10002.pdf
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
804 27 Trustworthy AI
Human Cognitive Bias: Any kind of bias that can occur when humans are processing
and interpreting information
Societal Bias: Biases and prejudices that arise from a social, cultural, or historical con-
text
Confirmation Bias: A tendency to accept model predictions that are consonant with ones
pre-existing beliefs
Group Attribution Bias: Occurs when it is assumed that what is true for an individual in
a group is also true for everyone in that group.
Automation Bias: A tendency to over-rely on outputs from a predictive model
Temporal Bias: Bias that arises from not taking into account differences in the observed/
measured quantities over time
Sampling Bias: Occurs when data is not sampled randomly from the intended popula-
tion, so that some individuals are more likely to be included in the sample than others.
Representation Bias: Arises when individuals or groups in a study differ systematically
from the population of interest. While this can include the case of sampling bias, it is a
broader concept. For example, even if data is sampled randomly from the overall popula-
tion, the sample sizes, or data quality, for certain subgroups can be low, leading to results
that do not generalize well to those subgroups.
Measurement Bias: This type of bias can occur when features and/or labels used in the
model are proxies for the actual quantity of interest, possibly introducing systematic
errors between what is intended, and what is actually measured (as in the example of
using health care costs to measure severity of an illness cited above29).
Evaluation Bias: Occurs when testing benchmarks are not properly calibrated, or when
performance metrics are not appropriate to the model’s deployment context. An often-
cited example of this would be the poor performance of facial recognition software on
women of color, because they were under-represented in the benchmark data sets used
for testing such software30.
Statistical Bias: The systematic difference between a statistical estimate and the true
underlying value (“ground truth”)
Given the multiple manifestations of bias, and the several stages at which they can enter the
machine learning life cycle, how can bias be detected? Bias in the training/validation/test-
ing data can often be detected through good data documentation practices (see Section
27.4.1), and through the traditional exploratory data analysis (EDA). However, sometimes
the bias in the data is too subtle; or else the bias arises at a later stage in the machine learn-
ing life cycle. In such cases, bias can only be detected through its effect on the model
predictions, by applying some fairness metrics.
29
Obermeyer, Z., Powers, B., Vogeli, C., Mullainathan, S., (2019), ‘Dissecting racial bias in an algorithm used to
manage the health of populations’, Science, Vol. 366, pp. 447–453.
30
Buolamwini, J., and Gebru, T., (2018), ‘Gender Shades: Intersectional Accuracy Disparities in Commercial Gender
Classification’, Proceedings of Machine Learning Research, Vol. 81, pp. 1–15. http://proceedings.mlr.press/v81/
buolamwini18a/buolamwini18a.pdf
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
27.3 Fairness in AI 805
In order to discuss the fairness metrics, it is convenient to recall the Confusion Matrix for a
binary classifier, and some associated values:
Lets consider an example which is by now a classic in the fairness in machine learning
literature: a model for predicting recidivism widely used in the US for making parole deci-
sions – COMPAS. In May 2016, ProPublica published an article31 indicating that the predic-
tions of a widely-used recidivism modelling model (COMPAS), were biased.
The debate between ProPublica and Northpointe, the company that developed COMPAS,
drew many academics, legal scholars, and civil rights advocates into its wake32 33. Several
issues were at stake, but we will focus on one aspect – Northpointe claimed that their algo-
rithm was fair, and had the data to show it. But ProPublica claimed that the algorithm was
unfair, and had the data to prove it as well. How could this be?
A brief survey of some of the most popular fairness metrics can help clarify the controversy.
Group Fairness. These fairness metrics are based on the premise that different groups should
be treated equally. Such fairness metrics require the identification of the sub-group or sub-
31
Angwin, J., Larson, J., Mattu, S., and Kirchner, L., (2016), ‘Machine bias: There’s software used across the country
to predict future criminals. And it’s biased against blacks’, ProPublica, 23 May 2016. https://www.propublica.
org/article/machine-bias-risk-assessments-in-criminal-sentencing
32
Dressel, J., and Farid, H., (2018), ‘The accuracy, fairness, and limits of predicting recidivism’, Science Advances,
17 January 2018: Vol. 4, no. 1. https://advances.sciencemag.org/content/4/1/eaao5580.full
33
Feller, A., Pierson, E., Corbett-Davies, S., Goel, S., (2016), ‘A computer program used for bail and sentencing
decisions was labeled biased against blacks. It’s actually not that clear’, Monkey Cage, 17 October 2016.
https://www.washingtonpost.com/news/monkey-cage/wp/2016/10/17/can-an-algorithm-be-racist-our-analysis-
is-more-cautious-than-propublicas/?noredirect=on&utm_term=.24b3907c91d1
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
806 27 Trustworthy AI
groups of the population that are potential victims of bias (also called protected group), and a
comparator group (unprotected group). In the COMPAS case, the protected group was African
American defendants, and the comparator group was white defendants.
Group Fairness (Demographic Parity, Statistical Parity). A classifier satisfies this defini-
tion if subjects in both protected and unprotected groups are equally likely to be assigned to
the positive predicted class. In the example of predicting recidivism, a model would satisfy
this definition if, for example, the probability of being predicted high risk for re-offending
was 20 % regardless of the defendant’s race.
Conditional Statistical Parity (Conditional Demographic Parity). This fairness metric re-
laxes the previous one, and permits to condition the result on a set of legitimate attributes.
For example, the recidivism prediction model could condition on the type of offense the
defendant was guilty of – and would be considered fair if the probability of being predicted
high risk for re-offending, given the type of offense that had been committed, was equal
regardless of race.
Predictive Parity. A classifier satisfies this metric if both protected and unprotected groups
have equal PPV. This means, of all defendants predicted by the model to be high risk, the
percentage who actually are high risk for recidivism is the same, regardless of race. For
example, if 80 % of African American defendants predicted to be high risk actually were
high risk, then Predictive Parity would hold if also 80 % of white defendants predicted to be
high risk, actually were high risk.
False Positive Error Rate Balance (Predictive Equality). A classifier satisfies this metric
if both the protected and the unprotected group have the same FPR. For example, if the
recidivism prediction model were to have an FPR of 15 % for both African American and
white defendants, then it would be considered fair by this metric.
False Negative Error Rate Balance (Equal Opportunity). This fairness metric is satisfied
if both protected and unprotected groups have the same FNR.
Equalized Odds. This metric combines False Positive Error Rate Balance and False Nega-
tive Error Rate Balance. It holds whenever the FPR and the FNR are equal for both protected
and unprotected groups.
Overall Accuracy Equality. A classifier satisfies this metric if the overall accuracy for both
the protected and theunprotected groups is the same. In the case of the recidivism predic-
tion model, Overall Accuracy Equality would hold, for instance, if the model predictions
were 85 % accurate for African American, and for white, defendants.
Test-fairness (Calibration). This metric is a more refined version of Group Fairness.
Test-fairness holds if, for any predicted probability score P, subjects in both protected and
unprotected groups are equally likely to belong to the positive class. For example, if the
recidivism prediction model were to satisfy Test-fairness, then for any value of P between 0
and 1, the likelihood that an African American defendant with predicted probability P of
re-offending belongs to the true high risk class must be the same as the likelihood for a
white defendant with predicted probability P of re-offending to belong to the true high risk
class.
Individual Fairness. The following fairness metrics are based on the notion of giving similar
predictions to similar individuals.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
27.3 Fairness in AI 807
Fairness Through Unawareness. A classifier satisfies this fairness metric if it does not
explicitly use any sensitive characteristics in making its predictions. In the recidivism pre-
diction model, this would mean that the attribute race was not included among the features
the model used to make its predictions.
Fairness Through Awareness. This fairness metric attempts to capture the concept that
similar individuals (excluding sensitive features – such as race in the recidivism model
case) should have similar predictions. Similarity is determined by a pre-defined distance
metric; and in order for a classifier to be fair according this definition, the distance between
the prediction outputs for individuals should not exceed the distance between the individu-
als.
There are many other fairness metrics, including several involving the use of causal reason-
ing and causal graphs that attempt to identify if there are causal links between sensitive
characteristics and model predictions. The interested reader is encouraged to consult the
articles of Verma et al.34 and Mehrabi et al.35 for a more in-depth discussion.
Now that we’re equipped with some basic understanding of fairness metrics, we can return
to the COMPAS controversy.
Northpointe said the algorithm is fair, because within each risk category, the proportion of
defendants who reoffend is approximately the same regardless of race. They were using the
Predictive Parity metric: The proportion of correctly predicted high risks is the same
regardless of demographic.
ProPublica said the algorithm is unfair, because among defendants who ultimately did not
reoffend, African Americans were more than twice as likely as whites to be classified as
high risk (42 percent vs. 22 percent). They also showed that white defendants had a higher
False Negative Rate. ProPublica used the Equalized Odds fairness metric: All groups should
have equal False Negative Rate and equal False Positive Rate.
It turns out, that the fairness metrics are not compatible – this is shown by some impossi-
bility theorems36. However, without getting into high-level mathematics, we just recall some
values from the Confusion Matrix above: Let p be the Prevalence (i.e. the proportion of high
risk individuals in a population), FPR be the False Positive Rate, FNR the False Negative
Rate, and PPV be the Positive Predictive Value. Then a little bit of Algebra and this formula37
... suffice to show that Equalized Odds and Predictive Parity cannot both hold true when p is
different for African American defendants vs. white defendants. In fact, the overall recidi-
vism rate for African American defendants is higher than for white defendants (52 percent
vs. 39 percent). The issues in predicting recidivism lie much deeper, and are difficult to
34
Verma, S. and Rubin, J., (2018), “Fairness Definitions Explained”, Proceedings of the International Workshop on
Software Fairness (FairWare), pp. 1–7.
35
Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A., (2019), ‘A Survey on Bias and Fairness in
Machine Learning’, arXiv preprint. https://arxiv.org/pdf/1908.09635.pdf
36
Kleinberg, J., Mullainathan, S. and Raghavan, M., (2018), ‘Inherent Trade-Offs in the Fair Determination of Risk
Scores’, ACM SIGMETRICS Performance Evaluation Review, Vol. 46, No. 1, p. 40.
37
Chouldechova, A., (2017), ‘Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction
Instruments’, Big Data, Vol. 5, No. 2, pp. 153–163.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
808 27 Trustworthy AI
capture in one fairness metric. Dressel et al.38, and Richardson et al.39 point to some of the
other issues and challenges involved in attempting to use algorithms in complex situations
with deeply embedded societal biases, and where model predictions can have such a funda-
mental impact on those affected.
So we have more than 20 definitions of fairness, and they cannot all hold true simultane-
ously – this means that some fairness definition has to be selected a priori, in order to
determine if the model is biased or not. How can the right fairness definition be selected?
Unfortunately, there is no simple answer to this question. There is no regulatory guidance
(yet), and the fairness definitions do not clearly map onto legal concepts (see Wachter et al.40
for a discussion in European Law). It is therefore essential to consider the use context of the
model, and the stakeholders, and preferably to establish the fairness criteria in open discus-
sion with those most likely to be impacted by the model.
Checking for bias should be an integral part of the machine learning model training proce-
dure: Check for bias, de-bias the model, repeat.
De-biasing can happen at three stages:
Pre-process
In-process
Post-process
Pre-process de-biasing assumes that the training data is the source of bias, and modifies/
adjusts the data, or even creates “fairer” synthetic versions, in order to minimize the bias.
The advantage of this approach is that one is then completely free to choose the algorithm
to train for prediction. However, pre-process de-biasing cannot remove bias that arises
during the algorithm design, or later stages of development/deployment. Additionally,
de-biasing the data might actually just hide the bias, rather than remove it: for example,
Gonen et al.41 show that attempts to remove gender bias in word embeddings are actually
just concealing the bias.
In-process de-biasing directly modifies the model training process. It operates either by
integrating a fairness metric into the loss function, or by introducing a fairness constraint.
The advantage of this bias mitigation measure is that it provides the best trade-off between
fairness and accuracy. However, it can make the training process even less transparent,
possibly leading to even more hidden biases.
38
Dressel, J., and Farid, H., (2018), ‘The accuracy, fairness, and limits of predicting recidivism’, Science Advances,
17 January 2018: Vol. 4, no. 1.
39
Richardson, R., Schultz, J., and Crawford, K. (2019), ’Dirty Data, Bad Predictions: How Civil Rights Violations
Impact Police Data, Predictive Policing Systems, and Justice’, NYU Law Review, Vol. 94, N. 192.
40
Wachter, S., Mittelstadt, B., Russell, C., (2021), ‘Why Fairness Cannot Be Automated: Bridging the Gap Between
EU Non-Discrimination Law and AI’, Computer Law & Security Review, Vol. 41, (2021):105567.
41
Gonen, H., and Goldberg, Y., (2019), ‘Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in
Word Embeddings But do not Remove Them’, Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, Vol 1, pp. 609–614.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
27.4 Transparency of AI Systems 809
Post-process de-biasing corrects for bias only after the predictions are made. This
approach does not change the data, or the prediction model itself, and can also be used on
proprietary black-box models (i.e. if the model has been developed by some other organiza-
tion). The disadvantage is that it does not allow for an optimal compromise between accu-
racy and fairness, and that it adds an extra step to the whole modelling process.
There are several open source toolkits that provide the tools for bias detection and bias
mitigation. Here are some of the most popular ones to date:
Themis https://themis-ml.readthedocs.io/en/latest/
LinkedIn Fairness Toolkit (LiFT) https://github.com/linkedin/LiFT
Google What-IF Tool https://pair-code.github.io/what-if-tool/
FairTest https://github.com/columbia/fairtest
AIF360 https://github.com/Trusted-AI/AIF360
42
EU Draft AI Act: https://digital-strategy.ec.europa.eu/en/policies/european-approach-artificial-intelligence
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
810 27 Trustworthy AI
In some cases, the regulator may require documentation (see, for example, the draft AI
Act, Articles 9 and 10, and Annex IV), or even an explanation (arguably, this is contained
in GDPR Article 22 and more clearly laid out in Recital 71)
It might be necessary for a third-party auditor to assess conformity of the AI system with
regulatory requirements.
Finally, clearly documented models whose predictions can be adequately explained, are
more easily trusted.
In the sections that follow, current best practices for data and model documentation will be
described; the chapter closes with a discussion of explainability.
Data is one of the major ingredients in a machine learning system, and can have a profound
impact on final model performance. Yet surprisingly, until 2018, very few efforts were made
to document how that data was collected, pre-processed, and used. As noted by Gebru
et al.43:
In the electronics industry, every component, no matter how simple or complex, is accompa-
nied with a datasheet that describes its operating characteristics, test results, recommended
uses, and other information. By analogy, we propose that every dataset be accompanied with
a datasheet that documents its motivation, composition, collection process, recommended
uses.
While, for example, the draft AI Act explicitly requires some form of datasheet as part of the
model documentation for high risk AI systems, there are other good reasons for creating
them:
Increased transparency and accountability for the data set creators
Other researchers can more easily reproduce model results, increasing trust in the mod-
el’s performance
Unwanted biases arising from biased training data can be detected more easily, and
mitigation measures adopted
Potential data set users can more easily determine if the data set is suitable for their
needs
A datasheet, as proposed by Gebru et al. (see also Bender et al.44 for a similar concept spe-
cifically for Natural Language Processing data) is a series of questions about the data set
collection process and composition, that is also an opportunity for the data set creators to
reflect on their data collection process. Questions include:
Why is the data being collected? Who requested/funded the data collection, and for what
purpose?
43
Gebru, T., Morgenstern, J., Vecchione, B., Wortman Vaughan, J., Wallach, H., Daumé III, H., Crawford, K., (2020),
‘Datasheets for Datasets’, arXiv preprint. http://arxiv.org/abs/1803.09010
44
Bender, E., and Friedman, B., (2018), ‘Data Statements for Natural Language Processing: Toward Mitigating
System Bias and Enabling Better Science’, Transactions of the Association for Computational Linguistics, Vol. 6,
pp. 587–604. https://aclanthology.org/Q18-1041.pdf
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
27.4 Transparency of AI Systems 811
What is the composition of the data set (for example, does it contain tabular data, or text
data? Multi-modal data?)
How was the data collected? What sampling procedure, why was that method chosen, and
sampled from which population?
Was the data labelled? If so, by whom?
What were the pre-processing steps?
What are the results of the Exploratory Data Analysis? (For example, how many items,
how many features, correlations, sensitive characteristics, etc.)
What are the intended uses of this data set, and are there uses to avoid?
Will the data set be distributed? If so, how?
Who is repsonsible for the maintenance of the data set?
Such documentation could go a long way towards avoiding certain kinds of bias – for exam-
ple, understanding the composition of the training data might have alerted developers to the
fact that people with darker skin tones were under-represented in the data, before they de-
ployed a model that labelled some people as Gorillas45; while knowing who labelled the data,
and what societal biases they might bring to the table, could have lead to the earlier detec-
tion of picture captioning models that perpetuate gender stereotypes46.
Very easily overlooked stakeholders in AI development are the workers who are employed
to label the data. Very often, these are poorly-paid crowd workers. Documenting the data
and how it was collected could also contribute to improving the labor conditions of these
crowd workers47 – having to be transparent about who labelled the data and how much they
were remunerated might incline tech employers to offer more advantageous working condi-
tions.
Documentation for the machine learning model follows the same principles as documenta-
tion for the data sets: such documentation should help increase transparency and account-
ability for the model developers; allow for reproducibility of models and/or their results;
include measures for model bias; and avoid model use in inappropriate contexts.
Some examples of what such model documentation could look like can be found in Mitchell
et al.48, or in Annex IV of the draft AI Act. In particular, the following information should be
included:
45
Simonite, T., When It Comes to Gorillas, Google Photos Remains Blind, Wired, 1 November 2018.
https://www.wired.com/story/when-it-comes-to-gorillas-google-photos-remains-blind/
46
Simonite, T., When AI Sees a Man, It Thinks ‘Official.’ A Woman? ‘Smile’, 19 November 2020.
https://www.wired.com/story/ai-sees-man-thinks-official-woman-smile/
47
Semuels, A., ‘The Internet Is Enabling a New Kind of Poorly Paid Hell’, The Atlantic, 23 January 2018.
https://www.theatlantic.com/business/archive/2018/01/amazon-mechanical-turk/551192/
48
Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., Gebru, T., (2019),
‘Model Cards for Model Reporting’, FAT ’19: Proceedings of the Conference on Fairness, Accountability, and
Transparency. http://arxiv.org/abs/1810.03993
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
812 27 Trustworthy AI
Basic logistic information about the model, such as person or organization developing it,
licenses, and where to send questions or bug reports..
Basic descriptions of the training algorithms, the model parameters, optimization func-
tion, fairness constraints, and features.
Intended use, in particular optimal uses, primary intended users, and use cases to avoid.
Under which conditions – for example, using which target populations; if audio/visual
input is required, which kind, and under what environmental conditions – the model
performance has been tested.
Which metrics were used to measure model performance? Which fairness metrics? What
decision thresholds were used?
Details on the training and test data (see Documenting the data section above)
We also recommend adding information on energy efficiency and environmental impact –
research49 50 indicates that the energy consumption of storing large amounts of data, and
training and deploying very large complex models (for example deep learning models) is
significant.
27.4.3 Explainability
Explainability is another term in AI that is often discussed, but rarely defined; it is also
often used interchangeably with interpretability. However, we adopt the following working
definition: explainability is the property of an AI system to provide evidence or reasons for
its outputs and/or processes51.
As already discussed in the chapter introduction, explainability can be essential for detect-
ing unwanted biases, testing and debugging models, ensuring safety – and establishing
trust in the model and its predictions. For example, as Ribeiro et al.52 show, it is possible to
train a classifier to distinguish between huskies and wolves – yet, only when model expla-
nations are added does it become clear that, due to some data collection artefacts (all pic-
tures of wolves had snow in the background), what was actually trained was a snow versus
no-snow classifier. When data sets become so large that such data artefacts are not easily
detected, it is difficult to trust the outputs of complex prediction models, even if they seem
to be performing well.
Given that explainability is a desirable property of an AI system, how can it be achieved?
The first fundamental decision comes at the algorithm selection phase: should an intrinsi-
cally explainable algorithm be used (such as linear/logistic regression, decision trees, or
Naive Bayes), or should some post-hoc methods be applied to a more complex algorithm
49
Strubell, E., Ganesh, A., and McCallum, A., (2019), ‘Energy and Policy Considerations for Deep Learning in NLP’,
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3645–3650.
50
Bender, E. M., Gebru, T., McMillan-Major, A., Shmitchell, S., (2021), ‘On the Dangers of Stochastic Parrots:
Can Language Models Be Too Big?’, Proceedings of the 2021 ACM Conference on Fairness, Accountability, and
Transparency, pp. 610–623.
51
NIST, ‘Four Principles of Artificial Intelligence’, https://nvlpubs.nist.gov/nistpubs/ir/2021/NIST.IR.8312.pdf – see
definition of “Explanation”
52
Ribeiro, M. T., Singh, S., Guestrin, C., (2016), ‘“Why Should I Trust You?” Explaining the Predictions of Any
Classifier’, arXiv preprint. https://arxiv.org/pdf/1602.04938.pdf
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
27.4 Transparency of AI Systems 813
(random forests, or neural networks, for example). Depending on the use case and the avail-
able data, a simpler and intrinsically explainable model can be a viable and competitive
option. A useful approach would be training a simple model in addition to more complex
variants in order to understand the trade off between performance and explainability.
Additionally, it is important to understand the explainability measures to use: this is very
often determined by whom the explanation is for, and their level of machine learning exper-
tise. Should the explanation method be model specific, or can it apply to many different
kinds of models? Should the explanation be local – providing an understanding of why the
model produced a prediction for a particular instance, or should the explanation be global,
addressing the entire model behaviour? Finally, how should the model explanations be
delivered? Popular methods include:
Feature summary statistics: this could be feature importance information, showing how
much each feature contributed to the model decision; or in the form of a correlation
matrix, showing the correlations between the features and the predicted output.
Feature summary visualization: for example, curves that show a feature and the aver-
age predicted outcome for that feature
Model internals: this includes variable coefficients for linear or logistic regression; the
learned tree structure in a decision tree; or else learned weights in the neural networks
setting. However, while the internals for the simpler models can deliver reasonable expla-
nations, understanding the implications in the neural networks case could be a challenge.
If the prediction model is based on a complex algorithm, it could be possible to approxi-
mate it (either locally or globally) with an intrinsically interpretable model.
Two very popular frameworks for post-hoc explanations are LIME53 and SHAP54.
LIME provides local explanations to any kind of model (i.e. it is model agnostic). In order to
explain the model prediction for a particular instance, LIME perturbs the instance, and ob-
tains the model predictions for the new perturbed points. It weights the perturbed points
according to their distance from the original instance, and then trains a weighted linear (or
any other kind of interpretable) model on the data set of perturbed points.
SHAP55 is based on Shapley values, which use coalitional game theory to compute the “mar-
ginal utility” of each feature in determining a local prediction. These local predictions are
then added up in a linear model, in order to produce a global explanation. A word of caution:
SHAP must have access to the model’s entire training data in order to function.
Model-agnostic post-hoc explanations such as SHAP and LIME can be a useful resource
when trying to understand the behaviour of, for instance, a deep neural network-based
model. However, as the NIST paper points out, there are a few other requirements for an
explainable AI – and a fundamental one is Explanation Accuracy: An explanation correctly
reflects the reason for generating the output and/or accurately reflects the system’s pro-
cess56. Unfortunately, recent research has shown that SHAP and LIME explanations can be
53
https://github.com/marcotcr/lime
54
https://github.com/slundberg/shap
55
https://github.com/slundberg/shap
56
NIST, ‘Four Principles of Artificial Intelligence’, https://nvlpubs.nist.gov/nistpubs/ir/2021/NIST.IR.8312.pdf
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
814 27 Trustworthy AI
manipulated57 58 – leaving some doubt as to how accurate their explanations are, and when
they break down.
■ 27.5 Conclusion
Trustworthy AI is a field still very much in its infancy. While there is already a large body
of so-called ethics guidelines, the regulatory and standards framework is still under con-
struction, and practical tools, where they exist, are fragmented and often usable only by a
small group of experts. This unsettled state of affairs, combined with the buzzword status of
the term “Trustworthy AI”, can make it appear more of a special “add-on”, or “nice-to-have”
feature, rather than what it should be: an integral part of the AI system life-cycle, ensuring
that the AI solution will be beneficial, and reliably deliver what it promises.
■ 27.6 In a Nutshell
57
Slack, D., Hilgard, S., Jia, E., Singh, S., Lakkaraju, H., (2020), ‘How can we fool LIME and SHAP? Adversarial
Attacks on Post hoc Explanation Methods’, AIES ’20, 7–8 February 2020.
58
Dimanov, B., Bhatt, U., Jamnik, M., Welle, A., (2020), ‘You Shouldn’t Trust Me: Learning Models Which Conceal
Unfairness From Multiple Explanation Methods’, 24th European Conference on Artificial Intelligence – ECAI 2020.
28
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
Epilogue
Stefan Papp
Bob remembered the old office, which was gloomy and uninviting. He recalled that when-
ever he’d entered those looming, mahogany doors, he’d always looked forward to getting out
again. The new office, however, invited anyone who entered it to stay. With the new office
building, it felt like the company had moved into the 21st century at last.
“Sometimes, there is a vibe in the air,” Bob heard from Pratima, the HR manager, whom he
had met the first time three years earlier during the interviews. You cannot describe it, but
you feel that you are in an energetic environment, with a constant buzz and activity. And
then you just want to be part of it and to contribute.”
The personnel of Halford were proud of the new office, and Narine, the new ESG manager,
was the first to speak at the party’s opening ceremony after a quick introduction by Alice as
a member of the new generation of employees.
Bob saw Narine for the first time as she walked up onto the stage. She walked confidently to
the microphone, and her smile was welcomed with massive applause from the audience.
“I am honored to be here, and I cannot thank Alice enough for her kind introduction. I was
asked to tell my story to everyone here, as we have guests who may not have heard it. I
founded a small company with data scientists to automate ESG reporting some years ago.
We had some ideas on how to use algorithms to predict greenhouse gas emissions. Much of
our work is based on exploring data on how materials in facilities change over time and how
this impacts the facilities’ balance of emissions and energy.
Three years ago, a think tank in Halford explored data science use cases. During a project
to improve energy efficiency, one of their data scientists proposed using a byproduct of
Halford manufacturing to isolate their factory walls. He came up with an idea to model the
impact, but at some point, he needed external help, and Halford hired us to work with them.
Two weeks later, the data scientists presented their findings to the board. The initial presen-
tation left some questions unanswered. Some managers turned the idea down. But it was
never about yes or no for Alice and her team. They were determined to go deep with every
idea and kept themselves open-minded by asking, “how can this work?” or, “what would be
needed to make this a commercial success?” They teamed up the data scientists with one of
their sales legends: Tom had an excellent gut feeling about what their clients wanted, and
he asked the right questions. I remember his words, “hell, if you shove such appealing num-
bers under your clients’ noses that they feel they’d be stupid not to buy, what do you expect
is gonna’ happen? That baby’s gonna’ fly! Give me a product, and I will sell it until you beg
me to stop.” Narine paused, grinning, as laughter rippled through the room. “With time, the
team developed a data-driven process that collected enough evidence for most clients on
their benefit from our newly created isolation material. The rest is history. Halford acquired
my little company, and we became a new business unit.
It was also clear that we must practice what we preach. So, one of the company’s objectives
was to build a new head office that follows all energy efficiency principles. My two col-
leagues, Ani and Arpine, will show how we use data science to calculate energy improve-
ment.”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
28.1 Halford 2.0 817
Two young data scientists entered the stage. They smiled a lot, though they stood somewhat
rigidly, and talked softly, as they began to present the first insights into how they calculated
the carbon footprints of the new and old offices. Bob had overhead more than a little skepti-
cism about this venture in recent weeks, but as slide after slide of graphs and results were
met with whoops and bursts of applause, it seemed even the harshest of critics were coming
around. As the presentation progressed, Ani and Arpine’s voices grew louder, and they be-
gan to move around the stage, approaching their audience with ease. They finished by quot-
ing Tom, who called the old facility “a damn old garbage disposal.” Many of the audience
clapped and nodded in agreement. The future was here, and everybody knew it. A central
part of it stood on energy-efficient technologies.
28.1.2 HR
Next, the HR manager, Pratima, promenaded onto the stage. She wore a heavily embroi-
dered black-brownish-red scarf with Indian designs and the Lotus flower necklace, which
Bob had observed the first time he saw her. After approaching the podium, she paused,
slowly turning to face each section of the audience directly.
“Good morning, esteemed colleagues, visionaries, and creators. Today, we stand on the cusp
of the next frontier in data science and engineering. In the words of the Vedas, ‘Ekam sat
vipra bahudha vadanti’—truth is one, the wise call it by many names. This profound princi-
ple resonates deeply with our work in data science.
When somebody talks to us, we listen. Applicants also talk to us through their public pro-
files on the internet. Here, we listen, too. Screening social media accounts—information that
people share voluntarily with the public—can help us find discrepancies.”
Pratima let her words sink in for a moment.
“I see your faces, and I see what is left unspoken. Some of you might be grateful. AI might
prevent us from hiring colleagues who turn out to be an unpleasant person later. Others
might be concerned. What if AI filters out the wrong people?
I used this example to show what is key to working with AI, especially for HR: Trust. Trust
can only be achieved by absolute transparency. There is no alternative to that. We shared
every detail of our analytical work on our intranet page.” At this, Pratima proudly gestured
to the projector screen, now featuring a screenshot from the company’s wiki.
“We’ve also established a new innovative corporate culture. Now, at Halford, every employee
is also an entrepreneur. We expect our people to express ideas and come forward to chal-
lenge the status quo. The team at our newly formed innovation lab will take their time to
explore every idea thoroughly.”
The slides switched again, and a few of the audience cheered at the smiling team photo
beaming down at them.
“Often, people ask me if listening to our employees all the time will only get us sidetracked.
My answer is always the same: Some ideas may potentially become part of our future core
business. A company’s gold lies in the minds of its employees; it would be foolish not to use
them. Thanks to new forms of AI-driven research, we can verify the value of your ideas
quicker than ever before.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
818 28 Epilogue
When I started at Halford, we needed to approach talent and ask them to work for us. Some
potential candidates turned us down because, back then, we were ancient. Today, many ap-
plicants approach us because our reputation precedes us. Sometimes, we even have to turn
down applicants for which we would have rolled out the red-carpet years ago.
I wish you all a great success in the future. If you need anything, HR is always here to help.”
The audience was still applauding as Pratima left the stage, to be replaced by Mahsa, the
lead data scientist for the Customer Satisfaction team. She walked onto the stage wearing a
Star Trek uniform and raising her right hand to a Vulcan salute.
“Greetings, earthlings!”
Some in the audience giggled.
“Today, I will present a multichannel strategy for staying connected with clients. For that,
I borrowed an enhanced transporter that can travel through time. Scotty, beam us into the
past!”
The presentation screen showed a dimly lit bedroom with a young man sitting in bed in his
pajamas.
“We are now in the bedroom of one of our clients. He just woke up, it’s 3 a.m., and his favor-
ite product stopped working. One LED is blinking red. He wants to fix it, since he has noth-
ing else to do and cannot fall asleep again. He tries to contact customer support.”
The screen switched to show a long chat dialog.
“I don’t expect you to read all that. But if you skim through it, you can see that the operator,
obviously an old bot, returns nothing helpful to questions like ‘Why is my device blinking
red?’ Our customer might want to call the service desk to talk with an actual human, but he
would find out they only operate during business hours. This was us some years ago.
Scotty, back to the present!”
With a Scottish accent, someone in the crowd shouted, “Aye, aye, lass!”
“Don’t call me lass; for you, it is Ma’am!” Mahsa laughed. “Alright. It’s again 3 a.m. Our cli-
ent’s favorite product is again not working as expected. But this time . . .”
The presentation screen showed a chat again.
“. . . the problem gets solved in the AI chat. As you can see, the chatbot tells the client to
check the batteries, and they find out that the charger has been damaged all along and,
therefore, the device has almost been discharged. I find that already fascinating. But there’s
more: Scotty, beam us to Alexia! Alexia lives in Athens and faces the same problem with her
favorite product at 3 a.m. Unfortunately, Alexia does not speak English well. But if you look
at her screen . . .”
Again, the presentation screen showed a new chat dialog.
“The text may look Greek to you, and that’s because it is! Thanks to our multilingual chat-
bots, Alexia can talk to our AI customer service chatbot in her native language.”
“What about Klingon?” shouted someone from the audience, to a round of laughter.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
28.1 Halford 2.0 819
“We have not yet received a request from Qo’nos, but we could add Klingon as well,” Mahsa
joked.
“But let’s stay on Earth and teleport to the future. Again, it is 3 a.m. But this time, we do not
write to our AI customer representatives. We talk with them. You communicate with an AI
like you would with the smartest service representative in your native language. I find that
fascinating, to the power of 10.”
At this, someone in the audience stood up and shouted, “Great job! But how about fixing the
issue so our poor client does not have to talk with us at 3 a.m.?”
“Great question, thank you! Okay, let’s explore one unexpected but pleasant side effect of
our journey. When we first started this project, we didn’t expect that by using chatbots, we
could integrate our customers into our company. Before AI, our clients used dull chatbots,
leaving many questions unanswered. In a modern world, clients can interact with us any-
time without any wait time. They can share their ideas with us, and an AI chatbot will give
them an appropriate answer. Our customer touchpoints have become a new source of
innovation. Clients feel themselves being a part of our company. Client feedback leads to
improvements, and we can fix problems fast, before they keep our customers awake at 3 a.m.
But there is more: Client feedback also leads to new features and sometimes even new prod-
ucts. We may find the next big thing by having an AI listen to our clients. Now that’s fasci-
nating, to the power of googolplex!”
“Could you share your thoughts about work culture in projects?” asked Pratima from the
front row.
“When you give curious engineers powerful tools, you create a virtuous cycle,” Mahsa re-
plied. “Many of my team members come from large companies where they were not allowed
to work independently. It still surprises me that so many organizations hire smart people
and then block them from being innovative. Deep down, I believe everyone looks for mean-
ing in their profession, and when you feel that your profession is a calling, success is just a
matter of time.”
28.1.4 Production
After Mahsa finished the Vulcan salute and said, “Live long and prosper,” Rashid and Hank
came onto the stage to her. They sat on chairs colleagues had brought on the stage during
the applause.
“Rashid is one of my beloved co-workers,” Masha began. “We sometimes disagree, but learn
so much during our ‘discussions’ that we are almost afraid we might agree too much one
day.
Rashid, you oversee modernizing the factory. How was the journey?”
“I admit, in the beginning, it was more difficult than I expected. Many factory workers were
afraid that AI could replace them. When I started, there was an accident in the factory. Two
workers were seriously injured. Of course, it was impossible to undo the accident, but it is
possible to prevent similar future accidents from happening. This was the ice-breaker.
Workers learned that we are not the enemy but can help them.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
820 28 Epilogue
A computer vision system now alerts supervisors about dangerous activities. We trained our
applications to detect workers who did not wear helmets and further dangerous scenarios.
We are continuously improving, and keep training our model for new dangers.”
Nodding, Mahsa turned to Hank. “How was the project for you?”
Hank, the production manager, still had his bushy “philosopher’s beard,” but to Bob, he now
looked less like Karl Marx and more like an ancient Greek philosopher.
“The workers all agree,” Hank said, his deep voice even sounding like a philosopher.
“A worker’s safety is most important. AI supervision is therefore acceptable if it can prevent
accidents and even may save lives.
Still, after we rolled this solution out, I was worried. Even if we learned to prevent accidents,
I feared that the next iteration of machines might want to tell us what to do, and the iteration
after that would make us obsolete.
Some of these fears might sound strange to some data professionals, but put yourselves into
the shoes of a factory worker. You do not have an education that allows you to transition to
other jobs quickly, and suddenly, a machine is doing the work you’ve been doing for years.
What happens next?”
“What eventually convinced you to support the work of our data professionals?” Mahsa
asked.
“Rashid and I talked a lot about the future of our factories. One day, he gave me some num-
bers that made me think. I had a sleepless night. These numbers would change everything.
The next days, I talked with my people about the numbers, and lastly, we had to admit that
we needed to do something.”
Masha turned to Rashid. “Which numbers did you give him?”
“I collected statistical data on demographics in the regions of our factories. I also collected
insights about birth rates, migration, and many other aspects that would affect the popula-
tion around the factories. The conclusion was that in the future, we may find only one new
hire for every two people who retire. We would have to close some sites without innovation
to operate with a smaller staff.”
“Our focus was to automate all dangerous human activities,” Hank said. “During that phase,
some of my factory workers transitioned to automation engineering. They know the ma-
chines inside out and this helps them to team up well with data scientists.” Hank paused for
a moment “I am not the best person to praise technology. But I feel we are going in the right
direction. If we manage to create better products over time, we will always have jobs.”
28.1.5 IT
After the speeches, Bob made his way to the queue for the buffet. At a tap on his shoulder,
he turned around to find someone he’d have expected to see at the front of the line already.
It was Peter, the IT manager.
“Do you still remember me?”
“Of course,” Bob said, though he could understand the reason for the question: Bob hadn’t
seen Peter since he quit, just after the data-driven initiative was launched. He would have
almost not recognized him. He had lost significant weight.
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
28.1 Halford 2.0 821
Peter followed Bob’s eyes down to his T-shirt, which read, “La Brujaria—Latin American
Kitchen.”
“I made my second business to my first. The catering is part of it,” Peter said with a grin.
“My clients chase me so much that I hardly manage to be my best customer anymore.
Have you tried my South American wine? Let me tell you, once you taste this masterpiece,
you’ll never want to drink the swill they sell in the supermarkets again. I have a special
friend who arranges things. All I need is to make a few calls and, well,” he winked.
“Sorry again that I had to take Cesar with me. But one day, he brought this delicious em-
panada into my office, and I knew I was sold. He didn’t want to join me at first because he
believed in a business career. He eventually wanted to be a boss, so I made him my chef.
Let’s say my new life is wonderful. I never realized how unhappy I was in my old job. People
often told me that I had become a dickhead. But if you are responsible for everything in IT,
what else should you do? You sometimes have to show them they cannot do everything.
Now, I am far more relaxed. I cannot stress how happy I am. You know, if Cesar or someone
else gets on more nerves today, I smile.”
“Uhuh,” Bob said, only half paying attention. He wanted to break away, but Peter continued
to talk.
Oh gosh! Bob thought, looking hopelessly at the buffet table as it grew progressively emp-
tier. He’s going to tell me all his life stories again!
Bob went to interject, but at that moment, Peter’s eyes wandered to the entrance and his
face lit up. ”Suzie!” He bellowed, and hurried off. Bob never got to see whether Suzie was as
happy to see Peter as he was to see her, as he was distracted by a voice at his side.
“What people often fail to see is the work he did in the very beginning. He was IT leader not
without a reason.”
Bob turned to see Alice, wearing a sly smirk. “Unfortunately, at some point, he must have
burned out on the way and started to act like he was already half retired. He used to get away
with too much; endless coffee chats, silly jokes, and that constant babbling and bragging!
Although many colleagues avoided him, at his core, he was a nice person who was often
simply too worried that someone could ruin something in his domain that he could not
control.
This was one of the hardest decisions in my career. You have a colleague who has contrib-
uted to the company’s past success, but you know that he is not up to future challenges.
Some board members saw in Peter a silverback who would retire in a few years anyway.
They advised me to phase him out gradually, build teams around him, and keep him away
from the new stuff. But then again, I thought this was unfair. So, I risked opening Pandora’s
box when I openly told him I was unhappy with the work culture in IT. Surprisingly, we
landed a deal that was good for both sides. It seems he has really strained himself with his
position and was happy to get out.”
Alice pointed at a young man.
“That’s Bill, our new IT manager. He is an introvert who tends to fix problems quietly. We
soon realized that if we gave him more room to operate, he would also prevent bigger prob-
lems from happening. In our new corporate culture, we focus on facts and not on bragging,
and with that, HR identified him as a ‘most valuable employee’ in a fact-finding campaign.
He was not the only undervalued employee in the company.”
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
822 28 Epilogue
28.1.6 Strategy
I googled, learned about the most innovative corporate offices1, and even visited some.
When I realized that I did not want to return to our old, dirty corporate office, I knew we had
to move forward.
Now, every day, when our colleagues get to our office, they are reminded that this is a mod-
ern environment, and we expect them to think differently about us.”
“If there is one thing you want to tell other companies who want to transition, what would
it be?”
“The worst thing you can do is stay in a fear culture. Being afraid of making mistakes does
not bring you forward. I hear many ‘what if’ questions from executives. ‘What if we found
out that artificial intelligence is not as powerful we’d hoped?’ or ‘what if our transition plans
fail and we lose our most important employees?’ The only valid ‘what if’ question is ‘what if
the world is changing around me and I do not adapt?’ Doing nothing out of fear is the biggest
risk one can take.
“Can you put that in practical terms?”
“Sure, take a fresh look at your own organization and start asking questions you have not
asked before. Understand your value creation, then ask how AI can multiply this value. You
might make mistakes on your way, but that is ok. Learning from mistakes means you are
making progress.
Nobody wants a culture of mediocrity. One friend from IT once said, ‘Pyramid hierarchies
are like buying multi-core processors but writing software that uses only a few cores.’ Every
employee should, therefore, think for the company. This does not work if managers con-
stantly micromanage their employees.
As a leader, be a coordinator of success and ask yourself how individual team members can
contribute to success and how you can empower them. Becoming an AI company also made
us become a people company. Per aspera ad astra—through difficulty to success.“
1
https://www.forbes.com/sites/alankohll/2019/01/24/how-your-office-space-impacts-employee-wellbeing/
?sh=523e67364f31
2
We even had to dial some of the worst stories down, else they might have sounded unbelievable!
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
824 28 Epilogue
in meaningless meetings and filling out forms. Great ideas were often turned down because
they were too risky. Everyone in an unsuccessful environment tried not to make any mis-
takes, but unfortunately, making no mistakes does not automatically mean you’re doing
everything right. Companies moved forward once the leadership learned that whatever got
them to a successful present wouldnot take them to a successful future, and that they in-
stead had to start to think differently. Maybe it is not a surprise that “think different” is the
motto of one of the most successful companies in the world.
■ 28.3 In a Nutshell
Thank You!
Thank you for reading this book. Please do not hesitate to review us on Good
reads3 or Amazon4.
Feel also free to reach out to the authors directly. They can all be found on
LinkedIn or similar platforms. We love to get feedback and hear what we can
improve on. Constructive feedback will be integrated into future editions.
We aim to provide readers with a holistic book about data science and artificial
intelligence. Do you have an idea for a chapter that might be still missing?
Are you keen to write yourself? Send us your ideas; we look forward to hearing
from you!
3
https://www.goodreads.com/search?q=handbook+data+science+and+ai
4
https://www.amazon.com/s?k=Handbook+data+science+and+ai
29
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
The Authors
Dr. Danko Nikolić is an expert in both brain research and AI. For
many years he has run an electrophysiology lab at the Max-Planck
Institute for Brain Research. Also, he is an AI and machine learn-
ing professional heading a Data Science team and developing com-
mercial solutions based on AI technology. He invented AI Kinder-
garten–a concept for training AI of the future for achieving near
human-level intelligence. He also pioneered using machine learn-
ing to read “minds” from the brain’s electrical signals; he and his
team were able to reconstruct what an animal was seeing solely by
analyzing the brain signals. He introduced the concept of ideas
thesia (“sensing concepts”) into neuroscience and is the author of
a theory called practopoiesis describing how biological systems
achieve intelligence. He has a degree in Psychology and Civil Engi-
neering from the University of Zagreb, Croatia and a PhD from the
University of Oklahoma, USA. He was an honorary professor at the
university of Zagreb from 2014 to 2019.
Index
C configuration management 61
confirmation bias 17 see bias
calculus 253
conflicts of responsibility 650
calibration 586
confusion matrix 284
camera obscura see pinhole camera
consistency 178
cap and trade 762
container 687
capital allocation line 390
Contextual Word Embeddings 536
CAP theorem 178
continuous bag-of-words 533
carbon accounting 763
convolution 416, 441
carbon capture and storage (CCS) 770
Convolutional Neural Networks (CNN) 357,
carbon capture and utilization (CCU) 770
436, 441, 527
carbon credits 762
convolution layers 450
Carbon Engineering 769
Conway’s law 145
carbon sequestration 767
copyright 488
catastrophic forgetting 479
Corner detector 560
Central Limit Theorem 264
correlation 265, 390
central organization 655
CosmosDB 41
Chain of Thought prompting 474
covariance 265
change ambassadors 171
COVID-19 604
Change Data Capture (CDC) 137, 156
– model calibration 608
change management 171, 678
– model implementation 609
Charge-Coupled Device (CCD) 554
– parametrization 608
Charter of Fundamental Rights 701
– structure and scheduling 606
ChatGPT 454, 462, 708
C#, programming language 126
Chief Data Officer 654
CPU 46
China 784
critical thinking 1
Cholesky decomposition 253
cross-validation 288
churn rate 751
crypto currencies 390
CIA triad 216
CSV 83
CI/CD 133
cumulative variables 391
classification model 271, 336
customer churn 161
Climeworks 769
customer journey 731
cloud 662
customer satisfaction 726
cloud-based BI 332
cybersecurity 215
cloud provider 39
cluster 351
Code Injection see LLM Attacks D
Cognos Analytics 327
DAG (Directed Acyclic Graph) 112, 146
Color Filter Array (CFA) 554
Dagster 113–114
column-based security 184
daily standup 676
column chart 631
Dartmouth conference 381
commodity price 743
dashboard 318
completeness 178
data
compression 557
– availability 580
computational photography 560
data access policies 185
computerized model 584
data analysis 318
computer science 627
data architecture
Computer Vision 547–549, 559–561, 564
– maintainability 119
conceptual data model (CDM) 308
– scalability 89
confabulation 487
database management systems (DBMS) 323
configuration drift 57
© Dies ist urheberrechtlich geschütztes Material. Bereitgestellt von: Mercedes-Benz Group AG 20.10.2024
836 Index
ISBN 978-1-56990-934-8
9
9 781569
7 8 15 6 9 909348
9 09348