Righting Software
Juval Löwy
“I attended both the Architect’s Master Class and the Project Design Master Class.
Before these two classes I had almost lost all hope of ever being able to figure out
why the efforts of my team were never leading to a successful end, and I was strug-
gling to find a working solution to stop the insane death march we were on. The
Master Classes opened my eyes to a world where software development is elevated
to the level of all other engineering disciplines and is conducted in a professional,
predictable, and reliable manner, resulting in high-quality working software devel-
oped on time and within budget. The knowledge gained is priceless! From revealing
how to create a solid and sound architecture, which withstands ever-changing user
requirements, to the intricate details on how to plan and guide the project to a suc-
cessful end—all this was presented with expertise and professionalism that are hard
to match. Considering that every bit of distilled truth Juval shared with us in class is
acquired, tested, and proven in real life, it transforms this learning experience into
a powerful body of knowledge that is an absolute necessity for anyone who aspires
to be a Software Architect.”
—Rossen Totev, software architect/project lead
“The Project Design Master Class is a career-changing event. Having come from an
environment where deadlines and budgets are almost pathologically abused, having
the opportunity to learn from Juval was a godsend. Piece by piece he provided the
parts and the appropriate tools for properly designing a project. The result is that
costs and timelines are kept in check in the dynamic and even chaotic environment
of modern software development. Juval says that you are going to engage in asym-
metric warfare against overdue and over cost, and you walk away truly feeling that
you have a gun to take to a knife fight. There is no magic—only the application of
basic engineering and manufacturing tenets to software—but you will go back to
your office feeling like a wizard.”
—Matt Robold, software development manager, West Covina Service Group
“Life changing. I feel like a tuned piano after collecting dust for a couple of decades.”
—Jordan Jan, CTO/architect
“The course was amazing. Easily this was the most intense but rewarding week of
my professional life.”
—Stoil Pankov, software architect
“Learning from Juval Löwy has changed my life. I went from being just a developer
to being a true software architect, applying engineering principles from other disci-
plines to design not just software, but also my career.”
—Kory Torgersen, software architect
“The Architect Master Class is a life lesson on skills and design—which I took twice.
It was so transformational the first time I attended that I wished I had taken this
class decades back, when I started my career. Even taking it for the second time has
only gotten me to 25% because the ideas are so profound. The required brain rewir-
ing and unlearning is really painful, but I needed to come back again with more of
my colleagues. Finally, every day that goes by I reflect back on what Juval said in
the classes and use that to help my teams implement even the small things so that
we can all eventually call ourselves Professional Engineers. (P.S. I took 100 pages of
notes second time around!)”
—Jaysu Jeyachandran, software development manager, Nielsen
“If you are frustrated, lacking energy, and demotivated after seeing and experiencing
many failed attempts of our industry, the class is a boost of rejuvenation. It takes
you to the next level of professional maturity and also gives you the hope and con-
fidence that you can apply things properly. You will leave the Project Design Master
Class with a new mindset and enough priceless tools that will give you no excuse
to ever fail a software project. You get to practice, you get your hands dirty, you
get insight, and experience. Yes, you CAN be accurate when it is time to provide
your stakeholders with the cost, the time, and the risk of a project. Now, just don’t
wait for a company to send you to this class. If you are serious about your career,
you should hurry to take this or any IDesign Master Classes. It is the best self-
investment you can make. Thank the entire great team of IDesign for their continu-
ous efforts in helping the software industry become a solid engineering discipline.”
—Lucian Marian, software architect, Mirabel
“As someone in their late twenties, relatively early in their career, I can honestly say
that this course has changed my life and the way I view my career path. I honestly
expect this to be one of the most pivotal points of my life.”
—Alex Karpowich, software architect
“I wanted to thank you for a (professional) life-changing week. Usually I can’t sit at
class more than 50% of the time—it is boring and they don’t teach me anything I
couldn’t teach myself or already know. In the Architect’s Master Class I sat for nine
hours a day and couldn’t get enough of it: I learned what my responsibilities are as
an architect (I thought the architect is only the software designer), the engineering
aspect of software, the importance of delivering not only on time but also on budget
and on quality, not to wait to ‘grow’ to be an architect but to manage my career,
and how to quantify and measure what I previously considered as hunches. I have
much more insight from this week and many pieces are now in place. I can’t wait to
attend the next Master Class.”
—Itai Zolberg, software architect
RIGHTING SOFTWARE
A METHOD FOR SYSTEM AND PROJECT DESIGN
Juval Löwy
Preface xxiii
About the Author xxxiii
Volatility-Based Decomposition 30
Decomposition, Maintenance, and Development 32
Universal Principle 33
Volatility-Based Decomposition and Testing 34
The Volatility Challenge 34
Identifying Volatility 37
Volatile versus Variable 37
Axes of Volatility 37
Solutions Masquerading as Requirements 40
Volatilities List 42
Example: Volatility-Based Trading System 42
Resist the Siren Song 47
Volatility and the Business 48
Design for Your Competitors 50
Volatility and Longevity 51
The Importance of Practicing 52
Chapter 3 Structure 55
Use Cases and Requirements 56
Required Behaviors 56
Layered Approach 58
Using Services 59
Typical Layers 60
The Client Layer 60
The Business Logic Layer 61
The ResourceAccess Layer 63
The Resource Layer 65
Utilities Bar 65
Classification Guidelines 65
What’s in a Name 65
The Four Questions 66
Managers-to-Engines Ratio 67
Key Observations 68
Subsystems and Services 70
Incremental Construction 70
About Microservices 72
Chapter 4 Composition 83
Requirements and Changes 83
Resenting Change 84
Design Prime Directive 84
Composable Design 85
Core Use Cases 85
The Architect’s Mission 86
There Is No Feature 91
Handling Change 92
Containing the Change 92
Index 431
PREFACE
Hardly anyone gets into software development because they were forced into it.
Many literally fall in love with programming and decide to pursue it for a living.
And yet there is a vast gap between what most hoped their career would be like
and the dark, depressing reality of software development. The software industry
as a whole is in a deep crisis. What makes the crisis so acute is that it is multi-
dimensional; every aspect of software development is broken:
• Cost. There is weak correlation between the budget set for a project and what
it will actually cost to develop the system. Many organizations do not even try
to address the cost issue, perhaps because they simply do not know how, or
perhaps because doing so will force them to recognize they cannot afford the
system. Even if the cost of the first version of a new system is justified, often the
cost across the life of the system is much higher than what it should have been
due to poor design and an inability to accommodate changes. Over time, main-
tenance costs become so prohibitive that companies routinely decide to wipe
the slate clean, only to end up shortly thereafter with an equally or even more
expensive mess as a new system. No other industry opts for a clean slate on a
regular basis simply because doing so does not make economic sense. Airlines
maintain jumbo jets for decades, and a house can be a century old.
• Schedule. Deadlines are often just arbitrary and unenforceable constructs
because they have little to do with the time it takes to actually develop the
system. For most developers, deadlines are these useless things whooshing by
as they plow ahead. If the development team does meet the deadline, everyone
is surprised because the expectation is always for them to fail. This, too, is a
direct result of a poor system design that causes changes and new work to ripple
through the system and invalidate previously completed work. Moreover, it is
the result of a very inefficient development process that ignores both the depen-
dencies between activities and the fastest, safest way of building the system. Not
only is the time to market for the whole system exceedingly long, but the time
for a single feature may be just as inflated. It is bad enough when the project
slips its schedule; it is even worse when the slip was hidden from management
and customers since no one had any idea what the true status of the project was.
• Requirements. Developers often end up solving the wrong problems. There is
a perpetual communication failure between the end customers or their internal
intermediaries (such as marketing) and the development team. Most developers
also fail to accommodate their failure to capture the requirements. Even when
requirements are perfectly communicated, they will likely change over time. This
change invalidates the design and unravels everything the team tried to build.
• Staffing. Even modest software systems are so complex that they have exceeded
the capacity of the human brain to make sense of them. The internal and exter-
nal complexity is a direct result of poor system architecture, which in turn leads
to convoluted systems that are very difficult to maintain, extend, or reuse.
• Maintenance. Most software systems are not maintained by the same people
who developed them. The new staff does not understand how the system oper-
ates, and as a result they constantly introduce new problems as they try to solve
old ones. This quickly drives up the cost of maintenance and the time to market,
and leads to clean-slate efforts or canceled projects.
• Quality. Perhaps nothing is as broken with software systems as quality. Software
has bugs, and the word “software” is itself synonymous with “bugs.” Developers
cannot conceive of defect-free software systems. Fixing defects often increases
the defect count, as does adding features, or just plain maintenance. Poor qual-
ity is a direct result of a system architecture that does not lend itself to being
testable, understandable, or maintainable. Just as important, most projects do
not account for essential quality-control activities and fail to allocate enough
time for every activity to be completed in an impeccable manner.
Decades ago, the industry started developing software to solve the world’s prob-
lems. Today, software development itself is a world-class problem. The problems
of software development frequently manifest themselves in nontechnical ways
such as a high-stress working environment, high turnover rate, burnout, lack of
trust, low self-esteem, and even physical illness.
None of the problems in software development is new.1 Indeed, some people have
spent their entire careers in software development without seeing software done
right even once. This leads them to believe that it simply cannot be done, and they
are dismissive of any attempt to address these issues because “that’s just the way
things are.” They may even fight those who are trying to improve software devel-
opment. They have already concluded that this goal is impossible, so anyone who is
trying to get better results is trying to do the impossible, which insults their intellect.
In recent years, I have noticed that the industry’s problems are getting worse.
More and more software projects fail. These failures are getting more expensive
in both time and money, and even completed projects tend to stray further afield
from their original commitments. The crisis is worsening not just because the
systems are getting bigger or because of the cloud, aggressive deadlines, or higher
rate of change. I suspect the real reason is that the knowledge of how to design
and develop software systems is slowly fading from within the development ranks.
Once, most teams had a veteran who mentored the young and handed down the
tribal knowledge. Nowadays these mentors have moved on or are retiring. In
their absence, the rank and file is left with access to infinite information but zero
knowledge.
1. Edsger W. Dijkstra, “The Humble Programmer: ACM Turing Lecture,” Communications of the ACM
15, no. 10 (October 1972): 859–866.
I wish there were just one thing you could do to fix the software crisis, such as using
a process, a development methodology, a tool, or a technology. Unfortunately,
to fix a multidimensional problem, you need a multidimensional solution. In this
book I offer a unified remedy: righting software.
In the abstract, all I suggest is to design and develop software systems using engi-
neering principles. The good news is that there is no need to reinvent the wheel.
Other engineering disciplines are quite successful, so the software industry can
borrow their key universal design ideas and adapt them to software. You will see
in this book a set of first principles in software engineering, as well as a compre-
hensive set of tools and techniques that apply to software systems and projects. To
succeed, you have to assume an engineering perspective. Ensuring that the soft-
ware system is maintainable, extensible, reusable, affordable, and feasible in terms
of time and risk are all engineering aspects, not technical aspects. These engineer-
ing aspects are traced directly to the design of the system and the project. Since the
term software engineer largely refers to a software developer, the term software
architect has emerged to describe the person in the team who owns all the design
aspects of the project. Accordingly, I refer to the reader as a software architect.
The ideas in this book are not the only things you will need to get right, but they
certainly are a good start because they treat the root cause of the problems men-
tioned earlier. That root cause is poor design, be it of the software system itself
or of the project used to build that system. You will see that it is quite possible to
deliver software on schedule and on budget and to design systems that meet every
conceivable requirement. The results are also systems that are easy to maintain,
extend, and reuse. I hope that by practicing these ideas you will right not just your
system but your career and rekindle your passion for software development.
In most technical books, each chapter addresses a single topic and discusses it in
depth. This makes the book easier to write, but that is typically not how people
learn. In contrast, in this book, the teaching is analogous to a spiral. In both parts
of the book, each chapter reiterates ideas from the previous chapters, going deeper
or developing ideas using additional insight across multiple aspects. This mimics
the natural learning process. Each chapter relies on those that preceded it, so you
should read the chapters in order. Both parts of the book include a detailed case
study that demonstrates the ideas as well as additional aspects. At the same time,
to keep the iterations concise, as a general rule I avoid repeating myself, so
even key points are discussed once.
Chapter 3, Structure
Chapter 3 improves on the ideas of Chapter 2 by introducing structure. You will
see how to capture requirements, how to layer your architecture, the taxonomy
of the components of the architecture, their interrelationships, specific classifi-
cation guidelines, and some related issues such as subsystems design.
Chapter 4, Composition
Chapter 4 shows how to assemble the system components into a valid composi-
tion that addresses the requirements. This short chapter contains several of the
book’s key design principles, and it leverages the previous two chapters into a
powerful mental tool you will use in every system.
APPENDICES
Appendix A, Project Tracking
Appendix A shows you how to track the project’s progress with regard to the
plan and how to take corrective actions when needed. Project tracking is more
about project management than it is about project design, but it is crucial in
assuring you meet your commitments once the work starts.
The techniques and ideas in this book apply regardless of programming language
(such as C++, Java, C#, and Python), platform (such as Windows, Linux, mobile,
on-premise, and cloud), and project size (from the smallest to the largest projects).
They also cross all industries (from healthcare to defense), all business models,
and company sizes (from the startup to the large corporation).
The most important assumption I have made about the reader is that you care
about what you do, at a deep level, and the current failures and waste distress
you. You want to do better but lack guidance or are confused by bad practices.
Bold
Used for defining terms and concepts.
Directive
Used for first principles, design rules, or key guidance and advice.
Reserved Words
Used when referring to reserved words of the methodology.
ACKNOWLEDGMENTS
Let me start by thanking the two who urged me to write the book, each in their
own unique way: Gad Meir and Jarkko Kemppainen.
Thanks go to the development editor and sounding board, Dave Killian: Any more
editing and I would have to list you as a co-author. Next, thanks to Beth Siron
for reviewing the raw manuscript. The following people contributed their time
by reviewing the draft: Chad Michel, Doug Durham, George Stevens, Josh Loyd,
Riccardo Bennett-Lovsey, and Steve Land.
For the beginner architect, there are many options.
1
THE METHOD
The Zen of Architects1 simply states that for the beginner architect, there are
many options of doing pretty much anything. For the master architect, however,
there are only a few good options, and typically only one.
Beginner architects are often perplexed by the plethora of patterns, ideas, meth-
odologies, and possibilities for designing their software system. The software
industry is bursting at the seams with ideas and people eager to learn and improve
themselves, including you who are reading this book. However, since there are so
few correct ways of doing any given design task, you might as well focus only on
those and ignore the noise. Master software architects know to do just that; as if
by supernatural inspiration, they immediately zoom in and yield the correct design
solution.
The Zen of Architects applies not just to the system design but also to the project
that builds it. Yes, there are countless ways of structuring the project and assigning
work to the team members, but are they all equally safe, fast, cheap, useful, effec-
tive, and efficient? The master architect also designs the project to build the system
and even helps management decide if they can afford the project in the first place.
True mastery of any subject is a journey. With very few exceptions, no one is a
born expert. My own career is a case in point. I started as a junior architect almost
30 years ago when the term architect was not commonly used within software
organizations. Moving on first as a project architect then as a division architect,
by the late 1990s I was the chief software architect of a Fortune 100 company in
Silicon Valley. In 2000, I founded IDesign as a company solely dedicated to soft-
ware design. At IDesign, we have since designed hundreds of systems and proj-
ects. While each engagement had its own specific architecture and project plan, I
observed that no matter the customer, the project, the system, the technology, or
the developers, my design recommendations were, in the abstract, the same.
1. https://en.wikipedia.org/wiki/Zen_Mind,_Beginner's_Mind
The answer to the second question is a resounding affirmative. I call the result The
Method, and it is the subject of this book. Having applied The Method across a
multitude of projects, having taught and mentored a few thousand architects the
world over, I can attest that, when applied properly, it works. I am not discounting
here the value of having a good attitude, technical skills, and analytical capabil-
ities. These are necessary ingredients for success regardless of the methodology
you use. Sadly, these ingredients are insufficient; I often see projects fail despite
having people with all these great qualities and attributes. However, when com-
bined with The Method, these ingredients give you a fighting chance. By ground-
ing your design on sound engineering principles, you will learn to steer clear of the
misguided practices and false intuition that are the prevailing wisdom.
With system design, The Method lays out a way of breaking down a big system
into small modular components. The Method offers guidelines for the structure,
role, and semantics of the components and how these components should interact.
The result is the architecture of the system.
With project design, The Method helps you provide management with several
options for building the system. Each option is some combination of schedule,
cost, and risk. Each option also serves as the system assembly instructions, and it
sets up the project for execution and tracking.
Project design is the second part of the book and is far more important for success
than system design. Even a mediocre system design can succeed if the project has
adequate time and resources and if the risk is acceptable. However, even a world-
class system design will fail if the project has inadequate time or resources to build
the system or if the project is too risky. Project design is also more intricate than
system design and, as such, requires additional tools, ideas, and techniques.
Because it combines system and project design, The Method is actually a design
process. Over the years, the software industry has given great attention to the
development process but has devoted little attention to the design process. This
book aims at filling this gap.
DESIGN VALIDATION
Design validation is critical because an organization should not risk having a team
start developing against an inadequate architecture or developing a system the
organization cannot afford to build. The Method supports and enables this critical
task, allowing the architect to assert with reasonable confidence that the proposed
design is adequate; that is, the design fulfills two key objectives. First, the design
must address the customer requirements. Second, the design must address the orga-
nization or the team capabilities and constraints.
Once the coding starts, changing the architecture is often unacceptable due to the
cost and schedule implications. In practice, this means that without system design
validation, there is risk of locking in, at best, an imperfect architecture and, at
worst, a monstrosity. The organization will have to try to live with the resulting
system for the next few years and several versions until the next big rewrite. A
poorly designed software system may seriously damage the business, depriving it
of the ability to respond to business opportunities, and may even financially ruin
it with escalating software maintenance costs.
Early validation of the design is imperative. For example, discovering three years
after the work started that a particular idea or the whole architecture was wrong is
intellectually interesting but of no practical value. Ideally, one week into the proj-
ect, you must know if the architecture is going to hold water (or not). Anything
longer runs the risk of commencing development with a questionable architecture.
The following chapters describe precisely how to validate a system design.
Note that I am referring here to the system design, the architecture, not the detailed
design of the system. Detailed design produces for each component in the archi-
tecture the key implementation artifacts, such as interfaces, class hierarchies, and
data contracts. Detailed design takes longer to produce, can be done during the
project execution, and may change as the system is constructed or evolved.
Similarly, you must validate your project design. Running out of time or running
over budget (or both) mid-project is simply unacceptable. Failing to meet your
commitments will limit your career. You must proactively validate your project
design to ensure that the team at hand can deliver the project.
In addition to providing the architecture and project plans, the objective of The
Method is to remove design as a risk to the project. No project should fail because
the architecture was too complex for developers to build and maintain. The
Method discovers the architecture efficiently and effectively and does so in a short
period of time. The same benefit applies to project design. No project should fail
because it did not have enough time or resources from the start. This book shows
you how to accurately calculate the project duration and costs and how to drive
educated decisions.
TIME CRUNCH
Using The Method, you can produce an entire system design in mere days, typi-
cally in three to five days, with project design taking similar time. Given the lofty
goals of the effort, namely, producing the system architecture and the project plans
for a new system, the duration may look too short. Typical business systems get
the option of a new design only every few years. Why not spend 10 days on the
architecture? Measured against a system lifetime of years, five additional days are
not even a rounding error. However, adding design time often does not improve
the result and can even be detrimental.
The time crunch also helps avoid design gold plating. Parkinson’s law2 states that
work always expands to fill the allotted time. Given 10 days to complete a design
that could be completed in five days, the architect will likely work on the design
for 10 days. The architect will use the extra time to design frivolous aspects that
add nothing but complexity, disproportionally increasing the cost of implementa-
tion and maintenance for years to come. Limiting the design time forces you to
produce a good-enough design.
ELIMINATING ANALYSIS-PARALYSIS
Analysis-paralysis is a predicament that occurs when someone (or a group) who is
otherwise capable, clever, and even hardworking (as are most software architects)
is stuck in a seemingly endless cycle of analysis, design, new revelations, and back
to more analysis. The person or group is effectively paralyzed and precluded from
producing any productive outcome.
When the person or group in charge of producing the design is unaware of the
correct decision tree, they start at some place other than the root of the tree.
Invariably, at some point, a downstream design decision will invalidate a prior
decision; all decisions made in between these two points will be invalid. Designing
this way is akin to performing a bubble sort of the design decision tree. Since
bubble sort roughly involves as many operations as the square of the number of
elements involved, the penalty is severe. A simple software system requiring some
20 system and project design decisions potentially has 400 design iterations if you
do not follow the decision tree. Going through so many meetings (even if you
spread it over time) is paralysis. Being given the time to perform even 40 iterations
is unlikely. When the system and project design effort is out of time, development
will commence with the system and the project in an immature state. This defers
discoveries that invalidate the design decisions to an even worse point in the future
when time, effort, and artifacts already are associated with the incorrect choices.
In essence, you have maximized the cost of the incorrect design decision.
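Stated as back-of-the-envelope arithmetic (these are the same numbers as in the paragraph above, with n the number of system and project design decisions; the quadratic growth is the bubble-sort analogy, not an exact count):

$$\text{design iterations} \approx n^{2}, \qquad n = 20 \;\Rightarrow\; 20^{2} = 400 \gg 40$$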
The Method provides the decision tree of a typical business system both for the
system design and for the project design. Only after you have designed the system
is there any point in designing the project to build that system. Each of these
design efforts, both the system and the project, has its own subtree of design deci-
sions. The Method guides you through it, starting at the root, avoiding rework and
reevaluation of prior decisions.
One of the most valuable techniques in pruning the decision tree is the application
of constraints. As pointed out by Dr. Frederick Brooks,3 contrary to common wisdom
or intuition, the worst design problem is a clean canvas. Without constraints,
the design should be easy, right? Wrong. The clean canvas should terrify every
architect. There are infinite ways of getting it wrong or going against unstated
constraints. The more constraints there are, the easier the design task is. The less
leeway allowed, the more obvious and clear the design. In a totally constrained
system, there is nothing to design: it is what it is. Since there are always constraints
(whether explicit or implicit), by following the design decision tree, The Method
places increasing constraints on the system and the project, to the point that the
design converges and resolves quickly.
COMMUNICATION
An important advantage of The Method is in communicating design ideas. Once
participants are familiar with the structure of the architecture and the design
semantics, The Method enables sharing design ideas and precisely conveying what
the design requires. You can communicate the thought process behind the design
to the team. You should share the tradeoffs and insights that guided you in the
architecture, documenting in an unambiguous way the operational assumptions
and the resulting design decisions.
This level of clarity and transparency in design intent is critical for architecture
survival. A good design is one that was well conceived, survived through devel-
opment, and ended up as working bits on customer machines. You must be able
to communicate the design to the developers and ensure that they value the intent
and the concepts behind the design. You must enforce the design by using reviews,
inspection, and mentoring. The Method excels at this type of communication
because of the combination of well-defined service semantics and structure.
Rest assured that if the developers who are tasked with building the system do not
understand and value the design, they will butcher it. No amount of design or code
3. Frederick P. Brooks Jr., The Design of Design: Essays from a Computer Scientist (Upper Saddle River,
NJ: Addison-Wesley, 2010).
review can ever fix that butchery. The purpose of reviews should be to catch the
unintended deviation from the architecture as early as possible.
The same holds true when it is time to communicate the project plan to proj-
ect managers, managers, or other stakeholders. Clear, unambiguous, comparable
options are key to educated decisions. When people make the wrong decisions, it
is often because they do not understand the project and have the wrong mental
model for how projects behave. By producing the correct models for the project
across time, cost, and risk, the architect can enable the right decision. The Method
provides the right vocabulary and metrics for communicating with decision mak-
ers in a simple and concise way. Once managers are exposed to the possibilities
of project design, they will become its greatest advocates and insist on working
that way. No amount of passionate arguments can accomplish what a simple set
of charts and numbers can. Moreover, project design is important not only at
the beginning of the project. As the work commences, you can use the tools of
project design to communicate to management the effect and viability of changes.
Appendix A discusses project tracking and managing changes.
The Method does not take away the architect’s creativity and effort in producing
the right architecture. The architect is still responsible for distilling the required
behavior of the system. The architect is still liable for getting the architecture
wrong or for failing to communicate the design to developers or for failing to lead
the development effort until delivery without compromising the architecture, all
in the face of mounting pressure. Furthermore, as illustrated in the second part of
this book, the architect must produce a viable project design, stemming out of the
architecture. The architect must calibrate the project to the available resources, to
what the resources can produce, to the risks involved, and to the deadline. Going
through the motions of project design for their own sake is pointless. The architect
must eliminate any bias and produce the correct set of planning assumptions and
resulting calculations.
The Method provides a good starting point for system and project design, along
with a list of the things to avoid. However, The Method works only as long as you
do it truthfully while devoting the time and mental energy to gather the required
information. You must fundamentally care about the design process and what it
produces.
Part I
SYSTEM DESIGN
2
DECOMPOSITION
Software architecture is the high-level design and structure of the software system.
While designing the system is quick and inexpensive compared with building the
system, it is critical to get the architecture right. Once the system is built, if the
architecture is defective, wrong, or just inadequate for your needs, it is extremely
expensive to maintain or extend the system.
The essence of the architecture of any system is the breakdown of the concept of the
system as a whole into the components that compose it, be it a car, a house, a laptop, or a
software system. A good architecture also prescribes how these components interact
at run-time. The act of identifying the constituent components of a system is called
system decomposition.
In years past, these building blocks were C++ objects and later COM, Java, or
.NET components. In a modern system and in this book, services (as in service-
orientation) are the most granular unit of the architecture. However, the tech-
nology used to implement the components and their details (such as interfaces,
operations, and class hierarchies) are detailed design aspects, not system decom-
position. In fact, such details can change without ever affecting the decomposition
and therefore the architecture.
Unfortunately, the majority, if not the vast majority, of all software systems are
not designed correctly and arguably are designed in the worst possible way. The
design flaws are a direct result of the incorrect decomposition of the systems.
This chapter therefore starts by explaining why the common ways of decomposi-
tion are flawed to the core and then discusses the rationale behind The Method’s
decomposition approach. You will also see some powerful and helpful techniques
to leverage when designing the system.
Precluding Reuse
Consider a simple functionally decomposed system that uses three services A,
B, and C, which are called in the order of A then B then C. Because functional
decomposition is also decomposition based on time (call A and then call B), it
effectively precludes individual reuse of services. Suppose another system also
needs a B service (such as Billing). Built into the fabric of B is the notion that
it was called after an A and before a C service (such as first Invoicing, and only
then Billing against an invoice, and finally Shipping). Any attempt to lift the
B service from the first system and drop it in the second system will fail because,
in the second system, no one is doing A before it and C after it. When you lift the B
service, the A and the C services are hanging off it. B is not an independent reusable
service at all—A, B, and C are a clique of tightly coupled services.
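As a concrete illustration, here is a minimal sketch of what such a B service tends to look like in code. The Java below and all of its names are hypothetical, invented only for this example; nothing here is code prescribed by The Method. The point is simply that Billing's contract presumes an Invoicing step before it and a Shipping step after it.

```java
// Hypothetical sketch of a functionally decomposed "B" service (Billing).
// Its contract presumes an upstream "A" (Invoicing) and a downstream "C" (Shipping),
// so it cannot be lifted into another system on its own.
public final class BillingService {

    // The only valid input is an invoice id, which exists only because Invoicing ran first.
    public BillingReceipt bill(long invoiceId, String customerId) {
        double amount = lookupInvoiceAmount(invoiceId); // reads A's output
        charge(customerId, amount);
        // The output is shaped for Shipping: in a system with no Shipping step,
        // this flag is meaningless, yet it is baked into B's contract.
        return new BillingReceipt(invoiceId, true /* clearedForShipping */);
    }

    private double lookupInvoiceAmount(long invoiceId) { return 0.0; } // stub
    private void charge(String customerId, double amount) { }          // stub

    public record BillingReceipt(long invoiceId, boolean clearedForShipping) { }
}
```

Dropping this class into a system that bills against something other than an invoice, or that never ships anything, requires rewriting both its inputs and its outputs, which is exactly the coupling described above.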
Functional decomposition, therefore, tends to make services either too big and too
few or too small and too many. You often see both afflictions side by side in the
same system.
Figure 2-1 The client orchestrating the functional services A, B, and C
By bloating the client with the orchestration logic, you pollute the client code with
the business logic of the system. The client is no longer just about invoking opera-
tions on the system or presenting information to users. The client is now intimately
aware of all internal services, how to call them, how to handle their errors, how to
compensate for the failure of B after the success of A, and so on. Calling the ser-
vices is almost always synchronous because the client proceeds along the expected
sequence of A then B then C, and it is difficult otherwise to ensure the order of the
calls while remaining responsive to the outside world. Furthermore, the client is
now coupled to the required functionality. Any change in the operations, such as
calling B' instead of B, forces the client to reflect that change. The hallmark of a
bad design is when any change to the system affects the client. Ideally, the client
and services should be able to evolve independently. Decades ago, software engi-
neers discovered that it was a bad idea to include business logic with the client.
Yet, when designed as in Figure 2-1, you are forced to pollute the client with the
business logic of sequencing, ordering, error compensation, and duration of the
calls. Ultimately, the client is no longer the client—it has become the system.
What if there are multiple clients (e.g., rich clients, web pages, mobile devices),
each trying to invoke the same sequence of functional services? You are destined
to duplicate that logic across the clients, making maintenance of all those clients
wasteful and expensive. As the functionality changes, you now are forced to keep
up with that change across multiple clients, since all of them will be affected.
Often, once that is the case, developers try to avoid any changes to the functional-
ity of the services because of the cascading effect it will have on the clients. With
the multiplicity of clients, each with its own version of the sequencing tailored to
its needs, it becomes even more challenging to change or interchange services, thus
precluding reuse of the same behavior across the clients. Effectively, you end up
maintaining multiple complex systems, trying to keep them all in sync. Ultimately,
this leads to both stifling of innovation and increased time to market when the
changes are forced through development and production.
Figure 2-2 The three massive classes: MainForm, FormSetup, and Resources
Ideally, Resources should have been trivial, comprising simple lists of images
and strings. The rest of the system is made up of dozens of small, simple classes,
each devoted to a particular functionality. The smaller classes are literally in the
shadow of the three massive ones. However, while each of the small classes may
be trivial, the sheer number of the smaller classes is a complexity issue all on its
own, involving intricate integration across that many classes. The result is both
too many components and too big components as well as a bloated client.
An alternative is to chain the functional services to one another. The advantage of doing so is that you get to keep the clients simple and even
asynchronous: the clients issue the call to the A service. The A service then calls
B, and B calls C.
Figure 2-3 Chaining the functional services: the client calls A, A calls B, and B calls C
The problem now is that the functional services are coupled to each other and to
the order of the functional calls. For example, you can call the Billing service
only after the Invoicing service but before the Shipping service. In the case
of Figure 2-3, built into the A service is the knowledge that it needs to call the B
service. The B service can be called only after the A service and before the C ser-
vice. A change in the required ordering of the calls is likely to affect all services up
and down the chain because their implementation will have to change to reflect
the new required order.
But Figure 2-3 does not reveal the full picture. The B service of Figure 2-3 is dras-
tically different from that of Figure 2-1. The original B service performed only the
B functionality. The B service in Figure 2-3 must be aware of the C service, and the B
contract must contain the parameters that will be required by the C service to perform
its functionality. These details were the responsibility of the client in Figure 2-1.
The problem is compounded by the A service, which must now accommodate in
its service contract the parameters required for calling the B and the C services for
them to perform their respective business functionality. Any change to the B and C
functionality is reflected in a change to the implementation of the A service, which is
now coupled to them. This kind of bloating and coupling is depicted in Figure 2-4.
Figure 2-4 Bloating and coupling of the chained functional services
Sadly, even Figure 2-4 does not tell the whole truth. Suppose the A service per-
formed the A functionality successfully and then proceeded to calling the B service
to perform the B functionality. The B service, however, encountered an error and
failed to execute properly. If A called B synchronously, then A must be intimately
aware of the internal logic and state of B in order to recover its error. This means
the B functionality must also reside in the A service. If A called B asynchronously,
then the B service must now somehow reach back to the A service and undo the
A functionality or contain the rollback of A within itself. In other words, the A
functionality also resides in the B service. This creates tight coupling between the
B service and the A service and bloats the B service with the need to compensate
for the success of the A service. This situation is shown in Figure 2-5.
Figure 2-5 Additional bloating and coupling due to compensation
The issue is compounded in the C service. What if both the A and B functionalities
succeeded and completed, but the C service failed to perform its business function?
The C service must reach back to both the B and the A services to undo their oper-
ations. This creates far more bloating in the C service and couples it to the A and B
services. Given the coupling and bloating in Figure 2-5, what will it take to replace
the B service with a B' service that performs the functionality differently than
B? What will be the adverse effects on the A and C services? Again, what degree
of reuse exists in Figure 2-5 when the functionality in the services is asked for in
other contexts, such as calling the B service after the D service and before the E
service? Are A, B, and C three distinct services or just one fused mess?
In a functional decomposition, you identify the required functionalities and then create a component in your architecture for each.
Functional decomposition (and its kin, the domain decomposition discussed later)
is how most systems are designed. Most people choose functional decomposition
naturally, and it is likely what your computer science professor showed you in
school. The prevalence of functional decomposition in poorly designed systems
makes a near-perfect indicator of something to avoid. At all costs, you must resist
the temptations of functional decomposition.
Design, by its very nature, is a high-added-value activity. You are reading this
book instead of yet another programming book because you value design, or put
differently, you think design adds value, or even a lot of value.
The problem with functional decomposition is that it endeavors to cheat the first
law of thermodynamics. The outcome of a functional decomposition, namely,
system design, should be a high-added-value activity. However, functional decom-
position is easy and straightforward: given a set of requirements that call for
performing the A, B, and C functionalities, you decompose into the A, B, and
C services. “No sweat!” you say. “Functional decomposition is so easy that a
tool could do it.” However, precisely because it is a fast, easy, mechanistic, and
straightforward design, it also manifests a contradiction to the first law of ther-
modynamics. Since you cannot add value without effort, the very attributes that
make functional decomposition so appealing are those that preclude functional
decomposition from adding value.
The second is to perform an anti-design effort. Inform the team that you are con-
ducting a design contest for the next-generation system. Split the team into halves,
each in a separate conference room. Ask the first half to produce the best design
for the system. Ask the second half to produce the worst possible design: a design
that will maximize your inability to extend and maintain the system, a design
that will disallow reuse, and so on. Let them work on it for one afternoon and
then bring them together. When you compare the results, you will usually see they
have produced the same design. The labels on the components may differ, but
the essence of the design will be the same. Only now confess that they were not
working on the same problem and discuss the implications. Perhaps a different
approach is called for this time.
Figure 2-6 Functional decomposition of a house
While Figure 2-6 is already preposterous, the true insanity becomes evident only
when it is time to build this house. You start with a clean plot of land and build
cooking. Just cooking. You take a microwave oven out of its box and put it aside.
Pour a small concrete pad, build a wood frame on the pad, cover it with a counter-
top, and place the microwave on it. Build a small pantry for the microwave, hammer
a tiny roof over it, and connect just the microwave to the power grid. “We
have cooking!” you announce to the boss and customers.
But is cooking really done? Can cooking ever be done this way? Where are you
serving the meal, storing the leftovers, or disposing of trash? What about cook-
ing over the gas stove? What will it take to duplicate this feat for cooking over
the stove? What degree of reuse can you have between the two separate ways of
expressing the functionality of cooking? Can you extend any one of them easily?
What about cooking with a microwave somewhere else? What does it take to
relocate the microwave? All of this mess is not even the beginning because it all
depends on the type of cooking you perform. You need to build separate cooking
functionality, perhaps, if cooking involves multiple appliances and differs by con-
text—for example, if you are cooking breakfast, lunch, dinner, dessert, or snacks.
You end up either with an explosion of minute cooking services, each dedicated to
a specific scenario that must be known in advance, or with a massive cooking
service that has it all. Will you ever build a house like that? If not, why
design and build a software system that way?
In fact, every one of the functional areas of Figure 2-6 can be mapped to domains
in Figure 2-7, which presents severe problems. While each bedroom may be unique,
you must duplicate the functionality of sleeping in all of them. Further duplica-
tion occurs when sleeping in front of the TV in the living room or when enter-
taining guests in the kitchen (as almost all house parties end up in the kitchen).
Each domain often devolves into an ugly grab bag of functionality, increasing
the internal complexity of the domain. The increased inner complexity causes
you to avoid the pain of cross-domain connectivity, and communication across
domains is typically reduced to simple state changes (CRUD-like) rather than
actions triggering required behavior execution involving all domains. Composing
more complex behaviors across domains is very difficult. Some functionalities are
simply impossible in such domain decompositions. For example, in the house in
Figure 2-7, where would you perform cooking that cannot take place in the kitchen
(e.g., a barbecue)?
Then you move on to the bedroom. You first bust the stucco off the kitchen walls
to expose the bolts connecting the walls to the foundation and unbolt the kitchen
from the foundation. You disconnect the kitchen from the power supply, gas sup-
ply, water supply, and sewer discharge and then use expensive hydraulic jacks to
lift the kitchen. While suspending the kitchen in midair, you shift it to the side so
that you can demolish the foundation for the kitchen with jackhammers, hauling
the debris away and paying expensive dump fees. Now you can dig a new trench
that will contain a continuous foundation for the bedroom and the kitchen. You
pour concrete into the trenches to cast the new foundation and add the bolts hope-
fully at exactly the same spots as before. Next, you very carefully lower the kitchen
back on top of the new foundation, making sure all the bolt holes align (this is next
to impossible). You erect new walls for the bedroom. You temporarily remove the
cabinets from the kitchen walls; remove the drywall to expose the inner electrical
wires, pipes, and ducts; and connect the ducts, plumbing, and wires to those of
the bedroom. You add drywall in the kitchen and the bedroom, rehang the kitchen
cabinets, and add closets in the bedroom. You knock down any remaining stucco
from the walls of the kitchen so that you can apply continuous, crack-free stucco
on the outside walls. You must convert several of the previous outside walls of the
kitchen to internal walls now, with implications on stucco, insulation, paint, and
so on. You remove the roof of the kitchen and build a new continuous roof over
the bedroom and the kitchen. You announce to the customer that milestone 2.0 is
met, and Bedroom 1 is done.
The fact that you had to rebuild the kitchen is not disclosed. The fact that building
the kitchen the second time around was much more expensive and riskier than the
first time is also undisclosed. What will it take to add another bedroom to this
house? How many times will you end up building and demolishing the kitchen?
How many times can you actually rebuild the kitchen before it crumbles into a
shifting pile of useless debris? Was the kitchen really done when you announced it
so? Rework penalties aside, what degree of reuse is there between the various parts
of the house? How much more expensive is building a house this way? Why would
it make sense to build a software system this way?
FAULTY MOTIVATION
The motivation for functional or domain decomposition is that the business or
the customer wants its feature as soon as possible. The problem is that you can
never deploy a single feature in isolation. There is no business value in Billing
independent from Invoicing and Shipping.
The situation is even worse when legacy systems are involved. Rarely do develop-
ers get the privilege of a completely new, green-field system. Most likely there is an
existing, decaying system that was designed functionally whose inflexibility and
maintenance costs justify the new system.
The sad reality is that unit testing is borderline useless. While unit testing is an
essential part of testing, it cannot really test a system. Consider a jumbo jet that
has numerous internal components (pumps, actuators, servos, gears, turbines,
etc.). Now suppose all components have independently passed unit testing per-
fectly, but that is the only testing that took place before the components were
assembled into an aircraft. Would you dare board that airplane? The reason unit
testing is so marginal is that in any complex system, the defects are not going to be
in any of the units but rather are the result of the interactions between the units.
This is why you instinctively know that, while each component in the jumbo jet
example works, the aggregate could be horribly wrong. Worse, even if the complex
system is at a perfect state of impeccable quality, changing a single, unit-tested
component could break some other unit(s) relying on an old behavior. You must
repeat testing of all units when changing a single unit. Even then it would be mean-
ingless because the change to one of the components could affect some interaction
between other components or a subsystem, which no unit testing could discover.
The only way to verify change is full regression testing of the system, its subsys-
tems, its components and interactions, and finally its units. If, as a result of your
change, other units need to change, the effect on regression testing is nonlinear.
The inefficacy of unit testing is not a new observation and has been demonstrated
across thousands of well-measured systems.
I find that not only can the industry borrow from the physical world experi-
ence and best practices; it must do so. Contrary to intuition, software requires
design even more than physical systems do. The reason is simple: complexity.
The complexity of physical systems such as typical houses is capped by phys-
ical constraints. You cannot have a poorly designed house with hundreds of
interconnecting corridors and rooms. The walls will either weigh too much,
have too many openings, be too thin, have doors too small, or cost too much to
assemble. You cannot use too much building material because the house will
implode, or you will not have the cash flow to buy them or a place to store the
extra material on-site.
• The users of the system utilize a browser to connect to the system and manage
connected sessions, completing a form and submitting the request.
• After a trade, report, or analysis request, the system sends an email to the users
confirming their request or containing the results.
• The data should be stored in a local database.
A straightforward functional decomposition would yield the design of Figure 2-8.
Figure 2-8 Functional decomposition of the trading system: a Web Portal client; Buying Stocks, Selling Stocks, Trade Scheduling, Reporting, and Analyzing components; and the Trades and Reports databases
What will it take to change the client from a web portal to a mobile device? Would
that not mean duplicating the business logic into the mobile device? It is likely that
little of the business logic and the effort invested in developing it for the web client
can be salvaged and reused in the mobile client because it is embedded in the web
portal. Over time, the developers will end up maintaining several versions of the
business logic in multiple clients.
Per the design decision, the data is stored in a database, and Buying Stocks,
Selling Stocks, Trade Scheduling, Reporting, and Analyzing all
access that database. Now suppose you decide to move the data storage from
the local database to a cloud-based solution. At the very least, this will force you
to change the data-access code in Buying Stocks, Selling Stocks, Trade
Scheduling, Reporting, and Analyzing to go from a local database to a
cloud offering. The way you structure, access, and consume the data has to change
across all components.
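The following sketch shows why the storage change ripples this way. The Java and the names are hypothetical, merely illustrating the kind of duplication Figure 2-8 implies; this is not code from the book.

```java
// Hypothetical sketch: each functional component of the trading system opens its
// own connection to the local database, so a move to cloud storage (or any change
// in how the data is structured and accessed) touches every one of them.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

class BuyingStocks {
    void buy(String symbol, int quantity) throws SQLException {
        try (Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/trades")) {
            // local-database-specific SQL, schema knowledge, and error handling live here
        }
    }
}

class Reporting {
    String dailyReport() throws SQLException {
        // ...and the same storage decision is repeated here, and again in
        // SellingStocks, TradeScheduling, and Analyzing.
        try (Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/trades")) {
            return "report";
        }
    }
}
```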
What if the client wishes to interact with the system asynchronously, issuing a few
trades and collecting the results later? You built the components with the notion
of a connected, synchronous client that orchestrates the components. You will
likely need to rewrite Buying Stocks, Selling Stocks, Trade Scheduling,
Reporting, and Analyzing activities to orchestrate each other, along the lines
of Figure 2-5.
Even without branching to commodities, what if you must localize the application
to foreign markets? At the very least, the client will need a serious makeover to
accommodate language localization, but the real effect is going to be the system
components again. Foreign markets are going to have different trading rules, reg-
ulations, and compliance requirements, drastically affecting what the system is
allowed to do and how it is to go about trading. This will mean much rework to
Buying Stocks, Selling Stocks, Trade Scheduling, Reporting, and
Analyzing whenever entering a new locale. You are going to end up with either
bloated god services that can trade in any market or a version of the system for
each deployment locale.
Finally, all components presently connect to some stock ticker feed that provides
them with the latest stock values. What is required to switch to a new feed provider
or to incorporate multiple feeds? At the very least, Buying Stocks, Selling
Stocks, Trade Scheduling, Reporting, and Analyzing will require work
to move to a new feed, connect to it, handle its errors, pay for its service, and so on.
There are also no guarantees that the new feed uses the same data format as the old
one. All components require some conversion and transformation work as well.
VOLATILITY-BASED DECOMPOSITION
The Method’s design directive is to decompose based on volatility.
When you use volatility-based decomposition, you start thinking of your system
as a series of vaults, as in Figure 2-9.
Any change is potentially very dangerous, like a hand grenade with the pin pulled
out. Yet, with volatility-based decomposition, you open the door of the appropri-
ate vault, toss the grenade inside, and close the door. Whatever was inside the vault
may be destroyed completely, but there is no shrapnel flying everywhere, destroy-
ing everything in its path. You have contained the change.
When a change occurs, by the very nature of the functional decomposition, it affects multiple (if not most) of the components in your
architecture. Functional decomposition therefore tends to maximize the effect of
the change. Since most software systems are designed functionally, change is often
painful and expensive, and the system is likely to resonate with the change. Changes
made in one area of functionality trigger other changes and so on. Accommodating
change is the real reason you must avoid functional decomposition.
All the other problems with functional decomposition pale when compared with
the poor ability and high cost of handling change. With functional decomposition,
a change is like swallowing a live hand grenade.
What you choose to encapsulate can be functional in nature, but hardly ever is
it domain-functional; that is, it carries no meaning for the business itself. For example,
the electricity that powers a house is indeed an area of functionality but is also an
important area to encapsulate for two reasons. The first reason is that power in
a house is highly volatile: power can be AC or DC; 110 volts or 220 volts; single
phase or three phases; 50 hertz or 60 hertz; produced by solar panels on the roof,
a generator in the backyard, or plain grid connectivity; delivered on wires with
different gauges; and on and on. All that volatility is encapsulated behind a recep-
tacle. When it is time to consume power, all the user sees is an opaque receptacle,
encapsulating the power volatility. This decouples the power-consuming appli-
ances from the power volatility, increasing reuse, safety, and extensibility while
reducing overall complexity. It makes using power in one house indistinguishable
from using it in another, highlighting the second reason it is valid to identify power
as something to encapsulate in the house. While powering a house is an area of
functionality, in general, the use of power is not specific to the domain of the house
(the family living in the house, their relationships, their wellbeing, property, etc.).
What would it be like to live in a house where the power volatility was not encap-
sulated? Whenever you wanted to consume power, you would have to first expose
the wires, measure the frequency with an oscilloscope, and certify the voltage
with a voltmeter. While you could use power that way, it is far easier to rely on the
encapsulation of that volatility behind the receptacle, allowing you instead to add
value by integrating power into your tasks or routine.
Even before maintenance ever starts, when the system is under development, func-
tional decomposition harbors danger. Requirements will change throughout devel-
opment (as they invariably do), and the cost of each change is huge, affecting multiple
areas, forcing considerable rework, and ultimately endangering the deadline.
UNIVERSAL PRINCIPLE
The merits of volatility-based decomposition are not specific to software systems.
They are universal principles of good design, from commerce to business interac-
tions to biology to physical systems and great software. Universal principles, by
their very nature, apply to software too (else they would not be universal). For
example, consider your own body. A functional decomposition of your own body
would have components for every task you are required to do, from driving to
programming to presenting, yet your body does not have any such components.
You accomplish a task such as programming by integrating areas of volatility.
For example, your heart provides an important service for your system: pumping
blood. Pumping blood has enormous volatility to it: high blood pressure and low
pressure, salinity, viscosity, pulse rate, activity level (sitting or running), with and
without adrenaline, different blood types, healthy and sick, and so on. Yet all that
volatility is encapsulated behind the service called the heart. Would you be able to
program if you had to care about the volatility involved in pumping blood?
You can also integrate into your implementation external areas of encapsulated
volatility. Consider your computer, which is different from literally any other com-
puter in the world, yet all that volatility is encapsulated. As long as the computer
can send a signal to the screen, you do not care what happens behind the graphic
port. You perform the task of programming by integrating encapsulated areas of
volatility, some internal, some external. You can reuse the same areas of volatility
(such as the heart) while performing other functionalities such as driving a car or
presenting your work to customers. There is simply no other way of designing and
building a viable system.
You may initially find yourself struggling to wrap your head around this concept as you try to identify the areas
of volatility in your current system. Consequently, volatility-based decomposition
takes longer compared with functional decomposition.
Note that volatility-based decomposition does not mean you should ignore the
requirements. You must analyze the requirements to recognize the areas of vola-
tility. Arguably, the whole purpose of requirements analysis is to identify the areas
of volatility, and this analysis requires effort and sweat. This is actually great news
because now you are given a chance to comply with the first law of thermodynam-
ics. Sadly, merely sweating on the problem does not mean a thing. The first law
of thermodynamics does not state that if you sweat on something, you will add
value. Adding value is much more difficult. This book provides you with powerful
mental tools for design and analysis, including structure, guidelines, and a sound
engineering methodology. These tools give you a fighting chance in your quest to
add value. You still must practice and fight.
The 2% Problem
With every knowledge-intensive subject, it takes time to become proficient and
effective and even more to excel at it. This is true in areas as varied as kitchen
plumbing, internal medicine, and software architecture. In life, you often choose
not to pursue certain areas of expertise because the time and cost required to mas-
ter them would dwarf the time and cost required to utilize an expert. For example,
precluding any chronic health problem, a working-age person is sick for about a
week a year. A week a year of downtime due to illness is roughly 2% of the working
year. So, when you are sick, do you open up medicine books and start reading, or
do you go and see a doctor? At only 2% of your time, the frequency is low enough
(and the specialty bar high enough) that there is little sense in doing anything other
than going to the doctor. It is not worth your while to become as good as a doctor.
If, however, you were sick 80% of the time, you might spend a considerable portion
of your time educating yourself about your condition, possible complications, treat-
ments, and options, often to the point of sparring with your doctor. Your innate
propensity for anatomy and medicine has not changed; only your degree of invest-
ment has (hopefully, you will never have to be really good at medicine).
Similarly, when your kitchen sink is clogged somewhere behind the garbage dis-
posal and the dishwasher, do you go to the hardware store, purchase a P-trap, an
S-trap, various adapters, three different types of wrenches, various O-rings and
other accessories, or do you call a plumber? It is the 2% problem again: it is not
worth your while learning how to fix that sink if it is clogged less than 2% of the
time. The moral is that when you spend 2% of your time on any complex task, you
will never be any good at it.
When the manager throws up his or her hands, saying, “I don't understand why
this is taking so long,” the manager really does not understand why you cannot
just do A, then B, and then C. Do not be upset. You should expect this behavior
and resolve it correctly by educating your manager and peers who, by their own
admission, do not understand.
2. Justin Kruger and David Dunning, “Unskilled and Unaware of It: How Difficulties in Recognizing
One’s Own Incompetence Lead to Inflated Self-Assessments,” Journal of Personality and Social
Psychology 77, no. 6 (1999): 1121–1134.
Fighting Insanity
Albert Einstein is credited with saying that doing things the same way while
expecting better results is the definition of insanity. Since the manager typically
expects you to do better than last time, you must point out the insanity of pursu-
ing functional decomposition yet again and explain the merits of volatility-based
decomposition. In the end, even if you fail to convince a single person, you should
not simply follow orders and dig the project into an early grave. You must still
decompose based on volatility. Your professional integrity (and ultimately your
sanity and long-term peace of mind) is at stake.
IDENTIFYING VOLATILITY
The rest of this chapter provides you with a set of tools to use when you go search-
ing for and identifying areas of volatility. While these techniques are valuable and
effective in their own right, they are somewhat loose. The next chapter introduces
structure and constraints that allow for quicker and repeatable identification of
areas of volatility. However, that discussion merely fine-tunes and specializes the
ideas in this section.
A XES OF VOLATILITY
Finding areas of volatility is a process of discovery that takes place during require-
ments analysis and interviews with the project stakeholders.
There is a simple technique I call axes of volatility. This technique examines the
ways the system is used by customers. Customer in this context refers to a con-
sumer of the system, which could be a single user or a whole other business entity.
In any business, there are only two ways your system could face change: the first
axis is at the same customer over time. Even if presently the system is perfectly
aligned with a particular customer’s needs, over time, that customer’s business
context will change. Even the use of the system by the customer will often change
the requirements against which it was written in the first place.3 Over time, the
customer’s requirements and expectation of the system will change.
The second way change could come is at the same time across customers. If you
could freeze time and examine your customer base, are all your customers now
using the system in exactly the same way? What are some of them doing that is
different from the others? Do you have to accommodate such differences? All such
changes define the second axis of volatility.
When searching for potential volatility in interviews, you will find it very helpful
to phrase the questions in terms of the axes of volatility (same customer over time,
all customers at the same point in time). Framing the questions in this way helps
you identify the volatilities. If something does not map to the axes of volatility,
you should not encapsulate it at all, and there should be no building block in your
system to which it is mapped. Creating such a block would likely indicate func-
tional decomposition.
Design Factoring
Often, the act of looking for areas of volatility using the axes of volatility is an
iterative process interleaved with the factoring of the design itself. Consider, for
example, the progression of design iterations in Figure 2-10.
3. The tendency of a solution to change the requirements against which it was developed was first
observed by the 19th-century English economist William Jevons with regard to coal production, and
it has since been referred to as the Jevons paradox. Other manifestations are the increase in paper consump-
tion with the digital office and the worsening traffic congestion following an increase in road capacity.
Your first take of the proposed architecture might look like diagram A—one big
thing, one single component. Ask yourself, Could you use the same component, as
is, with a particular customer, forever? If the answer is no, then why? Often, it is
because you know that customer will, over time, want to change a specific thing.
In that case, you must encapsulate that thing, yielding diagram B. Ask yourself
now, Could you use diagram B across all customers now? If the answer is no, then
identify the thing that the customers want to do differently, encapsulate it, and
produce diagram C. You keep factoring the design that way until all possible
points on the axes of volatility are encapsulated.
Now, even at the same point in time, is your house the same as every other house?
Other houses have a different structure, so the structure of the house is volatile.
Even if you were to copy and paste your house to another city, would it be the
same house?4 The answer is clearly negative. The house will have different neigh-
bors and be subjected to different city regulations, building codes, and taxes.
Figure 2-12 shows this possible decomposition along the second axis of volatility
(different customers at the same point in time).
Note the independence of the axes. The city where you live over time does change
its regulations, but the changes come at a slow pace. Similarly, the likelihood of
new neighbors is fairly low as long as you live in the same house but is a certainty
if you compare your house to another at the same point in time. The assignment of
a volatility to one of the axes is therefore not an absolute exclusion but rather one
of disproportionate probability.
Note also that the Neighbors Volatility component can deal with volatility
of neighbors at the same house over time as easily as it can do that across different
houses at the same point in time. Assigning the component to an axis helps to
discover the volatility in the first place; the volatility is just more apparent across
different houses at the same point in time.
Finally, in sharp contrast to the decompositions of Figure 2-6 and Figure 2-7,
in Figure 2-11 and Figure 2-12 there is no component in the decomposition for
cooking or kitchen. In a volatility-based decomposition, the required behavior is
accomplished by an interaction between the various encapsulated areas of volatil-
ity. Cooking dinner may be the product of an interaction between the occupants,
the appliances, the structure, and the utilities. Since something still needs to man-
age that interaction, the design is not complete. The axes of volatility are a great
starting point, but they are not the only tool to bring to bear on the problem.
4. The ancient Greeks grappled with this question in Theseus’s paradox (https://en.wikipedia.org/wiki/
Ship_of_Theseus).
The real requirement for any house is to take care of the well-being of the occu-
pants, not just their caloric intake. The house should not be too cold or too warm
or too humid or too dry. While the customers may only discuss cooking and never
discuss temperature control, you should recognize the real volatility, well-being,
and encapsulate that in the Wellbeing component of your architecture.
The fact that requirements specifications have all those solutions masquerading
as requirements is actually a blessing in disguise because you can generalize the
example of cooking in the house into a bona fide analysis technique for discover-
ing areas of volatility. Start by pointing out the solutions masquerading as require-
ments, and ask whether there are other possible solutions. If so, what are the
real requirements and the underlying volatility? Once you identify the volatility,
you must determine whether the need to address that volatility is a true requirement or merely a speculative possibility.
VOLATILITIES LIST
Prior to decomposing a system and creating an architecture, you should simply
compile a list of the candidate areas of volatility as a natural part of requirements
gathering and analysis. You should approach the list with an open mind. Ask
what could change along the axes of volatility. Identify solutions masquerading as
requirements, and apply the additional techniques described later in this chapter.
The list is a powerful instrument for keeping track of your observations and orga-
nizing your thoughts. Do not commit yet to the actual design. All you are doing is
maintaining a list. Note that while the design of the system should not take more
than a few days, identifying the correct areas of volatility may take considerably
longer.
• User volatility. The traders serve end customers on whose portfolios they oper-
ate. The end customers are also likely interested in the current state of their
funds. While they could write the trader a letter or call, a more appropriate
means would be for the end customers to log into the system to see the current
balance and the ongoing trades. Even though the requirements never stated
anything about end customer access (the requirements were for professional
traders), you should contemplate such access. While the end customers may not
be able to trade, they should be able to see the status of their accounts. There
could also be system administrators. There is volatility in the type of user.
• Client application volatility. Volatility in users often manifests in volatility in
the type of client application and technology. A simple web page may suffice for
external end customers looking up their balance. However, professional trad-
ers will prefer a multi-monitor, rich desktop application with market trends,
account details, market tickers, newsfeed, spreadsheet projection, and propri-
etary data. Other users may want to review the trades on mobile devices of
various types.
• Security volatility. Volatility in users implies volatility in how the users authen-
ticate themselves against the system. The number of in-house traders could be
small, from a few dozen to a few hundred. The system, however, could have
millions of end customers. The in-house traders could rely on domain accounts
for authentication, but this is a poor choice for the millions of customers access-
ing information through the Internet. For Internet users, perhaps a simple user
name and password will do, or maybe some sophisticated federated security
single sign-on option is needed. Similar volatility exists with authorization
options. Security is volatile.
• Notification volatility. The requirements specify that the system is to send an
email after every request. However, what if the email bounces? Should the sys-
tem fall back to a paper letter? How about a text message or a fax instead of
an email? The requirement to send an email is a solution masquerading as a
requirement. The real requirement is to notify the users, but the notification
transport is volatile. There is also volatility in who receives the notification: a
single user or a broadcast to several users receiving the same notification and
over whichever transport. Perhaps the end customer prefers an email while the
end customer’s tax lawyer prefers a documented paper statement. There is also
volatility in who publishes the notification in the first place (see the sketch following this list).
• Storage volatility. The requirements specify the use of a local database. However,
over time, more and more systems migrate to the cloud. There is nothing inherent
in stock trading that precludes benefiting from the cost and economy of scale of
the cloud. The requirement to use a local database is actually another solution
masquerading as a requirement. A better requirement is data persistence, which
accommodates the volatility in the persistence options. However, the majority
of users are end customers, and those users actually perform read-only requests.
This implies the system will benefit greatly from the use of an in-memory cache.
Furthermore, some cloud offerings utilize a distributed in-memory hash table that
offers the same resiliency as traditional file-based durable storage. Requiring data
persistence would exclude these last two options because data persistence is still
a solution masquerading as a requirement. The real requirement is simply that
the system must not lose the data, or that the system is required to store the
data. How that is accomplished is an implementation detail, with a great deal
of volatility, from a local database to a remote in-memory cache in the cloud.
• Connection and synchronization volatility. The current requirements call for
a connected, synchronous, lock-step manner of completing a web form and
submitting it in order. This implies that the traders can do only one request
at a time. However, the more trades the traders execute, the more money they
make. If the requests are independent, why not issue them asynchronously? If
the requests are deferred in time (trades in the future), why not queue up the
calls to the system to reduce the load? When performing asynchronous calls
(including queued calls), the requests can execute out of order. Connectivity and
synchronicity are volatile.
• Duration and device volatility. Some users will complete a trade in one short
session. However, traders earn their keep and maximize their income when they
perform complicated trades that distribute and hedge risk, involving multiple
stocks and sectors, domestic or foreign markets, and so on. Constructing such
a trade can be time-consuming, lasting anywhere from several hours to several
days. Such a long-running interaction will likely span multiple system sessions
and possibly multiple physical devices. There is volatility in the duration of
the interaction, which in turn triggers volatility in the devices and connections
involved.
• Trade item volatility. As discussed previously, over time, the end customers
may want to trade not just stocks but also commodities, bonds, currencies, and
maybe even futures contracts. The trade item itself is volatile.
• Workflow volatility. If the trade item is volatile, processing of the steps involved
in the trade will be volatile too. Buying and selling stocks, scheduling their
orders, and so on are very different from selling commodities, bonds, or cur-
rencies. The workflow of the trade is therefore volatile. Similarly, the workflow
of trade analysis is volatile.
• Locale and regulations volatility. Over time, the system may be deployed into
different locales. Volatility in the locale has drastic implications on the trading
rules, UI localization, the listing of trade items, taxation, and regulatory com-
pliance. The locale and the regulations that apply therein are volatile.
• Market feed volatility. The source of market data could change over time.
Various feeds have a different format, cost, update rate, communication proto-
cols, and so on. Different feeds may show slightly different values for the same
stock at the same point in time. The feeds can be external (e.g., Bloomberg or
Reuters) or internal (e.g., simulated market data for testing, diagnostics, or
trading algorithms research). The market feed is volatile.
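To make the idea of encapsulating one of these volatilities concrete, the following minimal sketch (in Java, with hypothetical names; the book prescribes no particular contract here) hides the notification transport behind a single contract. Swapping email for a text message, a fax, or a paper letter then changes only the implementation, not the components that publish notifications.

```java
// Hypothetical contract encapsulating the notification volatility: callers
// state what to communicate, never how (email, SMS, fax, paper letter, ...).
interface NotificationService {
    void notify(String recipientId, String subject, String body);
}

// One possible transport, used here as a stand-in for a real email gateway.
// Changing the transport later touches only a class like this one.
class ConsoleNotificationService implements NotificationService {
    @Override
    public void notify(String recipientId, String subject, String body) {
        System.out.printf("To %s: %s%n%s%n", recipientId, subject, body);
    }
}
```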
A Key Observation
The preceding list is by no means an exhaustive inventory of all the things that could
change in a stock trading system. Its objective is to point out what could change
and the mindset you need to adopt when searching for volatility. Some of the vol-
atile areas may be out of scope for the project. They may be ruled out by domain
experts as improbable or may relate too much to the nature of the business (such
as branching out of stocks into currencies or foreign markets). My experience,
however, is that it is vital to call out the areas of volatility and map them in your
decomposition as early as possible. Designating a component in the architecture
costs you next to nothing. Later, you must decide whether or not to allocate the
effort to designing and constructing it. However, at least now you are aware of how
to handle that eventuality.
System Decomposition
Once you have settled on the areas of volatility, you need to encapsulate them in
components of the architecture. One such possible decomposition is depicted in
Figure 2-13.
Figure 2-13 A volatility-based decomposition of the trading system, including Trade Workflow, Analysis Workflow, Notification, Feed Access, and Feed Transformation components
The transition from the list of volatile areas to components of the architecture is
hardly ever one to one. Sometimes a single component can encapsulate more than
one area of volatility. Some areas of volatility may not be mapped directly to a com-
ponent but rather to an operational concept such as queuing or publishing an event.
At other times, the volatility of an area may be encapsulated in a third-party service.
With design, always start with the simple and easy decisions. Those decisions
constrain the system, making subsequent decisions easier. In this example, some
mapping is easy to do. The volatility in the data storage is encapsulated behind
data access components, which do not betray where the storage is and what tech-
nology is used to access it. Note in Figure 2-13 the key abstraction of referring to
the storage as Storage and not as Database. While the implementation (accord-
ing to the requirements) is a local database, there is nothing in the architecture
that precludes other options, such as the raw file system, a cache, or the cloud. If
a change to the storage takes place, it is encapsulated in the respective access com-
ponent (such as the Trades Access) and does not affect the other components,
including any other access component. This enables you to change the storage
with minimal consequences.
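As a minimal illustration (hypothetical Java types, not taken from the book), such an access contract can be phrased purely in terms of trades, so that nothing in it betrays whether the data lives in a local database, a file, a cache, or the cloud:

```java
import java.util.List;

// Illustrative trade record; the fields are assumptions for this sketch.
record Trade(String tradeId, String accountId, String symbol, int quantity, double price) {}

// The contract exposes trades, not tables, connections, or files. Moving the
// storage to the cloud changes only the implementation of this interface.
interface TradesAccess {
    void addTrade(Trade trade);
    List<Trade> getTradesForAccount(String accountId);
}
```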
You can use the same pattern for the stock trading workflow and the analysis
workflows. The dedicated Analysis Workflow component encapsulates the vol-
atility in the analysis workflows, and it can use the same Workflow Storage.
The volatility of accessing the market feed is encapsulated in the Feed Access.
This component encapsulates how to access the feed and whether the feed itself
is internal or external. The volatility in the format or even value of the var-
ious market data coming from the different feeds is encapsulated in the Feed
Transformation component. Both of these components decouple the other
components from the feeds by providing a uniform interface and format regardless
of the origin of the data.
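A rough sketch of that decoupling (again with hypothetical Java types; the real contracts are a design decision) could look as follows, with Feed Access hiding how a feed is reached and Feed Transformation normalizing whatever format that feed returns:

```java
import java.math.BigDecimal;

// The uniform, feed-independent quote consumed by the rest of the system.
record Quote(String symbol, BigDecimal price, long timestampMillis) {}

// Encapsulates how to reach a feed: protocol, credentials, error handling,
// payment, and whether the feed is an external provider or an internal simulator.
interface FeedAccess {
    String fetchRawQuote(String symbol); // raw, feed-specific payload
}

// Encapsulates the volatility in feed formats and values by converting any
// raw payload into the uniform Quote.
interface FeedTransformation {
    Quote toQuote(String rawFeedPayload);
}
```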
The clients of the system can be the trading application (Trader App A) or a mobile
app (Trader App B). The end customers can use their own website (Customer
Portal). Each client application also encapsulates the details and the best way of
rendering the information on the target device.
In Homer’s Odyssey, a story that is more than 2500 years old, Odysseus sails home
via the Strait of the Sirens. The Sirens are beautiful winged fairy-like creatures
who have the voices of angels. They sing a song that no man can resist. The sailors
jump into their arms, and the Sirens drown the men under the waves and eat them.
Before encountering the deadly allure of the Sirens’ songs, Odysseus (you, the
architect) is advised to plug the ears of his sailors with beeswax (the rank and file
software developers) and tie them to the oars. The sailors’ job is to row (write
code), and they are not even at liberty to listen to the Sirens. Odysseus himself, on
the other hand, as the leader, does not have the luxury of plugging his ears (e.g.,
maybe you do need that reporting block). Odysseus ties himself to the mast of
the ship so that he cannot succumb to the Sirens even if he wanted to do so (see
Figure 2-14, depicting the scene on a period vase). You are Odysseus, and volatil-
ity-based decomposition is your mast. Resist the siren song of your previous bad
habits.
During system decomposition, you must identify both the areas of volatility
to encapsulate and those not to encapsulate (e.g., the nature of the business).
Sometimes, you will have initial difficulty in telling these apart. There are two
simple indicators if something that could change is indeed part of the nature of
the business. The first indicator is that the possible change is rare. Yes, it could
happen, but the likelihood of it happening is very low. The second indicator is
that any attempt to encapsulate the change can only be done poorly. No practical
amount of investment in time or effort will properly encapsulate the aspect in a
way of which you can be proud.
When you are finished, the foundation will encapsulate the change to the weight
of the building, the power panel will encapsulate the demands of both a single
home and 50 stories, and so on. However, the two indicators are now violated.
First, how many homeowners in your city convert their home each year to a
skyscraper? How common is that? In a large metropolitan area with a million
homes, it may happen once every few years, making the change very rare, once in a
million if that. Second, do you really have the funds (allocated initially for a single
home) to properly execute all these encapsulations? A single pylon may cost more
than the single-family building. Any attempt to encapsulate the future transition
to a skyscraper will be done poorly and will be neither useful nor cost-effective.
Speculative Design
Speculative design is a variation on trying to encapsulate the nature of the business.
Once you subscribe to the principle of volatility-based decomposition, you will
start seeing possible volatilities everywhere and can easily overdo it. When taken
to the extreme, you run the risk of trying to encapsulate anything and everything.
Your design will have numerous building blocks, a clear sign of a bad design.
Consider, for example, the item in Figure 2-15: a pair of SCUBA-ready lady’s high heels. While a lady adorned in
a fine evening gown could entertain her guests at the party wearing these, how
likely is it that she will excuse herself, proceed immediately to the porch, draw
on SCUBA gear, and dive into the reef? Are these shoes as elegant as conven-
tional high heels? Are these as effective as regular flippers when it comes to
swimming or stepping on sharp coral? While the use of the items in Figure 2-15
is possible, it is extremely unlikely. In addition, everything they try to provide is
done very poorly because of the attempt to encapsulate a change to the nature of
the shoe, from a fashion accessory to a diving accessory, something you should
never attempt. If you try this, you have fallen into the speculative design trap.
Most such designs are simply frivolous speculation on a future change to your
system (i.e., a change to the nature of the business).
Suppose you are the lead architect for Federal Express’s next-generation system. Your main com-
petitor is UPS. Both Federal Express and UPS ship packages. Both collect funds,
schedule pickup and delivery, track packages, insure content, and manage trucks
and airplane fleets. Ask yourself the following question: Can Federal Express use
the software system UPS is using? Can UPS use the system Federal Express wants
to build? If the likely answer is no, start listing all the barriers for such a reuse
or extensibility. While both companies perform in the abstract the same service,
the way they conduct their business is different. For example, Federal Express
may plan shipment routes one way, while UPS may plan them another. In that
case, shipment planning is probably volatile because if there are two ways of
doing something, there may be many more. You must encapsulate the shipment
planning and designate a component in your architecture for that purpose. If
Federal Express starts planning shipments the same as UPS at some future time,
the change is now contained in a single component, making the change easy and
affecting only the implementation of that component, not the decomposition. You
have future-proofed your system.
The opposite case is also true. If you and your competitor (and even better, all
competitors) do some activity or sequence the same way, and there is no chance of
your system doing it any other way, then there is no need to allocate a component
in the architecture for that activity. To do so would create a functional decomposi-
tion. When you encounter something your competitors do identically, more likely
than not, it represents the nature of the business, and as discussed previously, you
should not encapsulate it.
You can even guesstimate how long it will be until such a change is likely to take
place using a simple heuristic: the ability of the organization (or the customer or
the market) to instigate or absorb a change is more or less constant because it is
tied to the nature of the business. For example, a hospital IT department is more
conservative and has less tolerance for change than a nascent blockchain startup.
Thus, the more frequently things change, the more likely they will change in the
future, but at the same rate. For example, if every 2 years the company changes
its payroll system, it is likely the company will change the payroll system within
the next 2 years. If the system you design needs to interface with the payroll sys-
tem and the horizon for using your system is longer than 2 years, then you must
encapsulate the volatility in the payroll system and plan to contain the expected
change. You must take into account the effect of a payroll system change even if
the change was never given to you as an explicit requirement. You should strive
to encapsulate changes that occur within the life of the system. If that projected
lifespan is 5 to 7 years, a good starting point is identifying all the things that have
changed in the application domain over the past 7 years. It is likely similar changes
will occur within a similar timespan.
You should examine this way the longevity of all involved systems and subsystems
with which your design interacts. For example, if the enterprise resource planning
(ERP) system changes every 10 years, the last ERP change was 8 years ago, and the
horizon for your new system is 5 years, then it is a good bet the ERP will change
during the life of your system.
• Practice on an everyday software system with which you are familiar, such as
your typical insurance company, a mobile app, a bank, or an online store.
• Examine your own past projects. In hindsight, you already know what the pain
points were. Was the architecture of that past project done functionally? What
things did change? What were the ripple effects of those changes? If you had
encapsulated that volatility, would you have been able to deal with that change
better?
• Look at your current project. It may not be too late to save it: Is it designed func-
tionally? Can you list the areas of volatility and propose a superior architecture?
• Look at non-software systems such as a bicycle, a laptop, a house, and identify
in those the areas of volatility.
Then do it again and do it some more. Practice and practice. After you have ana-
lyzed three to five systems, you should get the general technique. Sadly, learning to
identify areas of volatility is not something you get to master by watching others.
You cannot learn to ride a bicycle from a book. You have to mount a bicycle (and
fall) a few times. The same is true with volatility-based decomposition. It is, how-
ever, preferable to fall during practice than to experiment on live subjects.
3
STRUCTURE
The previous chapter discussed the universal design principle of volatility-based
decomposition. This principle governs the design of all practical systems—from
houses, to laptops, to jumbo planes, to your own body. To survive and thrive, they
all encapsulate the volatility of their constituent components. Software architects
only have to design software systems. Fortunately, these systems share common
areas of volatility. Over the years I have found these common areas of volatil-
ity within hundreds of systems. Furthermore, there are typical interactions, con-
straints, and run-time relationships between these common areas of volatility. If
you recognize these, you can produce correct system architecture quickly, effi-
ciently, and effectively.
Given this observation, The Method provides a template for the areas of volatility,
guidelines for their interaction, and recommended operational patterns. By doing so,
The Method goes beyond mere decomposition. Being able to furnish such general
guidelines and structure across most software systems may sound far-fetched.
You may wonder how these kinds of broad strokes could possibly apply across
the diversity of software systems. The reason is that a good architecture accommodates use
in different contexts. For example, a mouse and an elephant are vastly different,
yet they share an identical architecture. The detailed designs of the mouse and the ele-
phant, however, are very different. Similarly, The Method can provide you with
the system architecture, but not its detailed design.
This chapter is all about The Method’s way of structuring a system, the advan-
tages this brings, and its implications on the architecture. You will see classifica-
tion of services based on their semantics and the associated guidelines, as well as
how to layer your design. In addition, having clear, consistent nomenclature for
components in your architecture and their relationship brings two other advan-
tages. First, it provides a good starting point. You will still have to sweat over it,
but at least you start at a reasonable point. Second, it improves communication
because you can now convey your design intent to other architects or developers.
Even communicating with yourself in this way is very valuable, as it helps to clarify
your own thoughts.
Requirements should capture the required behavior rather than the required func-
tionality. You should specify how the system is required to operate as opposed to
what it should do, which is arguably the essence of requirements gathering. As
with most other things, this does take additional work and effort (something that
people in general try to avoid), so getting requirements into this form will be an
uphill struggle.
REQUIRED BEHAVIORS
A use case is an expression of required behavior—that is, how the system is
required to go about accomplishing some work and adding value to the business.
As such, a use case is a particular sequence of activities in the system. Use cases
tend to be verbose and descriptive. They can describe end-user interactions with
the system, or the system’s interactions with other systems, or back-end process-
ing. This ability is important because in any well-designed system, even one of mod-
est size and complexity, the users interact with or observe just a small part of the
system, which represents the tip of the iceberg. The bulk of the system remains
below the waterline, and you should produce use cases for it as well.
You can capture use cases either textually or graphically. Textual use cases are
easy to produce, which is a distinct advantage. Unfortunately, text is an inferior
medium for describing use cases because they may be too
complex to capture with high fidelity in writing. The real problem with textual use
cases is that hardly anyone bothers to read even simple text, and for a good reason.
Reading is an artificial activity for the human brain, because the brain is not wired
to easily absorb and process complex ideas via text. Mankind has been reading
for 5000 years—not long enough for the brain to catch up, evolutionarily speaking
(thank you for making the effort with this book, though).
The best way of capturing a use case is graphically, with a diagram (Figure 3-1).
Humans perform image processing astonishingly quickly because a large share of
the human brain functions as a massive image-processing unit. Diagrams allow you to take
advantage of this processor to communicate ideas to your audience.
Activity Diagrams
The Method prefers activity diagrams1 for graphical representation of use cases,
primarily because activity diagrams can capture time-critical aspects of behavior,
something that flowcharts and other diagrams are incapable of doing. You cannot
represent parallel execution, blocking, or waiting for some event to take place in
a flowchart. Activity diagrams, by contrast, incorporate a notion of concurrency.
1. https://en.wikipedia.org/wiki/Activity_diagram
For example, in Figure 3-2, you intuitively see the handling of parallel execution
as a response to the event without even seeing a notation guide for the diagram.
Note also how easy it is to follow the nested condition.
Figure 3-2 An activity diagram showing parallel processing of a new batch and a nested validity condition
Caution Do not confuse activity diagrams with use case diagrams. Use
case diagramsa are user-centric and should have been called user case dia-
grams. Use case diagrams also do not include a notion of time or sequence.
a. https://en.wikipedia.org/wiki/Use_case_diagram
LAYERED APPROACH
Software systems are typically designed in layers, and The Method relies heavily
on layers. Layers allow you to layer encapsulation. Each layer encapsulates its own
volatilities from the layers above and the volatilities in the layers below. Services
inside the layers encapsulate volatility from each other, as shown in Figure 3-3.
Even simple systems should be designed in layers to gain the benefit of encapsula-
tion. In theory, the more layers, the better the encapsulation. Practical systems will
have only a handful of layers, terminating with a layer of actual physical resources
such as a data storage or a message queue.
USING SERVICES
The preferred way of crossing layers is by calling services. While you certainly
can benefit from the structure of The Method and volatility-based decomposition
even with regular classes, relying on services provides distinct advantages. Which
technology and platform you use to implement your services is a secondary con-
cern. When you do use services (as long as the technology you choose allows), you
immediately gain significant benefits out of the box.
TYPICAL LAYERS
The Method calls for four layers in the system architecture. These layers conform
to some classic software engineering practices. However, using volatility to drive
the decomposition inside these layers may be new to you. Figure 3-4 depicts the
typical layers in The Method.
Figure 3-4 The typical layers of The Method: a client layer, a business logic layer of Managers (A through D), a resource access layer of ResourceAccess components (A through C), a resource layer, and a vertical Utilities bar with services such as Logging, Diagnostics, Pub/Sub, and Message Bus
The client layer also encapsulates the potential volatility in Clients. Your system
now and in the future across the axes of volatility may have different Clients such
as desktop applications, web portals, mobile apps, holograms and augmented real-
ity, APIs, administration applications, and so on. The various Client applications
will use different technologies, be deployed differently, have their own versions
and life cycles, and may be developed by different teams. Indeed, the client layer
is often the most volatile part of a typical software system. However, all of that
volatility is encapsulated in the various blocks of the client layer, and changes in
one component do not affect another Client component.
need for a business logic layer. Use cases, however, are volatile, across both cus-
tomers and time. Since a use case contains a sequence of activities in the system, a
particular use case can change in only two ways: Either the sequence itself changes
or the activities within the use case change. For example, consider the use case in
Figure 3-1 versus the use cases in Figure 3-5.
Figure 3-5 Two additional use cases composed of the same activities A, B, and C in different sequences
All four use cases in Figures 3-1 and 3-5 use the same activities A, B, and C, but
each sequence is unique. The key observation here is that the sequence or the
orchestration of the workflow can change independently from the activities.
Now consider the two activity diagrams in Figure 3-6. Both call for exactly the
same sequence, but they use different activities. The activities can change inde-
pendently from the sequence.
Both the sequence and the activities are volatile, and in The Method these vol-
atilities are encapsulated in specific components called Managers and Engines.
Manager components encapsulate the volatility in the sequence, whereas Engine
components encapsulate the volatility in the activity. In Chapter 2, in the stock
trading decomposition example, the Trade Workflow component (see Figure 2-13)
is a Manager, while the Feed Transformation component is an Engine.
Since use cases are often related, Managers tend to encapsulate a family of logi-
cally related use cases, such as those in a particular subsystem. For example, with
the stock trading system of Chapter 2, Analysis Workflow is a separate Manager
from Trade Workflow, and each Manager has its own related set of use cases to
execute. Engines have more restricted scope and encapsulate business rules and
activities.
Since you can have great volatility in the sequence without any volatility in the
activities of the sequence (see Figure 3-5), Managers may use zero or more Engines.
Engines may be shared between Managers because you could perform an activity
in one use case on behalf of one Manager and then perform the same activity for
another Manager in a separate use case. You should design Engines with reuse in
mind. However, if two Managers use two different Engines to perform the same
activity, you either have functional decomposition on your hands or you have
missed some activity volatility. You will see more on Managers and Engines later
in this chapter.
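The following minimal sketch (hypothetical Java types, loosely based on the stock trading example) shows that division of labor: the Manager owns the sequence of the use case, while the Engine owns a single volatile activity that another Manager could reuse in a different use case.

```java
// Engine: encapsulates how a volatile activity is performed. Another Manager
// executing a different use case can reuse the same Engine.
interface FeeCalculationEngine {
    double calculateFee(String symbol, int quantity);
}

// Manager: encapsulates the volatility of the use case sequence. If the
// workflow changes, only the Manager changes; if the fee rules change,
// only an Engine implementation changes.
class TradeManager {
    private final FeeCalculationEngine feeEngine;

    TradeManager(FeeCalculationEngine feeEngine) {
        this.feeEngine = feeEngine;
    }

    void buyStock(String symbol, int quantity) {
        // Step 1: delegate the volatile activity to the Engine.
        double fee = feeEngine.calculateFee(symbol, quantity);
        // Steps 2..n: the rest of the sequence (validation, persistence via
        // ResourceAccess, notification) would be orchestrated here.
        System.out.printf("Executing buy of %d %s, fee %.2f%n", quantity, symbol, fee);
    }
}
```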
While the motivation behind the resource access layer is readily evident and many
systems incorporate some form of an access layer, most such layers end up expos-
ing the underlying volatility by creating a ResourceAccess contract that resembles
I/O operations or that is CRUD-like. For example, if your ResourceAccess service
contract contains operations such as Select(), Insert(), and Delete(), the
underlying resource is most likely a database. If you later change the database to a
distributed cloud-based hash table, that database-access-like contract will become
useless, and a new contract is required. Changing the contract affects every Engine
and Manager that has used the ResourceAccess component. Similarly, you must
avoid operations such as Open(), Close(), Seek(), Read(), and Write()
that betray the underlying resource as being a file. A well-designed ResourceAccess
component exposes in its contract the atomic business verbs around a resource.
Atomic business verbs are practically immutable because they relate strongly to
the nature of the business, which, as discussed in Chapter 2, hardly ever changes.
For example, since the time of the Medici, banks have performed credit and debit
operations. Internally, the ResourceAccess service should convert these verbs from
its contract into CRUDs or I/O against the resources. By exposing only the stable
atomic business verbs, when the ResourceAccess service changes, only the inter-
nals of the access component change, rather than the whole system atop it.
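A minimal sketch (hypothetical Java interfaces, using the banking verbs mentioned above) of the difference between a contract that betrays the resource and one that exposes only atomic business verbs:

```java
import java.math.BigDecimal;

// Betrays the resource: this contract only makes sense against a database.
// Replacing the database invalidates the contract and every consumer of it.
interface AccountTableAccess {
    void insert(String accountId, BigDecimal openingBalance);
    BigDecimal select(String accountId);
    void update(String accountId, BigDecimal newBalance);
    void delete(String accountId);
}

// Exposes only the atomic business verbs of the domain. Internally, the
// implementation converts credit and debit into CRUD or I/O operations
// against whatever resource happens to be in use today.
interface AccountAccess {
    void credit(String accountId, BigDecimal amount);
    void debit(String accountId, BigDecimal amount);
}
```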
ResourceAccess Reuse
ResourceAccess services can be shared between Managers and Engines. You
should explicitly design ResourceAccess components with this reuse in mind. If
two Managers or two Engines cannot use the same ResourceAccess service when
accessing the same resource or have some need for specific access, perhaps you did
not encapsulate some access volatility or did not isolate the atomic business verbs
correctly.
UTILITIES BAR
The utilities vertical bar on the right side of Figure 3-4 contains Utility services.
These services are some form of common infrastructure that nearly all systems
require to operate. Utilities may include Security, Logging, Diagnostics,
Instrumentation, Pub/Sub, Message Bus, Hosting, and more. You will see
later in this chapter that Utilities require different rules compared with the other
components.
CLASSIFICATION GUIDELINES
As is true for every good idea, The Method can be abused. Without practice and
critical thinking, it is possible to use The Method taxonomy in name only and still
produce a functional decomposition. You can mitigate this risk to a great extent
by adhering to the simple guidelines provided in this section.
WHAT’S IN A NAME
Service names as well as diagrams are important in communicating your design to
others. Descriptive names are so important within the business and resource access
layers that The Method recommends the following conventions for naming them:
• For Engines, the prefix should be a noun describing the encapsulated activity.
• For ResourceAccess, the prefix should be a noun associated with the Resource,
such as data that the service provides to the consuming use cases.
• Gerunds (a gerund is a noun created by tacking “ing” onto a verb) should be
used as a prefix only with Engines. The use of gerunds elsewhere in the busi-
ness or access layers usually signals functional decomposition.
• Atomic business verbs should not be used in a prefix for a service name. These
verbs should be confined to operation names in contracts interfacing with the
resource access layer.
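For illustration only, here are a few hypothetical names that follow (and two that violate) these conventions; none of them is mandated by The Method itself:

```java
// Engines: the prefix is a noun (often a gerund) naming the encapsulated activity.
interface FeedTransformationEngine { /* how raw feed data is transformed */ }
interface FeeCalculationEngine     { /* how fees are calculated */ }

// ResourceAccess: the prefix is a noun naming the resource (or the data served).
interface TradesAccess  { /* access to the trades resource */ }
interface AccountAccess { /* access to the accounts resource */ }

// Questionable names that usually signal functional decomposition:
// interface BillingManager {}  // gerund prefix outside an Engine
// interface CreditAccess  {}   // atomic business verb used as a prefix
```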
The four questions loosely correspond to the layers because volatility trumps every-
thing. For example, if there is little or no volatility in the “how,” the Managers can
perform both “what” and “how.”
Asking and answering the four questions is useful at both ends of the design effort,
for initiation and for validation. If all you have is a clean slate and no clear idea
where to start, you can initiate the design effort by answering the four questions.
Make a list of all the “who” and put them in one bin as candidates for Clients. Make
a list of all the “what” and put them in another bin as candidates for Managers, and
so on. The result will not be perfect—for example, all “what” components will not
necessarily coalesce into individual Managers—but it is a start.
Figure 3-7 The layers and the four questions: Engines A and B answer the business "how," while Resources A and B answer the "where"
Once you complete your design, take a step back and examine the design. Are
all your Clients “who,” with no trace of “what” in them? Are all the Managers
“what,” without a smidgen of “who” and “where” in them? Again, the mapping
of questions to layers will not be perfect. In some cases, you could have crossover
between questions. However, if you are convinced the encapsulation of the vola-
tility is justified, there is no reason to doubt that choice further. If you are uncon-
vinced, the questions could indicate a red flag and a decomposition to investigate.
The four questions tie in nicely with the previous guideline on naming the ser-
vices. If the Manager prefixes describe the encapsulated volatilities, it is more
natural to talk about them in terms of “what” as opposed to the verb-like “how.”
If the Engine prefixes are gerunds describing the encapsulated activities, it is more
natural to talk about them in terms of “how” as opposed to “what” or “where.”
For similar reasons, ResourceAccess encapsulates “how” to access the Resources
that lie behind it.
MANAGERS-TO-ENGINES RATIO
Most designs end up with fewer Engines than you might initially imagine. First,
for an Engine to exist, there must be some fundamental operational volatility that
you should encapsulate—that is, an unknown number of ways of doing some-
thing. Such volatilities are uncommon. If your design contains a large number of
Engines, you may have inadvertently done a functional decomposition.
In our work at IDesign, we have observed across numerous systems that Managers
and Engines tend to maintain a golden ratio. If your system has only one Manager
(not a god service), you may have no Engines, or at most one Engine. Think about
it: If the system is so simple that one decent Manager suffices, how likely is it to
have high volatility in the activities but not that many types of use cases?
Generally, if your system has two Managers, you will likely need one Engine. If
your system has three Managers, two Engines is likely the best number. If your
system has five Managers, you may need as many as three Engines. If your sys-
tem has eight Managers, then you have already failed to produce a good design:
The large number of Managers strongly indicates you have done a functional
or domain decomposition. Most systems will never have that many Managers
because they will not have many truly independent families of use cases with their
own volatility. In addition, a Manager can support more than one family of use
cases, often expressed as different service contracts, or facets of the service. This
can further reduce the number of Managers in a system.
KEY OBSERVATIONS
Armed with the recommendations of The Method, you can make some sweep-
ing observations about the qualities you expect to see in a well-designed system.
Deviating from these observations may indicate a lingering functional decompo-
sition or at least an unripe decomposition in which you have encapsulated some of
the glaring volatilities but have missed others.
Note Clear terminology for the various pieces of the architecture en-
ables this type of communication of observations and recommendations.
You can change activities and their sequence without ever changing the mapping
of the atomic business verbs to Resources. Resources are the least volatile compo-
nents, changing at a glacial pace compared with the rest of the system.
A design in which the volatility decreases down the layers is extremely valuable.
The components in the lower layers have more items that depend on them. If the
components you depend upon the most are also the most volatile, your system will
implode.
Almost-Expendable Managers
Managers can fall into one of three categories: expensive, expendable, and almost
expendable. You can distinguish the category to which a Manager belongs by the
way you respond when you are asked to change it. If your response is to fight the
change, to fear its cost, to argue against the change, and so forth, then the Manager
was clearly expensive and not expendable. An expensive Manager indicates that
the Manager is too big, likely due to functional decomposition. If your response to
the change request is just to shrug it off, thinking little of it, the Manager is pass-
through and expendable. Expendable Managers are always a design flaw and a
distortion of the architecture. They often exist only to satisfy the design guidelines
without any real need for encapsulating use case volatility.
Avoid over-partitioning your system into subsystems. Most systems should have
only a handful of subsystems. Likewise, you should limit the number of Managers
per subsystem to three. This also allows you to somewhat increase the total num-
ber of Managers in the system across all subsystems.
INCREMENTAL CONSTRUCTION
If the system is relatively simple and small, the business value of the system—that
is, the execution of the use cases—will likely require all components of the archi-
tecture. For such systems, there is no sense in releasing, say, just the Engines or the
ResourceAccess components.
With a large system, it could be that certain subsystems (such as the vertical slices
of Figure 3-8) can stand alone and provide direct business value. Such systems will
be more expensive to build and take longer to complete. In such cases it makes
sense to develop and deliver the system in stages, one slice at a time, as opposed
to providing a single release at the end of the project. Moreover, the customer will
be able to provide early feedback to the developers on the incremental releases as
opposed to only the complete system at the end.
With both small and large systems, the right approach to construction is another
universal principle: design iteratively, build incrementally.
This principle is true regardless of domain and industry. For example, suppose you
wish to build your house on a plot of land you have purchased. Even the best archi-
tect will not be able to produce the design for your house in a single session. There
will be some back-and-forth as you define the problem and discuss constraints
such as funds, occupants, style, time, and risk. You will start with some rough cuts
to the blueprints, refine them, evaluate the implications, and examine alternatives.
After several of these iterations, the design will converge. When it is time to build
the house, will you do that iteratively, too? Will you start with a two-person tent,
grow it out to a four-person tent, then to a small shed, then to a small house, and
finally to a bigger house? It would be insane to even contemplate such an approach.
Instead, you are likely to dig and cast the foundation, then erect the walls to the
first floor, then connect utilities to the structure, then add the second floor, and
finally add the roof. In short, you build a simple house incrementally. There is no
value for the prospective homeowner in having just the foundations or the roof.
That is, the house—like an incrementally built simple software system—has no
real value until complete. However, if the building has multiple floors (or multiple
wings), it may be possible to build it incrementally and deliver intermediate value.
Your design may allow you to complete one floor at a time (or one wing at a time),
similar to the “one slice at a time” approach to a large software system.
Another example is assembling cars. While the car company may have had a team
of designers designing a car across multiple iterations, when it is time to build the
car, the manufacturing process does not start with a skateboard, grow that to a
scooter, then a bicycle, then a motorcycle, and finally a car. Instead, a car is built
incrementally. First, the workers weld a chassis together, then they bolt on the
engine block, and then they add the seats, the skin, and the tires. They paint the
car, add the dashboard, and finally install the upholstery.
There are two reasons why you can build only incrementally, and not iteratively.
First, building iteratively is horrendously wasteful and difficult (turning a motor-
cycle into a car is much more difficult than just building a car). Second, and much
more importantly, the intermediate iterations do not have any business value. If
the customer wants a car to take the kids to school, what would the customer do
with a motorcycle and why should the customer pay for it?
The ability to build incrementally over time, within the confines of the architec-
ture, is predicated on the architecture remaining constant and true. With func-
tional decomposition, you face ever-shifting piles of debris. It is fair to assume
that those who know only functional decomposition are condemned to iterative
construction. With volatility-based decomposition, you have a chance of getting
it right.
Extensibility
The vertical slices of the system also enable you to accommodate extensibility.
The correct way of extending any system is not by opening it up and hammering
on existing components. If you have designed correctly for extensibility, you can
mostly leave existing things alone and extend the system as a whole. Continuing
the house analogy, if you want to add a second floor to a single-story house at
some point in the future, then the first floor must have been designed to carry
the additional load, the plumbing must have been done in a way that could be
extended to the second floor, and so on. Adding a second floor by destroying the
first floor and then building new first and second floors is called rework, not exten-
sibility. The design of a Method-based system is geared toward extensibility: Just
add more of these slices or subsystems.
ABOUT MICROSERVICES
I am credited as one of the pioneers of microservices. As early as 2006, in my
speaking and writing I called for building systems in which every class was a
service. 2,3 This requires the use of a technology that can support such granular
2. https://wikipedia.org/wiki/Microservices#History
3. Juval Löwy, Programming WCF Services, 1st ed. (O’Reilly Media, 2007), 543–553.
The first problem is the implied constraint on the number of services. If smaller
services are better than larger services, why stop at the subsystem level? The sub-
system is still too big as the most granular service unit. Why not have the building
blocks of the subsystem be services? You should push the benefits of services as far
down the architecture as possible. In a Method subsystem, the Manager, Engine,
and ResourceAccess components within a subsystem must be services as well.
4. Löwy, Programming WCF Services, 1st ed., pp. 48–51; Juval Löwy, Programming WCF Services, 3rd ed.
(O’Reilly Media, 2010), 74–75.
without gaining any of the benefits of the modularity of the services. This double
punch may be more than what most projects can handle. Indeed, I fear that micros-
ervices will be the biggest failure in the history of software. Maintainable, reusable,
extensible services are possible—just not in this way.
For example, my laptop has a drive that provides it with a very important service:
storage. The laptop also consumes a service offered by the network router for all
DNS requests, and an SMTP server that offers email service. For the external
services, the laptop uses TCP/IP; for the internal services like the drive, it uses
SATA. The laptop utilizes multiple such specialized internal protocols to perform
its essential functions.
Another example is the human body. Your liver provides you with a very important
service: metabolism. Your body also provides a valuable service to your custom-
ers and organization, and you use a natural language (English) to communicate
with them. However, you do not speak English to communicate with your liver.
Instead, you use nerves and hormones.
The protocol used for external services is typically low bandwidth, slow, expen-
sive, and error prone. Such attributes indicate a high degree of decoupling.
Unreliable HTTP may be perfect for external services, but this protocol should
be avoided between internal services where the communication and the services
must be impeccable.
Using the wrong protocol between services can be fatal. It is not the end of the world
if you cannot talk with your boss or have a misunderstanding with a customer, but
you will die if you cannot communicate correctly or at all with your liver.
Similar level-of-service issues exist with specialization and efficiency. Using HTTP
between internal services is akin to using English to control your body’s internal
services. Even if the words were perfectly heard and understood, English lacks the
adaptability, performance, and vocabulary required for describing the internal
services’ interactions.
Internal services such as Engines and ResourceAccess should rely on fast, reliable,
high-performance communication channels. These include TCP/IP, named pipes,
IPC, Domain Sockets, Service Fabric remoting, custom in-memory interception
chains, message queues, and so on.
OPEN ARCHITECTURE
In an open architecture, any component can call any other component regardless
of the layer in which the components reside. Components can call up, sideways,
and down as much as you like. Open architectures offer the ultimate flexibility.
However, an open architecture achieves that flexibility by sacrificing encapsula-
tion and introducing a significant amount of coupling.
For example, imagine in Figure 3-4 that the Engines directly call the Resources.
While such a call is technically possible, when you wish to switch Resources or
merely change the way you access a Resource, suddenly all your Engines must
change. How about the Clients calling ResourceAccess services directly? While
that is not as bad as calling the Resources themselves, all the business logic must
migrate to the Clients. Any change to the business logic would then force rework-
ing the Clients.
How about Manager A calling Manager B directly? Manager A was supposed
to encapsulate a set of independent use cases. Are the use cases of Manager B now
independent of those of Manager A? Any change to Manager B’s way of doing the
activity will break Manager A, calling to mind the issues of Figure 2-5. Calling
sideways in this way is almost always the result of functional decomposition at the
Managers level.
How about Engine A calling Engine B? Was Engine B a separate volatile activ-
ity from Engine A? Again, functional decomposition is likely behind the need to
chain the Engines calls.
When using open architecture, there is hardly any benefit of having architectural
layers in the first place. In general, in software engineering, trading encapsulation
for flexibility is a bad trade.
CLOSED ARCHITECTURE
In a closed architecture, you strive to maximize the benefits of the layers by disal-
lowing calling up between layers and sideways within layers. Disallowing calling
down between layers would maximize the decoupling between the layers but pro-
duce a useless design. A closed architecture opens a chink in the layers, allowing
components in one layer to call those in the adjacent lower layer. The components
within a layer are of service to the components in the layer immediately above
them, but they encapsulate whatever happens underneath. Closed architecture
promotes decoupling by trading flexibility for encapsulation. In general, that is a
better trade than the other way around.
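To make the closed-architecture calling rules concrete, here is a minimal sketch in Java (the component names are purely illustrative): each component holds a reference only to the layer directly beneath it, and nothing references a component above it or beside it.

// Illustrative sketch of a closed architecture: each layer calls only the layer below it.
interface IResourceAccessA { String load(String id); }   // ResourceAccess layer
interface IEngineA { String execute(String id); }        // Engine layer (a volatile activity)
interface IManagerA { String doUseCase(String id); }     // Manager layer (a family of use cases)

class ResourceAccessA implements IResourceAccessA {
    public String load(String id) { return "state-of-" + id; }   // stands in for a real Resource
}

class EngineA implements IEngineA {
    private final IResourceAccessA access;                        // reference points down only
    EngineA(IResourceAccessA access) { this.access = access; }
    public String execute(String id) { return "activity(" + access.load(id) + ")"; }
}

class ManagerA implements IManagerA {
    private final IEngineA engine;                                // no references up or sideways
    ManagerA(IEngineA engine) { this.engine = engine; }
    public String doUseCase(String id) { return engine.execute(id); }
}

public class ClosedArchitectureSketch {
    public static void main(String[] args) {
        IManagerA manager = new ManagerA(new EngineA(new ResourceAccessA()));
        System.out.println(manager.doUseCase("42"));              // the Client calls down into the Manager
    }
}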
SEMI-CLOSED/SEMI-OPEN ARCHITECTURE
It is easy to point out the clear problems with open architectures—of allowing
calling up, down, or sideways. However, are all three sins equally bad? The worst
of them is calling up: That not only creates cross-layer coupling, but also imports
the volatility of a higher layer to the lower layers. The second worst offender is
calling sideways because such calls couple components inside the layer. The closed
architecture allows calling one layer below, but what about calling multiple layers
down? A semi-closed/semi-open architecture allows calling more than one layer
down. This, again, is a trade of encapsulation for flexibility and performance and,
in general, is a trade to avoid.
There are, however, two cases in which a semi-closed/semi-open architecture makes
sense. The first involves a key piece of infrastructure where calling
down multiple layers may adversely affect performance. For example, consider the
Open Systems Interconnection (OSI) model of seven layers for network commu-
nication. 5 When vendors implement this model in their TCP stack, they cannot
afford the performance penalty incurred by seven layers for every call, and they
sensibly choose a semi-closed/semi-open architecture for the stack. The second
case occurs within a codebase that hardly ever changes. The loss in encapsulation
and the additional coupling in such a codebase is immaterial because you will not
have to maintain the code much, if at all. Again, a network stack implementation
is a good example for code that hardly ever changes.
While closed architecture systems are the most decoupled and the most encapsu-
lated, they are also the least flexible. This inflexibility could lead to Byzantine-like
levels of complexity due to the indirections and intermediacy, and rigid design is
inadvisable. The Method relaxes the rules of closed architecture to reduce com-
plexity and overhead without compromising encapsulation or decoupling.
Calling Utilities
In a closed architecture, Utilities pose a challenge. Consider Logging, a service
used for recording run-time events. If you classify Logging as a Resource, then
the ResourceAccess can use it, but the Managers cannot. If you place Logging
at the same level as the Managers, only the Clients can log. The same goes for
Security or Diagnostics—services that almost all other components require.
In short, there is no good location for Utilities among the layers of a closed archi-
tecture. The Method places Utilities in a vertical bar on the side of the layers (see
Figure 3-4). This bar cuts across all layers, allowing any component in the archi-
tecture to use any Utility.
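Here is a minimal sketch, with assumed names, of how a Utility such as Logging cuts across the layers: components from any layer can call it, while the calling rules between Managers, Engines, and ResourceAccess remain intact.

// Utility: may be called from any layer (Clients, Managers, Engines, ResourceAccess).
interface ILogger { void log(String component, String message); }

class ConsoleLogger implements ILogger {
    public void log(String component, String message) {
        System.out.println("[" + component + "] " + message);
    }
}

class SomeManager {
    private final ILogger logger;
    SomeManager(ILogger logger) { this.logger = logger; }
    void doUseCase() { logger.log("SomeManager", "use case started"); }     // a Manager logs
}

class SomeResourceAccess {
    private final ILogger logger;
    SomeResourceAccess(ILogger logger) { this.logger = logger; }
    void save() { logger.log("SomeResourceAccess", "state saved"); }        // ResourceAccess logs too
}

public class UtilitySketch {
    public static void main(String[] args) {
        ILogger logger = new ConsoleLogger();
        new SomeManager(logger).doUseCase();
        new SomeResourceAccess(logger).save();
    }
}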
You may see attempts by some developers to abuse the utilities bar by christening
as a Utility any component they wish to short-circuit across all layers. Not all
5. https://en.wikipedia.org/wiki/OSI_model
components can reside in the utilities bar. To qualify as a Utility, the compo-
nent must pass a simple litmus test: Can the component plausibly be used in any
other system, such as a smart cappuccino machine? For example, a smart cappuc-
cino machine could use a Security service to see if the user can drink coffee.
Similarly, the cappuccino machine may want to log how much coffee the office
workers drink, have diagnostics, and be able to use the Pub/Sub service to pub-
lish an event notifying that it is running low on coffee. Each of these needs justifies
encapsulation in a Utility service. In contrast, you will be hard-pressed to explain
why a cappuccino machine has a mortgage interest calculating service as a Utility.
Queued Manager-to-Manager
While Managers should not call directly sideways to other Managers, a Manager
can queue a call to another Manager. There are actually two explanations—a
technical one and a semantic one—why this does not violate the closed architec-
ture principle.
The technical explanation involves the very mechanics of a queued call. When a
client calls a queued service, the client interacts with a proxy to the service, which
then deposits the message into a message queue for the service. A queue listener
entity monitors the queue, detects the new message, picks it off the queue, and
6. Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides, Design Patterns: Elements of
Reusable Object-Oriented Software (Addison-Wesley, 1994).
calls the service. Using The Method structure, when a Manager queues a call
to another Manager, the proxy is a ResourceAccess to the underlying Resource,
the queue; that is, the call actually goes down, not sideways. The queue listener
is effectively another Client in the system, and it is also calling downward to the
receiving Manager. No sideways call actually takes place.
The semantic explanation involves the nature of use cases. Business systems quite
commonly have one use case that triggers a latent, much-deferred execution of
another use case. For example, imagine a system in which a Manager executing a
use case must save some system state for analysis at the end of the month. Without
interrupting its flow, the Manager could queue the analysis request to another
Manager. The second Manager could dequeue at the month’s end and perform
its analysis workflow. The two use cases are independent and decoupled on the
timeline.
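The following sketch illustrates the mechanics described above, using an in-memory queue as a stand-in for the real messaging Resource (all names are assumed for illustration). The sending Manager only calls down to enqueue the message, and the queue listener, acting as just another Client, calls down into the receiving Manager.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Stand-in for the queue Resource; a real system would reach it through ResourceAccess.
class AnalysisQueue {
    private final BlockingQueue<String> messages = new LinkedBlockingQueue<>();
    void enqueue(String request) { messages.add(request); }                   // the caller only calls down
    String dequeue() throws InterruptedException { return messages.take(); }
}

class OperationsManager {                       // executes the original use case
    private final AnalysisQueue queue;
    OperationsManager(AnalysisQueue queue) { this.queue = queue; }
    void executeUseCase(String state) {
        // ...perform the use case, then defer the analysis without interrupting the flow
        queue.enqueue("analyze:" + state);
    }
}

class AnalysisManager {                         // executes the deferred use case later
    void analyze(String request) { System.out.println("month-end analysis of " + request); }
}

// The queue listener is effectively another Client: it picks the message off the
// queue and calls down into the receiving Manager. No sideways call takes place.
public class QueuedCallSketch {
    public static void main(String[] args) throws InterruptedException {
        AnalysisQueue queue = new AnalysisQueue();
        new OperationsManager(queue).executeUseCase("order-17");
        new AnalysisManager().analyze(queue.dequeue());
    }
}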
DESIGN “DON’TS”
With the definitions in place for both the services and the layers, it is also possible
to compile a list of things to avoid—the design “don’ts.” Some of the items on the
list may appear obvious to you after the previous sections, yet I have seen them
often enough to conclude they were not obvious after all. The main reason people
go against a “don’t” guideline is that they have created a functional decomposition
and have managed to convince themselves that it is not functional.
If you do one of the things on this list, you will likely live to regret it. Treat any
violation of these rules as a red flag and investigate further to see what you are
missing:
• Clients do not call multiple Managers in the same use case. Doing so implies
that the Managers are tightly coupled and no longer represent separate families
of use cases, or separate subsystems, or separate slices. Chained Manager calls
from the Client indicate functional decomposition, requiring the Client to stitch
the underlying functionalities together (see Figure 2-1). Clients can call multiple
Managers but not in the same use case; for example, a Client can call Manager
A to perform use case 1 and then call Manager B to perform use case 2.
• Clients do not call Engines. The only entry points to the business layer are the
Managers. The Managers represent the system, and the Engines are really an
internal layer implementation detail. If the Clients call the Engines, use case
sequencing and associated volatility are forced to migrate to the Clients, pollut-
ing them with business logic. Calls from Clients to Engines are the hallmark of
a functional decomposition.
• Managers do not queue calls to more than one Manager in the same use case.
If there are two Managers receiving a queued call, why not a third? Why not all
of them? The need to have two (or more) Managers respond to a queued call is
a strong indication that more Managers (and maybe all of them) would need to
respond, so you should use a Pub/Sub Utility service instead.
• Engines do not receive queued calls. Engines are utilitarian and exist to execute
a volatile activity for a Manager. They have no independent meaning on their
own. A queued call, by definition, executes independently from anything else
in the system. Performing just the activity of an Engine, disconnected from any
use case or other activities, does not make any business sense.
• ResourceAccess services do not receive queued calls. Very similar to the Engines
guideline, ResourceAccess services exist to service a Manager or an Engine and
have no meaning on their own. Accessing a Resource independently from any-
thing else in the system does not make any business sense.
• Clients do not publish events. Events represent changes to the state of the system
about which Clients (or Managers) may want to know. A Client has no need
to notify itself (or other Clients). In addition, knowledge of the internals of the
system is often required to detect the need to publish an event—knowledge that
the Clients should not have. However, with a functional decomposition, the
Client is the system and needs to publish the event.
• Engines do not publish events. Publishing an event requires noticing and
responding to a change in the system and is typically a step in a use case exe-
cuted by the Manager. An Engine performing an activity has no way of know-
ing much about the context of the activity or the state of the use case.
• ResourceAccess services do not publish events. ResourceAccess services have
no way of knowing the significance of the state of the Resource to the system.
Any such knowledge or responding behavior should reside in Managers.
• Resources do not publish events. The need for the Resource to publish events
is often the result of a tightly coupled functional decomposition. Similar to the
case for ResourceAccess, business logic of this kind should reside in Managers.
As a Manager modifies the state of the system, the Manager should also publish
the appropriate events.
• Engines, ResourceAccess, and Resources do not subscribe to events. Processing
an event is almost always the start of some use case, so it must be done in a
Client or a Manager. The Client may inform a user about the event, and the
Manager may execute some back-end behavior.
• Engines never call each other. Not only do such calls violate the closed architec-
ture principle, but they also do not make sense in a volatility-based decomposi-
tion. The Engine should have already encapsulated everything to do with that
activity. Any Engine-to-Engine calls indicate functional decomposition.
• ResourceAccess services never call each other. If ResourceAccess services
encapsulate the volatility of an atomic business verb, one atomic verb cannot
require another. This is similar to the rule that Engines should not call each
other. Note that a 1:1 mapping between ResourceAccess and Resources (every
Resource has its own ResourceAccess) is not required. Often two or more
Resources logically must be joined together to implement some atomic business
verbs. A single ResourceAccess service should perform the join rather than relying
on calls between ResourceAccess services.
Symmetry
Evolutionary pressures apply to software systems as well, forcing the systems to respond to the
changing environment or become extinct. The quest for symmetry, however, is only
at the architecture level, not in detailed design. Certainly, your internal organs are
not symmetric because such symmetry offered no evolutionary advantage to your
ancestors (i.e., the system dies when you expose its internals).
The symmetry in software systems manifests in repeated call patterns across use
cases. You should expect symmetry, and its absence is a cause for concern. For
example, suppose a Manager implements four use cases, three of which publish an
event with the Pub/Sub service and the fourth of which does not. That break of
symmetry is a design smell. Why is the fourth case different? What are you missing
or overdoing? Is that Manager a real Manager, or is it a functionally decomposed
component without volatility? Symmetry can also be broken by the presence of
something, not just by its absence. For example, if a Manager implements four
use cases, of which only one ends up with a queued call to another Manager, that
asymmetry is also a smell. Symmetry is so fundamental for good design that you
should generally see the same call patterns across Managers.
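To illustrate the kind of symmetry to look for (the names and the particular pattern are assumptions, not a prescription), consider a Manager whose use cases all follow the same call pattern of validate, store, and publish:

interface IValidationEngine { boolean isValid(String request); }
interface IOrderAccess { void save(String request); }
interface IPubSub { void publish(String eventName); }

// Every use case follows the same call pattern: validate, then store, then publish.
// A use case that skipped the publish step, or that alone queued a call to another
// Manager, would break the symmetry and deserve a second look.
class OrderManager {
    private final IValidationEngine engine;
    private final IOrderAccess access;
    private final IPubSub pubSub;
    OrderManager(IValidationEngine engine, IOrderAccess access, IPubSub pubSub) {
        this.engine = engine; this.access = access; this.pubSub = pubSub;
    }
    void addOrder(String request)    { run(request, "OrderAdded"); }
    void updateOrder(String request) { run(request, "OrderUpdated"); }
    void cancelOrder(String request) { run(request, "OrderCancelled"); }
    private void run(String request, String eventName) {
        if (engine.isValid(request)) {
            access.save(request);
            pubSub.publish(eventName);
        }
    }
}

public class SymmetrySketch {
    public static void main(String[] args) {
        OrderManager manager = new OrderManager(
            request -> true,                                        // trivial Engine
            request -> System.out.println("saved " + request),      // trivial ResourceAccess
            event -> System.out.println("published " + event));     // trivial Pub/Sub Utility
        manager.addOrder("order-1");
        manager.cancelOrder("order-1");
    }
}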
4
COMPOSITION
Your software system’s reason for being is to service the business by addressing its
customers’ requirements and needs. The previous two chapters discussed how to
decompose the system into its components to create its architecture. The decompo-
sition into components is inherently a static layout of the system, like a blueprint.
During execution, the system is dynamic, and the various components interact with
each other. But how do you know the composition of these components at run-
time adequately satisfies all the requirements? Validating your design has to do with
requirements analysis, system design, and your added value as the architect. Design
validation and composition, as you will see, are intimately related. You can and must
be able to produce a viable design and validate it in a repeatable manner.
This chapter provides you with the tools to verify that the system not only
addresses the current requirements but also can withstand future changes to the
requirements. That objective requires first recognizing the nature of requirements
and change, and how both relate to system design. This recognition, in turn, yields
a fundamental observation about system design along with practical recommenda-
tions for producing a valid design.
RESENTING CHANGE
As wonderful as changes to requirements are, many in the industry have spent their
entire career resenting such changes. The reason is simple: Most developers and
architects design their system against the requirements. In fact, they go to great
lengths to transcribe the requirements to components of the architecture. They strive
to maximize the affinity between the requirements and the system design. However,
when the requirements change, their design also must change. In any system, a
change to the design is very painful, often destructive, and always expensive. Since
nobody likes pain (even when it is self-inflicted, as in this case), people have learned
to resent changes to requirements, literally resenting the hand that feeds them.
The directive, then, is simple: never design against the requirements. This directive goes contrary to what most have been taught and have prac-
ticed, even though it should have been plainly evident for all to see. Any attempt
at designing against the requirements will always guarantee pain. Since pain is
bad, there should be no excuse for doing something that is so ill advised. People
may even be fully aware that their design process cannot work and has never
worked, but lacking alternatives they resort to the one option they know—to
design against the requirements.
Note The perils of designing against the requirements are not limited to
software systems. Chapter 2 discussed the maddening experience of build-
ing a house while designing against the required functionality.
Futility of Requirements
As discussed in Chapter 3, the correct way of capturing the requirements is in
the form of use cases: the required set of behaviors of the system. A decent-sized
system has dozens of these use cases, and large systems may have hundreds. At
the same time, no one in the history of software has ever had the time to correctly
spec-out hundreds of use cases at the beginning of the project.
Suppose that on day 1 of a new project you are given a folder with 300 use cases.
Can you trust that this collection is correct and complete? Would you be surprised
to learn that the real number was actually 330 use cases and that you are missing a
few use cases? If you are given 300 use cases, will you be shocked to learn that the
real number was actually 200 because the requirements spec contains many dupli-
cates? In this case, if you design against the requirements, will you not be doing
at least 50% more work? Is it impossible for you to receive a set of use cases in
which some of the use cases are mutually exclusive? How about the risk of having
defective use cases that compel you to implement the wrong behavior?
Even if by some miracle someone did take the considerable time needed to cor-
rectly capture all 300 use cases in activity diagrams, to confirm there are no miss-
ing use cases, to reconcile the mutually exclusive use cases, and to consolidate the
duplicate use cases, that effort will be of little value because requirements will
change. Over time you will have new requirements, some existing requirements
will be removed, and other requirements will just change in place. In short, any
attempt of gathering the complete set of requirements and designing against them
is an exercise in futility.
COMPOSABLE DESIGN
Before prescribing the correct way of satisfying the requirements, you have to set
the bar correctly for satisfying the requirements. The goal of any system design
is to be able to satisfy all use cases. The word all in the previous sentence really
means all: present and future, known and unknown use cases. This expectation
is where the bar is set. Nothing less will do. If you fail to pass that bar, then at
some point in the future, when the requirements change, your design will have to
change. The hallmark of a bad design is that when the requirements change, the
design has to change as well.
While the system may have hundreds of use cases, the saving grace is that the
system will have only a handful of core use cases. In our practice at IDesign, we
commonly see systems with surprisingly few core use cases. Most systems have as
few as two or three core use cases, and the number seldom exceeds six. Reflect on
your system at the office or on a recent project with which you were involved and
count in your head the number of truly distinct use cases the system is required to
handle. You will find that this number is small, very small. Alternatively, bring up
a single-page marketing brochure for the system and count the number of bullets.
You will likely have no more than three bullets.
Instead of designing against the requirements, identify the smallest set of components
that you can integrate in different ways to satisfy the core use cases.
I call this approach composable design. A composable design does not aim to sat-
isfy any use case in particular.
You do not target any use case in particular not just because the use cases you were
given were incomplete, faulty, and full of holes and contradictions, but because
they will change. Even if the existing use case will not change, over time you will
have new use cases added and others removed.
A simple example is the design of the human body. Homo sapiens appeared on the
plains of Africa more than 200,000 years ago, when the requirements at the time
did not include being a software architect. How can you possibly fulfill the require-
ments for a software architect today while using the body of a hunter-gatherer? The
answer is that while you are using the same components as a prehistoric man, you
are integrating them in different ways. The single core use case has not changed
with time: survive.
Architecture Validation
Since the goal of any system is to satisfy the requirements, composable design
enables something else: design validation. Once you can produce an interaction
between your services for each core use case, you have produced a valid design.
There is no need to know the unknown or to predict the future. Your design can
now handle any use case, because all use cases manifest themselves only as differ-
ent interactions between the same building blocks. Stop yearning for some mythi-
cal project where one day someone will give you all the requirements complete and
properly documented. There is no point in wasting an inordinate amount of time up
front trying to nail down the requirements in minute detail. You can easily design
valid systems even with grossly impaired requirements.
[Figure 4-1 (Simple call chain demonstrating support of a core use case): Client A calls Manager A, which interacts with Engine A, Engine B, ResourceAccess A, Resource A, and the Pub/Sub service.]
Call chain diagrams are a simple and quick way of examining a use case and
demonstrating how the design supports it. The downside of call chain diagrams is
that they have no notion of call order, they have no way of capturing the duration
of the calls, and they get confusing when multiple parties make multiple calls to
the same type of components. In many cases, the interaction between the compo-
nents may be simple, so you do not need to show order, duration, or multiple calls.
For these cases, you may decide that a call chain diagram is good enough for the
validation purpose. Also, call chains are often easier for nontechnical audiences
to understand.
In a sequence diagram, each participating component in the use case has a vertical
bar representing its lifeline. The vertical bars correspond to some work or activity
that the component performs. Time flows from top to bottom of the diagram,
and the length of the bars indicates the relative duration of the component’s use.
A single component may participate multiple times in the same use case, and you
1. https://en.wikipedia.org/wiki/Sequence_diagram
Figure 4-2 Demonstrating support for a core use case with a sequence diagram
can even have different lifelines for different instances of the same component.
The horizontal arrows (solid black for synchronous and dashed gray for queued)
indicate calls between components.
Sequence diagrams take longer to produce due to the additional level of details
they offer, but they are often the right tool to demonstrate a complex use case,
especially for a technical audience. In addition, sequence diagrams are extremely
useful in subsequent detailed design for helping to defi ne interfaces, methods, and
even parameters. If you are going to produce them for the detailed design, you
might as well produce them first for design validation, albeit with fewer details
(e.g., omit operations and messages for now).
Smallest Set
Remember: Your mission as the architect is to identify not just a set of components
that you can put together to satisfy all the core use cases, but the smallest set of
components. Why smallest? And what does smallest even mean?
In general, you should produce an architecture that minimizes, rather than max-
imizes, the amount of work involved in detailed design and implementation. Less
is more when it comes to architecture. That said, a natural constraint exists on the
number of components in any architecture. For example, suppose you are given a
requirements spec with 300 use cases. On the one hand, a single-component archi-
tecture satisfying these requirements constitutes the ultimate smallest number of
components, but such a monolith is a horrible design due to its internal complexity
(see Appendix B for an in-depth discussion of the effect of service size on cost). On
the other hand, if you create an architecture consisting of 300 components, each
corresponding to a single use case, that is not a good design either, due to the high
integration cost. Somewhere between 1 and 300 components is the good enough
number.
The smallest set of services required in a typical software system is on the order of
10 services (e.g., sets of both 12 and 20 are on the order of 10). This par-
ticular order of magnitude is another universal design concept. How many internal
components does your body have, as an order of magnitude? Your car? Your laptop?
For each, the answer is about 10 because of combinatorics. If the system supports
the required behaviors by combining the 10 or so components, a staggering number
of such combinations becomes possible even without allowing repetition of partici-
pating components or partial sets. As a result, even a small number of valid internal
components can support an astronomical number of possible use cases.
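A rough back-of-the-envelope check on this combinatorics claim, assuming for simplicity that an interaction is either a subset of the components or an ordering of all of them:

public class CombinationsSketch {
    public static void main(String[] args) {
        int components = 10;
        // Non-empty subsets of 10 components: 2^10 - 1 = 1,023 possible combinations.
        long subsets = (1L << components) - 1;
        System.out.println(subsets + " possible non-empty combinations");
        // If call order matters, the orderings of just the full set alone are 10! = 3,628,800.
        long orderings = 1;
        for (int i = 2; i <= components; i++) {
            orderings *= i;
        }
        System.out.println(orderings + " orderings of all ten components");
    }
}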
Once you have settled on the core use cases and the areas of volatility, how long will it take you to
produce a valid design using The Method? You can use orders of magnitude here
as well: Is it more like an hour? A day? A week? A month? A year? I expect most
readers of this book to pick a day or a week, and with practice you can bring the
time down to a few hours. Design is not time-consuming if you know what you
are doing.
THERE IS NO FEATURE
Putting together the observations of this chapter along with the previous two
chapters reveals this fundamental system design rule: features are always and everywhere aspects of integration, not implementation.
This is a universal design rule which governs the design and implementation of
all systems. As mentioned in Chapter 2, the very nature of the word “universal”
includes software systems.
What is even more impressive with this rule is that it is a fractal. For example, I
am typing the manuscript for this book right now on a laptop, which provides me
with a very important feature: word processing. But is there any box in the archi-
tecture of the laptop called Word Processing? The laptop provides the feature
of word processing by integrating the keyboard, the screen, the hard drive, the
bus, the CPU, and the memory. Each of these components provides a feature, too:
The CPU provides computation, and the hard drive provides storage. Yet if you
examine the feature of storage, is there a single block in the drive’s design called
Storage? The hard drive provides the storage feature by integrating internal
components such as memory, the internal data bus, media, cables, ports, power
regulators, and small screws that hold everything together. The screws themselves
provide a very important feature: fastening. But how does a screw provide fasten-
ing? The screw performs the fastening by integrating the head of the screw, the
thread, and its stem. The integration of these provides the fastening feature. You
can keep drilling down this way all the way to the quarks, and you will never see
a feature.
Read the design rule just given a second time. If you still find it hard to accept, you
have been plugged into a matrix that is telling you to write code that implements a
feature. Doing so goes against the way the universe is actually put together. There
is no feature.
HANDLING CHANGE
Your software system must respond to changes in the requirements. Most soft-
ware systems are implemented using functional decomposition, which maximizes
the effects of the change. If the design has been based on features, the change, by
definition, is never in one place. Instead, it is spread across multiple components
and aspects of the system. With functional decomposition, change is expensive
and painful so people do their best to avoid the pain by deferring the change. They
will add the change request to the next semi-annual release because they prefer
to take future pain over present pain. They may even fight the change outright by
explaining to the customer that the requested change is a bad idea.
Unfortunately, fighting the change is tantamount to killing the system. Live sys-
tems are systems that customers use, and dead systems are systems that customers
do not use (even if they still pay for them). When developers tell customers that
the change will be part of a future release, what do they expect the customers to
do in the subsequent six months until the developers roll out the requested change?
The customers do not want the feature six months in the future—they need the
feature now. Consequently, the customers will have to work around the system by
using the legacy system, or some external medium, or a competing product. Since
fighting the change results in pushing customers away from using the system, fight-
ing the change is killing the system. Part and parcel of responding to the change is
responding quickly, even if that aspect was never explicitly stated.
Recall from Chapter 3 that the Manager should be almost expendable. This enables
you to absorb the cost of the change, to contain it. Furthermore, the bulk of the
effort in any system typically goes into the services that the Manager uses rather than into the Manager itself.
When a change happens to the Manager, you get to salvage and reuse all the effort
that went into the Clients, the Engines, the ResourceAccess, the Resources, and
the Utilities. By reintegrating these services in the Manager, you have contained
the change and can quickly and efficiently respond to changes. Is that not the
essence of agility?
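A sketch of what containing the change can look like in code (the component names are invented for illustration): a new requirement changes only the Manager's workflow, while the Engine and ResourceAccess underneath it are reused untouched.

interface IPricingEngine { double rate(String job); }                  // reused as-is
interface IScheduleAccess { void book(String job, double rate); }      // reused as-is

// Original Manager: quote the rate, then book the job.
class MatchManagerV1 {
    private final IPricingEngine pricing;
    private final IScheduleAccess schedule;
    MatchManagerV1(IPricingEngine pricing, IScheduleAccess schedule) {
        this.pricing = pricing; this.schedule = schedule;
    }
    void match(String job) { schedule.book(job, pricing.rate(job)); }
}

// Changed requirement (say, a surcharge must be applied before booking): only the
// almost-expendable Manager changes; every other service is salvaged and reintegrated.
class MatchManagerV2 {
    private final IPricingEngine pricing;
    private final IScheduleAccess schedule;
    MatchManagerV2(IPricingEngine pricing, IScheduleAccess schedule) {
        this.pricing = pricing; this.schedule = schedule;
    }
    void match(String job) { schedule.book(job, pricing.rate(job) * 1.10); }
}

public class ContainedChangeSketch {
    public static void main(String[] args) {
        IPricingEngine pricing = job -> 50.0;
        IScheduleAccess schedule = (job, rate) -> System.out.println(job + " booked at " + rate);
        new MatchManagerV1(pricing, schedule).match("job-1");
        new MatchManagerV2(pricing, schedule).match("job-1");
    }
}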
5
SYSTEM DESIGN EXAMPLE
The previous three chapters presented the universal design principles for system
design. However, most people learn best by example. Therefore, this chapter
demonstrates the application of concepts from prior chapters with a comprehen-
sive example: a case study. The case study describes the design of a new system
called TradeMe, a replacement for a legacy system. The case study is derived
directly from an actual system that IDesign designed for one of its customers,
albeit with the specific business details scrubbed and obfuscated. The essence
of the system remains unchanged, from the business case to the decomposition:
I have not glossed over issues or tried to beautify the situation. As mentioned in
Chapter 1, design should not be time-consuming. In this case, the design was com-
pleted in less than a week by a two-person design team consisting of a seasoned
IDesign architect and an apprentice.
The goal of this case study is to show the thought process and the deductions used
to produce the design. These are often difficult to learn on your own, but are more
easily understood by watching somebody else do it while reasoning about what is
taking place. This chapter starts with an overview of the customer and the system,
then presents the requirements in the form of several use cases. The identification
of the areas of volatility and the architecture relies on The Method structure.
SYSTEM OVERVIEW
TradeMe is a system for matching tradesmen to contractors and projects. Tradesmen
may be plumbers, electricians, carpenters, welders, surveyors, painters, telephone
network technicians, gardeners, and solar panel installers, among others. They all
work independently and are self-employed. Each tradesman has a skill level, and
some, such as electricians, are certified by regulators to do certain tasks. The pay-
ment rate for the tradesman varies based on various factors such as discipline (weld-
ers are paid more than carpenters), skill level, years of experience, project type,
location, and even weather. Other factors affecting their work include regulatory
compliance issues (such as minimum wage or employment taxes), risk premium
(such as exterior work on skyscrapers or with high voltage), certification of trades-
men’s qualifications for certain kinds of task (such as welding girders or power grid
tie-in), reporting requirements, and more.
The contractors are general contractors, and they need tradesmen on an ad hoc
basis, from as little as a day to as long as a few weeks. Contractors often have a
base crew of generalists whom they employ outside the system on a full-time basis,
using TradeMe for the specialized work. On the same project, different tradesmen
are needed for different periods of time (one for a day, another for a week) at dif-
ferent times. Tradesmen can come and go on a single project.
The TradeMe system allows tradesmen to sign up, list their skills, their general
geographic area of availability, and the rate they expect. It also allows contractors
to sign up, list their projects, the required trades and skills, the location of the
projects, the rates they are willing to pay, the duration of engagement, and other
attributes of the project. Contractors can even request (but not insist upon) specific
tradesmen with whom they would like to work.
Other than the factors already mentioned, the rate the contractor is willing to
pay depends on supply and demand. When a project is idle, the contractor will
increase the price. When the tradesman is idle, the tradesman will lower the price.
Similar consideration is given to the duration or requested commitment. The ideal
project for a tradesman often pays a high rate and has a short duration. Once the
tradesmen have committed to a project, they have to stay for the amount of time to
which they committed. Contractors may offer more pay with longer commitments.
In general, the system lets market forces set the rate and fi nd equilibrium.
The projects are construction projects for buildings. The system may also be use-
ful in newly emerging markets, such as oil fields or marine yards.
TradeMe allows the tradesmen and the contractors to find one another. The sys-
tem processes the requests and dispatches the required tradesmen to the work
sites. It also keeps track of the hours and wages, and the rest of the reporting to
the authorities, saving both contractors and tradesmen the hassle of handling these
tasks themselves.
The system isolates tradesmen from contractors. It collects funds from the con-
tractors and pays the tradesmen. Contractors cannot bypass the system and hire
the tradesmen directly because the tradesmen have exclusivity with the system.
The TradeMe system aims to find the best rate for the tradesmen and the most
availability for the contractors. It makes money on the small spread between the
ask rate and the bid rate. Another source of income is the membership fee that both
tradesmen and contractors pay. The fee is collected annually but that could change.
Consequently, both tradesmen and contractors are called members in the system.
Presently, nine call centers handle the majority of the assignments. Each call cen-
ter is specific to a particular locale, regulations, building codes, standards, and
labor laws. The call centers are staffed with account representatives called reps.
The reps today rely on experience to optimize the scheduling across all projects
and available tradesmen. Some call centers operate as their own business, whereas
others are operated by the same business.
There is also at least one competing application geared more toward finding the
cheapest tradesmen, and some contractors prefer that system. Contractors opting
for tradesmen based on price as opposed to availability could be a growing trend.
LEGACY SYSTEM
The legacy system, which is deployed in European call centers, has full-time users
who rely on a two-tier desktop application connected to a database. Both trades-
men and contractors call in, with the reps entering the details and even performing
the matching in real time. Some rudimentary web portals for managing member-
ship bypass the legacy system and work with the database directly. The various
subsystems are isolated and very inefficient, requiring a lot of human intervention
at almost every step. Users are required to employ as many as five different appli-
cations to accomplish their tasks. These applications are independent, and the
integration is done by hand. The client applications are chock-full of business
logic, and the lack of separation between UI and business logic prevents updating
the applications to modern user experience.
Each subsystem even has its own repository, and the users have to reconcile them
to make sense of it all. This process is error prone and imposes expensive training
and onboarding time for new users.
The legacy system is vulnerable, and its haphazard approach to security exposes
it to many possible attack vectors. The legacy system was never designed with
security in mind. For that matter, it was never designed at all, but rather grew
organically.
The legacy system simply cannot accommodate several new features and desirable
capabilities.
Both the business and users are frustrated with the legacy system’s inability to
keep up with the times, and there is a never-ending stream of desired value-added
features. One such feature, continuing education, turned out to be a must-have, so
it was cobbled on top of the legacy system. The legacy system assigns tradesmen
to certification classes and government-required tests and tracks the tradesmen’s
progress. Although external education centers provide training and register the
certifications, the users have to manually connect them with the legacy system.
While unrelated to the core system aspects, tradesmen are really keen on this fea-
ture, as is the business, because the certification feature helps prevent tradesmen
from moving to the competitors.
The legacy system is having trouble complying with new legislation across locales.
Dealing with any change is very difficult, and the system is highly specific for its
current business context. Since the company cannot afford to support a unique
version of the system per locale, it created an incentive to dumb down the system
to the lowest common denominator across locales. This further increases the bur-
den on the users in terms of their manual workflows, which decreases efficiency,
increases training time and costs, and causes loss of business opportunities.
Overall, the system has some 220 reps across all locations. Neither scalability nor
throughput poses a problem. However, responsiveness is an issue, although this
may just be a side effect of the legacy system.
NEW SYSTEM
Given the issues of the poorly designed legacy, the company’s management is
interested in designing a new system correctly. The new system should automate
the work as much as possible. Ideally, the company would like to have a single,
small call center that is used as a backup for an automated process. This call cen-
ter would use a single system across all locales. While the system is deployed in
Europe, there are requests to deploy the system in the United Kingdom1 and even
Canada (i.e., outside the European Union). Another driver for investing in the
new system is that the competitors have much more flexible, efficient systems, with
superior user experience.
THE COMPANY
The company views itself as a tradesmen broker, not as a software organization.
Software is not its business. In the past the company did not acknowledge what
it would really take to develop great software. The company did not devote ade-
quate effort to process or development practices. The company’s attempts to build
a replacement system in the past all failed. What the company does have is plenty
of financial resources—the legacy application is very profitable. The bitter lessons
of the past have convinced management to turn a new page and adopt a sound
approach for software development.
1. While this design effort took place prior to Brexit (the departure of the United Kingdom from the EU),
Brexit is a classic example of a massive change that was unanticipated at the time, yet the new system
accommodated it seamlessly.
USE CASES
There were no existing requirements documents for either the old system or the
new system. The customer was able to provide Figures 5-1 to 5-8, depicting some
use cases. These may or may not be core use cases; they are simply the required
behaviors of the system. To a large extent, the use cases reflected what the legacy
system was supposedly doing. Since the design team was looking for core use cases,
they ignored low-level use cases such as entering financial details, collecting fees
from contractors, and distributing payments to tradesmen. Some use cases, such as
continuing education, were not even specified. Moreover, there was clearly room
for additional use cases complementing the use cases provided by the company.
[Figures 5-1 through 5-8: Activity diagrams of the use cases provided by the customer. Recurring steps include verifying an application or request (Valid? yes/no), processing payment, requesting a tradesman, requesting a match, requesting termination, finding scheduled payments and paying tradesmen, and closing a project.]
Note Even though for design validation you need to support just the
core use cases, that does not mean you should ignore the other use cases—
far from it. A great way of demonstrating the versatility of your design is
to show how easy it will be to support all the other use cases and anything
else the business might require of the system.
It is useful to show the flow of control between roles, organizations, and other
responsible entities, using “swim lanes” in your activity diagrams. For example,
Figure 5-9 provides an alternative way of expressing the Terminate Tradesman use
case from Figure 5-5.
[Figure 5-9: The Terminate Tradesman use case expressed with swim lanes, showing request validation (Valid? yes/no, with an error path) and marking the tradesman as available.]
You transform the raw use case by subdividing the activity diagram into areas of
interactions. This also helps clarify the required behavior of the system by adding
decision boxes or synchronization bars as required. You will see how to use the
swim lanes technique later on in this chapter to both initiate and validate the design.
THE MONOLITH
One simple anti-design example is a god service—an ugly dumping ground of all
the functionalities in the requirements, all implemented in one place. While this
design is so common it even has a name (the Monolith), by now most people have
learned the hard way not to design this way.
With so many fine-grained blocks, the Clients become responsible for implement-
ing the business logic of the use cases, as shown in Figure 5-11.
[Figure: fine-grained functional services (verify tradesman application, verify contractor application, verify tradesman request, verify match request, verify assign request, verify termination request, find scheduled payments, activate project, close project) called by the Client against the Database.]
Contaminating
the Client’s code with business logic results in a bloated Client in which the entire
system migrates to the Client, as shown in Figure 2-1.
Alternatively, you can have the services call each other, as shown in Figure 5-12.
However, chaining the highly functional services together in this way creates cou-
pling between them, as depicted in Figure 2-5. Note also in Figure 5-12 the open
architecture issues of calling up and sideways.
[Figure 5-12: Functional services such as Process Payment and Approve Tradesman chained to one another, with the Client and the Database.]
DOMAIN DECOMPOSITION
Another classic anti-design is to decompose along the domain lines, as shown in
Figure 5-13. Here the system is decomposed along the domain lines of Tradesman,
Contractor, and Project.
[Figure 5-13: Domain decomposition into Tradesman, Contractor, and Project services between the Client and the Database.]
Even with a relatively simple system such as TradeMe, there are nearly lim-
itless additional possibilities for domain decomposition, such as Accounts,
Administration, Analytics, Approval, Assignment, Certificates,
Contracts, Currency, Disputes, Finance, Fulfillment, Legislation,
Payroll, Reports, Requisition, Staffing, Subscription, and so on.
Who is to say that Project is a better candidate for a domain than Accounts?
And which criteria should be used to make that judgment?
BUSINESS ALIGNMENT
It is of the utmost importance to recognize that the architecture does not exist for
its own sake. The architecture (and the system) must serve the business. Serving
the business is the guiding light for any design effort. As such, you must ensure
that the architecture is aligned with the vision that the business has for its future
and with the business objectives. Moreover, you must have complete bidirectional
traceability from the business objectives to the architecture. You must be able to
easily point out how each objective is supported in some way by the architecture,
and how each aspect of the architecture is derived from some objectives of the
business. The alternatives are pointless designs and orphaned business needs.
As discussed in the previous chapters, the architect producing the design has to first
recognize the areas of volatility and then encapsulate these areas in system compo-
nents, operational concepts, and infrastructure. The integration of the components
is what supports the required behaviors, and the way the integration takes place is
what realizes the business objectives. For example, if a key objective is extensibility
and flexibility, then integrating the components over a message bus is a good solu-
tion (more on this point later). Conversely, if the key objective is performance and
simplicity, introducing a message bus contributes too much complexity.
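To make that trade concrete, here is a minimal sketch in which an in-memory bus stands in for a real message bus or broker (an assumption for illustration). A new subsystem can subscribe to existing messages without any change to the components that publish them, which is exactly the extensibility such an objective calls for.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// In-memory stand-in for a message bus; a real deployment would use a broker or
// service bus, but the integration pattern is the same.
class MessageBus {
    private final List<Consumer<String>> subscribers = new ArrayList<>();
    void subscribe(Consumer<String> handler) { subscribers.add(handler); }
    void publish(String message) { subscribers.forEach(s -> s.accept(message)); }
}

public class BusIntegrationSketch {
    public static void main(String[] args) {
        MessageBus bus = new MessageBus();
        bus.subscribe(m -> System.out.println("Membership handles " + m));   // existing subscriber
        bus.subscribe(m -> System.out.println("Analytics handles " + m));    // added later, no publisher change
        bus.publish("TradesmanAssigned");
    }
}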
The rest of this chapter provides a detailed walkthrough of the steps that transform
the business needs into a design for TradeMe. These steps start with capturing the
system vision and the business objectives, which then drive the design decisions.
THE VISION
Seldom will everyone in any environment share the same vision as to what the
system should do. Some may have no vision at all. Others may have a different
vision than the rest or a vision that serves only their narrow interests. Some may
misinterpret the business goals. The company behind TradeMe was stymied by a
myriad of additional issues resulting from its failure to keep up with the chang-
ing market. These issues were reflected in the existing systems, in the company’s
structure, and in the way software development was set up. The new system had
to tackle all the issues head-on rather than in a piecemeal fashion, because solving
just some of them was insufficient for success.
The first order of business is to get all stakeholders to agree on a common vision.
The vision must drive everything, from architecture to commitments. Everything
that you do later has to serve that vision and be justified by it. Of course, this cuts
both ways—which is why it is a good idea to start with the vision. If something
does not serve the vision, then it often has to do with politics and other secondary
or tertiary concerns. This provides you with an excellent way of repelling irrele-
vant demands that do not support the agreed-upon vision. In the case of TradeMe,
the design team distilled the vision to a single sentence:
A good vision is both terse and explicit. You should read it like a legal statement.
Note that the vision for TradeMe was to build a platform on which to build the
applications. This kind of platform mindset addressed the diversity and extensibil-
ity the business craved and may be applicable in systems you design.
THE BUSINESS OBJECTIVES
1. Unify the repositories and applications. The legacy system had entirely too
many inefficiencies, requiring a lot of human intervention to keep the system
up to date and running.
2. Quick turnaround for new requirements. The legacy turnaround time for fea-
tures was abysmal. The new platform had to allow very fast, frequent custom-
ization, often tailored just for a specific skill, time of the week, project type,
and any combination of these. Ideally, much of this quick turnaround should
be automated, from coding to deployment.
3. Support a high degree of customization across countries and markets. Local-
ization was an incredible pain point because of differences in regulations,
legislations, cultures, and languages.
4. Support full business visibility and accountability. Fraud detection, audit
trails, and monitoring were nonexistent in the legacy system.
5. Forward looking on technology and regulations. Instead of being in perpetual
reactive mode, the system must anticipate change. The company envisioned
that this was how TradeMe would defeat the competitors.
6. Integrate well with external systems. Although somewhat related to the previ-
ous objective, the objective here is to enable a high degree of automation over
previously laborious manual processes.
7. Streamline security. The system must be properly secured, and literally every
component must be designed with security in mind. To meet the security
objective, the development team must introduce security activities such as
security audits into the software life cycle and support it in the architecture.
Note Development cost was not an objective for this system. While no
one likes wasting money, the business pain was in the items listed here, and
the company could afford even an expensive solution that addressed the
objectives.
MISSION STATEMENT
It may come as a surprise, but articulating the vision (what the business will
receive) and the objectives (why the business wants the vision) is often insufficient.
People are usually too mired in the details and cannot connect the dots. Thus,
you should also specify a mission statement (how you will do it). The TradeMe
Mission Statement was:
This mission statement deliberately does not identify developing features as the
mission. The mission is not to build features—the mission is to build components.
It now becomes much easier to justify volatility-based decomposition that serves
the mission statement because all the dots are connected:
In fact, you have just compelled the business to instruct you to design the right
architecture. This is a reversal of the typical dynamics, in which the architect
pleads with management to avoid functional decomposition. It is a lot easier to
drive the correct architecture through the business by aligning the architecture
with the business’s vision, its objectives, and the mission statement. Once you have
them agree on the vision, the objectives, and then the mission statement, you have
them on your side. If you want the business people to support your architecture
effort, you must demonstrate how the architecture serves the business.
THE ARCHITECTURE
Misunderstanding and confusion are endemic to software development and
often lead to conflict or unmet expectations. Marketing may use different terms
than engineering for the same thing or—even worse—may use the same term but
mean a different thing. Such ambiguities may go undetected for years. Before you
dive into the act of system design, ensure everyone is on the same page by compil-
ing a short glossary of domain terminology.
TRADEME GLOSSARY
A good way of starting a glossary is to answer the four classic questions of “who,”
“what,” “how,” and “where.” You answer the questions by examining the system
overview, the use cases, and customer interview notes, if you have any. For TradeMe,
the answers to the four questions were as follows:
• Who
- Tradesmen
- Contractors
- TradeMe reps
- Education centers
- Background processes (e.g., a scheduler for payments)
• What
- Membership of tradesmen and contractors
- Marketplace of construction projects
- Certificates and training for continuing education
• How
- Searching
- Complying with regulations
- Accessing resources
• Where
- Local database
- Cloud
- Other systems
Recall from Chapter 3 that you often can map the answers to the four questions
to layers, if not to components of the architecture itself.
The list of the “what” is of particular interest because it hints strongly at possible
subsystems or the swim lanes mentioned previously. You can use the swim lanes and
the answers to seed and initiate your decomposition effort as you look for areas of vol-
atility. This does not preclude having additional subsystems or imply that these will
necessarily be all the subsystems needed—you always decompose based on volatility,
and if a “what” is not volatile, then it will not merit a component in the architecture.
At this point all it provides is a nice starting point to reason about your design.
There is nothing wrong with suggesting certain areas of volatility, and then exam-
ining the resultant architecture. If the result produces a spiderweb of interactions
or is asymmetric, then the design is unlikely to be good. You will probably sense
whether the design is correct or not.
Sometimes an area of volatility may reside outside the system. For example, while
payments may very well be a volatile area due to the various ways in which you
could issue payments, TradeMe as a software project was not about implementing
a payment system. The payments are ancillary to the core value of the system.
The system will likely use a number of external payment systems as Resources.
Resources may be whole systems, each with its own volatilities, but these are out-
side the scope of this system.
The design team produced the following list of areas volatile enough to affect the
architecture. The list also identifies the corresponding components of the architec-
ture that encapsulate the areas of volatility:
• Client applications. The system should allow several distinct client environments
to evolve separately at their own pace. The clients cater to different users (trades-
men, contractors, marketplace reps, or education centers) or to background pro-
cesses, such as a timer that periodically interacts with the system. These client
applications may use different UI technologies, devices, or APIs (perhaps the edu-
cation portal is a mere API); they may be accessed locally or across the Internet
(tradesmen versus reps); they may be connected or disconnected; and so on. As
expected, the clients are associated with a lot of volatility. Each one of these vola-
tile client environments is encapsulated in its own Client application.
• Managing membership. There is volatility in the activities of adding or remov-
ing tradesmen and contractors, and even the benefits or discounts they get.
Membership management changes across locales and over time. These volatili-
ties are encapsulated in the Membership Manager.
• Fees. All the possible ways TradeMe can make money, combining volume and
spread, are encapsulated in the Market Manager.
• Projects. Project requirements and size not only change but also are volatile and
affect the required behavior. Small projects may require different workflows
from large projects. The system encapsulates projects in the Market Manager.
• Disputes. When dealing with people, at best misunderstandings will arise; at
worst outright fraud happens. The volatility in handling dispute resolution is
encapsulated by the Membership Manager.
• Matching and approvals. Two volatilities come into play here. The volatility of
how to find a tradesman that matches the project needs is encapsulated in the
Search Engine. The volatility of search criteria and the definition thereof is
encapsulated in the Market Manager.
• Education. There is volatility in matching a training class to a tradesman and
in searching for an available class or a required class. Managing the education
workflow volatility is encapsulated in the Education Manager. Searching for an available or required class is encapsulated in the Search Engine.
Weak Volatiles
Two additional, weaker areas of volatility are not reflected in the architecture:
• Notification. How the Clients communicate with the system and how the sys-
tem communicates with the outside world could be volatile. The use of the
Message Bus Utility encapsulates that volatility. If the company had a strong
need for open-ended forms of transports such as email or fax, then perhaps a
Notification Manager would have been necessary.
• Analysis. TradeMe could analyze the requirements of projects and verify
the requested tradesmen or even propose them in the first place. In this way,
TradeMe could optimize the tradesmen assignment to projects. The system
could analyze projects in various ways, with such analysis clearly being a vola-
tile area. However, the design team rejected analysis as an area of volatility in
the design because, as stated, the company is not in the business of optimizing
projects. Providing optimizations, therefore, falls into speculative design. Any
analysis activity required is folded into the Market Manager.
STATIC ARCHITECTURE
Figure 5-14 shows the static view of the architecture.
The Clients
The client tier contains a portal for each type of member, the tradesmen and the
contractors. There is also a portal for the education center to issue or validate
tradesman credentials and a marketplace application for the back-end users to
administer the marketplace. In addition, the client tier contains external processes
such as a scheduler or a timer that periodically initiates some behavior with the
system. These are included in the architecture for reference, but are not part of
the system.
The Engines
There are only two Engines, which encapsulate some of the acute volatilities listed
previously. The Regulation Engine encapsulates the regulation and compli-
ance volatility between different countries and even in the same country over time.
The Search Engine encapsulates the volatility in producing a match, something
that can be done in an open-ended number of ways, ranging from a simple rate
lookup, to safety and quality record considerations, to AI and machine learning
techniques for the assignments.
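To make the encapsulation concrete, the following sketch (in TypeScript) shows how a Manager could consume the two Engines strictly through contracts, so that regulation rules and matching strategies can vary without rippling into the Manager. The type names and members here are illustrative assumptions for this sketch, not part of the TradeMe design itself.

// Illustrative only: hypothetical contracts for the two Engines.
interface TradesmanProfile {
  id: string;
  trade: string;
  country: string;
  certifications: string[];
}

interface ProjectNeed {
  trade: string;
  country: string;
  startDate: Date;
}

// Encapsulates regulation and compliance volatility across countries and over time.
interface IRegulationEngine {
  isCompliant(tradesman: TradesmanProfile, need: ProjectNeed): Promise<boolean>;
}

// Encapsulates the volatility of how a match is produced (rate lookup, safety
// records, machine learning, and so on). Callers never see the strategy.
interface ISearchEngine {
  findCandidates(need: ProjectNeed): Promise<TradesmanProfile[]>;
}

// A Manager consumes the contracts; swapping the matching strategy or the
// regulation rules does not ripple into the Manager's workflow code.
class MarketManager {
  constructor(
    private readonly regulation: IRegulationEngine,
    private readonly search: ISearchEngine
  ) {}

  async requestTradesman(need: ProjectNeed): Promise<TradesmanProfile[]> {
    const candidates = await this.search.findCandidates(need);
    const compliant: TradesmanProfile[] = [];
    for (const candidate of candidates) {
      if (await this.regulation.isCompliant(candidate, need)) {
        compliant.push(candidate);
      }
    }
    return compliant;
  }
}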
Utilities
The system requires three Utilities: Security, Message Bus, and Logging.
Any future Utilities (e.g., instrumentation) would go in the Utility bar as well.
MESSAGE BUS
A message bus is merely a queued Pub/Sub (Figure 5-15). Any message posted
to the bus is broadcast to any number of subscribing parties. As such, a message
bus provides for general-purpose, queued, N:M communication, where N and M
can be any non-negative integers. If the message bus is down, or if the posting
party becomes disconnected, messages are queued in front of the message bus,
and then processed when connectivity is restored. This provides for availabil-
ity and robustness. If the subscribing party is down or disconnected (such as
a mobile device), the messages are posted to a private queue per subscriber,
and are processed when the subscriber becomes available. If both the posting
publisher and the subscriber are connected and available, then the messages are
asynchronous.
The choice of technology for the message bus has little to do with architecture
and, therefore, is outside the scope of this book. However, specific features pro-
vided by a particular message bus may greatly affect ease of implementation, so
choosing the right one requires careful consideration. Not all message buses are
created equal, including those from brand-name vendors. At the very least, the
message bus must support queuing, duplicating messages and multicast broad-
casting, headers and context propagation, securing both posting and retriev-
ing of messages, off-line work, disconnected work, delivery failure handling,
processing failure handling, poison message handling, transactional process-
ing, high throughput, a service-layer API, multiple-protocol support (especially
non-HTTP-based protocols), and reliable messaging. Optional features that
may be relevant include message filtering, message inspection, custom inter-
ception, instrumentation, diagnostics, automated deployment, easy integration
with credentials stores, and remote configuration. No single message bus prod-
uct will provide all these features. To mitigate the risk of choosing poorly, you
should start with a plain, easy-to-use, free message bus, and implement the
architecture initially with that message bus. This tactic allows you to better
understand the desired qualities and attributes, and to prioritize them. Only
then can you choose the best of breed that truly meets your needs.
Adding a message bus to your architecture does not eliminate the need to
impose architectural constraints on the communications patterns. For example,
you should disallow Client-to-Client communication across the bus.
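The following minimal, in-process TypeScript sketch illustrates the queued Pub/Sub semantics described above. It is not a real message bus product; the names and the delivery model are assumptions, and the hard capabilities listed above (durability, transactions, security, poison-message handling, and so on) are deliberately omitted.

// A minimal in-process sketch of a queued Pub/Sub bus illustrating N:M delivery.
type Message = { topic: string; headers: Record<string, string>; body: unknown };
type Handler = (message: Message) => void;

class MessageBus {
  private subscribers = new Map<string, { handler: Handler; queue: Message[]; online: boolean }[]>();

  subscribe(topic: string, handler: Handler): void {
    const list = this.subscribers.get(topic) ?? [];
    list.push({ handler, queue: [], online: true });
    this.subscribers.set(topic, list);
  }

  // Every message is duplicated to all subscribers of the topic (multicast).
  publish(message: Message): void {
    for (const sub of this.subscribers.get(message.topic) ?? []) {
      sub.queue.push(message);          // queued per subscriber
      if (sub.online) this.drain(sub);  // delivered when the subscriber is available (always true in this sketch)
    }
  }

  private drain(sub: { handler: Handler; queue: Message[] }): void {
    while (sub.queue.length > 0) sub.handler(sub.queue.shift()!);
  }
}

// Usage: a Client posts a request; a Manager that happens to subscribe receives it.
const bus = new MessageBus();
bus.subscribe("membership/add-tradesman", (m) => console.log("Manager received", m.body));
bus.publish({ topic: "membership/add-tradesman", headers: {}, body: { name: "Jane" } });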
OPERATIONAL CONCEPTS
With TradeMe, all communication between all Clients and all Managers takes
place over the Message Bus Utility. Figure 5-16 illustrates this operational concept.
In this interaction pattern, the Clients and the business logic in the subsystems
are decoupled from each other by the Message Bus. Use of the Message Bus in
general supports the following operational concepts:
• All communication utilizes a common medium (the Message Bus). This encap-
sulates the nature of the messages, the location of the parties, and the commu-
nication protocol.
• No use case initiator (such as Clients) and use case executor (such as
Managers) ever interact directly. If they are unaware of each other, they can
evolve separately, which fosters extensibility.
• A multiplicity of concurrent Clients can interact in the same use case, with
each performing its part of the use case. There is no lock-step execution across
Clients and system. This, in turn, leads to timeline separation and decoupling
of the components along the timeline.
• High throughput is possible because the queues underneath the Message Bus
can accept a very large number of messages per second.
When using this design pattern, the “application” is nowhere to be found. There
is no collection of components or services that you can point to and identify as
the application. Instead, the system comprises a loose collection of services that
post and receive messages to one another (over a message bus, although that is
a secondary consideration). These messages are related to each other. Each service
processing a message does some unit of work, and then posts a message back to the
bus. Other services will subsequently examine the message, and some of them (or
one of them, or none of them) will decide to do something. In effect, the message
post by one service triggers another service to do something unbeknownst to the
posting service. This stretches decoupling almost to the limit.
Often the same logical message may traverse all the services. Likely the services
will add additional contextual information to the message (such as in the head-
ers), modify previous context, pass context from the old message to a new one,
and so on. In this way, the services act as transformation functions on the mes-
sages. The paramount aspect of the Message Is the Application pattern is that the
required behavior of the application is the aggregate of those transformations plus
the local work done by the individual services. Any required behavior changes
induce changes in the way your services respond to the messages, rather than in
the architecture or the services themselves.
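A small TypeScript sketch can make the idea concrete: each service handles a message, performs its local unit of work, enriches the context, and posts a new message back to the bus. The topics and fields below are illustrative assumptions, not the TradeMe message schema.

// A sketch of the Message Is the Application idea: services as transformation
// functions over messages. Each does local work, then reposts an enriched message.
type BusMessage = { topic: string; headers: Record<string, string>; body: unknown };
type Post = (message: BusMessage) => void;

function membershipService(message: BusMessage, post: Post): void {
  if (message.topic !== "tradesman/assigned") return;
  // Local unit of work (e.g., update the member's record), then enrich and repost.
  post({
    topic: "member/updated",
    headers: { ...message.headers, "processed-by": "MembershipManager" },
    body: message.body,
  });
}

function billingService(message: BusMessage, post: Post): void {
  if (message.topic !== "member/updated") return;
  // Another local unit of work, triggered unbeknownst to the previous poster.
  post({
    topic: "invoice/created",
    headers: { ...message.headers, "processed-by": "Billing" },
    body: message.body,
  });
}

// The required behavior is the aggregate of these transformations: adding or
// removing a subscribing service changes the behavior without touching existing ones.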
The business objectives for TradeMe justified the use of this pattern because of
the required extensibility. The company can extend the system by adding message
processing services, thereby avoiding modification of existing services and risk to
a working implementation. This correctly supports the directive from Chapter 3
that you should always build systems incrementally, not iteratively. The objective
of forward-looking design is also well served here because nothing in this pattern
ties the system to the present requirements. This pattern is also an elegant way of
integrating external systems—yet another business objective.
FUTURE-LOOKING DESIGN
The use of granular services integrated over a message bus with the Message
Is the Application design pattern is one of the best ways of preparing the sys-
tem for the future. By “preparing the system for the future,” I specifically refer
to the next epoch in software engineering, the use of the actor model. Over
the next decade the software industry will likely adopt a very granular use of
services called actors. While actors are services, they are very simple services.
Actors reside in a graph or grid of actors, and they only interact with each
other using messages. The resulting network of actors can perform calcula-
tions or store data. The program is not the aggregate of the code of the actors;
instead, the program or the required behavior consists of the progression of
messages through the network. To change the program, you change the net-
work of actors, not the actors themselves.
Building systems this way offers fundamental benefits such as better affinity to
real-life business models, high concurrency without locking, and the ability to
build systems that are presently out of reach, such as smart power grids, com-
mand and control systems, and generic AI. Using current-day technology and
platforms along with the Message Is the Application pattern is very much aligned—if
not most of the way there—with the actor model. For example, in TradeMe,
tradesmen and contractors are actors. Projects are networks of these actors,
and other actors (such as Market Manager) compose the network. Adopting
the TradeMe architecture today prepared the company for the future without
compromising on the present.a
a. For more on the actor model, see Juval Lowy, Actors: The Past and Future of Software Engineering
(YouTube/IDesignIncTV, 2017).
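To make the actor description above concrete, here is a toy TypeScript sketch: each actor owns a mailbox, processes one message at a time, and communicates only by sending messages to other actors. The wiring of the network, not the actors, defines the behavior. This is an illustration under simplifying assumptions (in-process, synchronous delivery), not an actor framework.

// A minimal sketch of an actor: a mailbox plus a behavior that only sends messages.
type ActorMessage = { kind: string; payload: unknown };

class Actor {
  private mailbox: ActorMessage[] = [];
  private processing = false;
  constructor(
    private readonly behavior: (msg: ActorMessage, send: (to: Actor, msg: ActorMessage) => void) => void
  ) {}

  send(msg: ActorMessage): void {
    this.mailbox.push(msg);
    if (!this.processing) this.process();
  }

  private process(): void {
    this.processing = true;
    while (this.mailbox.length > 0) {
      const msg = this.mailbox.shift()!;
      this.behavior(msg, (to, m) => to.send(m));
    }
    this.processing = false;
  }
}

// Wiring a tiny network: a "tradesman" actor notifies a "project" actor.
const project = new Actor((msg) => console.log("project received", msg.kind));
const tradesman = new Actor((msg, send) => {
  if (msg.kind === "assign") send(project, { kind: "joined", payload: msg.payload });
});
tradesman.send({ kind: "assign", payload: { projectId: 42 } });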
As with everything in life, implementing this pattern comes with a cost. Not every
organization can justify using the pattern or even having a message bus. The cost
will almost always take the form of additional system complexity and moving
parts, new APIs to learn, deployment and security issues, intricate failure scenarios,
and more. The upside is an inherently decoupled system geared toward require-
ments churn, extensibility, and reuse. In general, you should use this pattern when
you can invest in a platform and have the backing of your organization both top-
down and bottom-up. In many cases, a simpler design in which the Clients just
queue up calls to the Managers would be a better fit for the development team.
Always calibrate the architecture to the capability and maturity of the developers
and management. After all, it is a lot easier to morph the architecture than it is
to bend the organization. Once the organizational capabilities have matured, you
can incorporate a full Message Is the Application pattern.
WORKFLOW MANAGER
With The Method, the Managers encapsulate the volatility in the business work-
flows. Nothing prevents you from simply coding the workflows in the Managers
and then, when the workflows change, changing the code in the Managers. The
problem with this approach is that the volatility in the workflows may exceed the
developers’ ability, as measured in time and effort, to catch up using just code.
To add or change a feature, you simply add or change the workflows of the
Managers involved, but not necessarily the implementation of the individual services.
The real necessity for using a workflow Manager arises when the system must han-
dle high volatility. With a workflow Manager, you merely edit the required behavior
and deploy the newly generated code. The nature of this editing is specific to the
workflow tool you choose. For example, some tools use script editors, whereas oth-
ers use visual workflows that look like activity diagrams and generate or even deploy
the workflow code.
You can even (with the right safeguards) have the product owners or the end users
edit the required behavior. This drastically reduces the cycle time for delivering
features and allows the software development team to focus on the core services
as opposed to chasing changes in the requirements.
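As a TypeScript sketch of the idea, the Manager below resolves a workflow definition by name from a store and executes it step by step, so changing the required behavior means editing the stored definition rather than the Manager. Real workflow tools are far richer (visual editors, code generation, persistence); the names and structures here are assumptions for illustration only.

// A sketch of a workflow Manager: load a workflow definition by name, then execute it.
type WorkflowStep = { service: string; operation: string };
type WorkflowDefinition = { name: string; steps: WorkflowStep[] };

interface IWorkflowStore {
  load(name: string): Promise<WorkflowDefinition>;
}

interface IServiceInvoker {
  invoke(service: string, operation: string, context: Record<string, unknown>): Promise<void>;
}

class MembershipManager {
  constructor(
    private readonly workflows: IWorkflowStore,
    private readonly invoker: IServiceInvoker
  ) {}

  async execute(workflowName: string, context: Record<string, unknown>): Promise<void> {
    const definition = await this.workflows.load(workflowName);
    for (const step of definition.steps) {
      // Each step delegates to an Engine or ResourceAccess service.
      await this.invoker.invoke(step.service, step.operation, context);
    }
  }
}

// A stored "add tradesman" workflow might look like this; editing it changes the
// behavior without touching the Manager's code.
const addTradesman: WorkflowDefinition = {
  name: "add-tradesman",
  steps: [
    { service: "RegulationEngine", operation: "verify" },
    { service: "MembersAccess", operation: "add" },
    { service: "MessageBus", operation: "postCompleted" },
  ],
};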
The business needs for TradeMe justified the use of this pattern because the objec-
tive of a quick turnaround for features is impossible to meet using hand-crafted
coding by a small, thinly spread team. Use of a workflow Manager enables a high
degree of customization across markets, satisfying another objective for the system.
DESIGN VALIDATION
You must know before work commences whether the design can support the
required behaviors. As Chapter 4 explains, to validate your design, you need to
show that the design can support the core use cases by integrating the various
areas of volatility encapsulated in your services. You validate the design by show-
ing the respective call chain or sequence diagram for each use case. You may
require more than one diagram to complete a use case.
It is important to demonstrate that your design is valid not just to yourself, but
also to others. If you cannot validate your architecture, or if the validation is too
ambiguous, you need to go back to the drawing board.
Figure 5-17 shows that the execution of the Add Tradesman use case requires interaction between
a Client application and the membership subsystem. This is evident in the actual
call chains of Figure 5-18 (the Adding Contractor use case is identical but with the
contractor’s application, the Contractors Portal). Following the operational
concepts of TradeMe, in Figure 5-18 the Client application (in this case, either the
Tradesman Portal when the member is applying directly or the Marketplace
App when the back-end rep is adding the member) posts the request to the
Message Bus.
Figure 5-17 The Add Tradesman/Contractor use case with swim lanes
Figure 5-18 Call chains for the Add Tradesman/Contractor use case
The Membership Manager receives the request from the Message Bus, loads the corresponding workflow, and starts workflow execution. Once the workflow has finished executing the request, the
Membership Manager posts a message back into the Message Bus indicating
the new state of the workflow, such as its completion, or perhaps indicating that
some other Manager can start its processing now that the workflow is in a new
state. Clients can monitor the Message Bus as well and update the users about
their requests. The Membership Manager consults the Regulation Engine
that is verifying the tradesman or contractor, adds the tradesman or contractor to
the Members store, and updates the Clients via the Message Bus.
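The call chain just described could look roughly like the following TypeScript handler, with the component names taken from the text and everything else (topics, signatures) assumed purely for the sake of illustration.

// A sketch of the Add Tradesman call chain: handle the posted request, consult the
// Regulation Engine, add the member, and post the new workflow state back to the bus.
interface AddTradesmanRequest { name: string; trade: string; country: string; }

class MembershipManagerHandler {
  constructor(
    private readonly bus: { publish(topic: string, body: unknown): void },
    private readonly regulationEngine: { verify(request: AddTradesmanRequest): Promise<boolean> },
    private readonly membersAccess: { add(request: AddTradesmanRequest): Promise<string> }
  ) {}

  // Invoked when the Tradesman Portal or Marketplace App posts a request to the bus.
  async onAddTradesman(request: AddTradesmanRequest): Promise<void> {
    const valid = await this.regulationEngine.verify(request);   // consult the Regulation Engine
    if (!valid) {
      this.bus.publish("membership/add-rejected", { request });  // Clients monitoring the bus see the outcome
      return;
    }
    const memberId = await this.membersAccess.add(request);      // add to the Members store
    this.bus.publish("membership/add-completed", { memberId });  // new workflow state, may trigger other Managers
  }
}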
Figure 5-19 The Request Tradesman use case with swim lanes
The call chains are depicted in Figure 5-20. Clients such as the Contractors
Portal or the internal user of the Marketplace App post a message to the
bus requesting a tradesman. The Market Manager receives that message. The
Market Manager loads the workflow corresponding to this request, and per-
forms actions such as consulting with the Regulation Engine about what may
be valid for this request or updating the project with the request for a tradesman.
The Market Manager can then post back to the Message Bus that someone is
Figure 5-20 Call chains for the Request Tradesman use case
requesting a tradesman. This will trigger the matching and assignment workflows,
all separated on the timeline.
Once you realize that regulations and search are both elements of the market, you
can refactor the activity diagram to that shown in Figure 5-22. This enables easy
mapping to your subsystems design.
Figure 5-23 depicts the corresponding call chain. Again, this call chain is symmet-
rical with other call chains, in the sense that the first action is to load the appro-
priate workflow and execute it. The last call of the call chain to the Message Bus
and to the Membership Manager triggers the Assign Tradesman use case.
Figure 5-21 The Match Tradesman use case with swim lanes
Figure 5-22 Refactored swim lanes for the Match Tradesman use case
Figure 5-23 Call chains for the Match Tradesman use case
Notice the composability of this design. For example, suppose the company really
does need to handle acute volatility in analyzing the project’s needs. The call
chain for finding a match allows for separating search from analysis. You would
add an Analysis Engine to encapsulate the separate set of analysis algorithms.
The business can even leverage TradeMe for some business intelligence to answer
questions like “Could we have done things better?” For example, a call chain sim-
ilar to Figure 5-23 could be used for the much more involved scenario of “Analyze
all projects between 2016 and 2019” and the design of the components would not
have to change at all. The number of these use cases is likely open, and that is the
whole point: You have an open-ended design that can be extended to implement
any of these future scenarios, a true composable design.
Figure 5-25 Unified swim lanes of the Assign Tradesman use case
As with all previous call chains, Figure 5-26 shows how the Membership Manager
is executing the workflow that ultimately leads to assigning the tradesman to the
project. This is a collaborative work between the Membership Manager and
the Market Manager, with each managing its respective subsystem. Note that
the Membership Manager is unaware of the Market Manager—it just posts
a message to the bus. The Market Manager receives that message and updates
the project according to its internal workflow. The Market Manager may, in
turn, post another message to the Message Bus to trigger another use case, such
as issuing a report on the project, or billing for the contractor, or pretty much
anything. This is what the Message Is the Application design pattern is all about:
The logical “assignment” message weaves its way between the services, triggering
local behaviors as it goes. The Client can also monitor the Message Bus and may
advise the user that the assignment is in progress.
Figure 5-26 Call chains for the Assign Tradesman use case
Figure 5-27 shows the call chain for terminating a tradesman. The Market Manager
initiates the termination workflow and notifies the Membership Manager of the
termination.
Figure 5-27 Call chains for the Terminate Tradesman use case
Any error condition or deviation from the “happy path” would add a dashed gray
arrow from the Membership Manager back to the Message Bus and ultimately
back to the client. Figure 5-28 is a sequence diagram demonstrating this interac-
tion, without the calls between the ResourceAccess services and the Resources.
Finally, the call chain diagram in Figure 5-27 (or the sequence diagram of Figure
5-28) assumes the termination use case is triggered when a project is completed,
and the contractor terminates the assigned tradesmen. But it can also be trig-
gered by the tradesman posting a message from the Tradesman Portal to the
Membership Manager, which would cause the call chain to flow in the opposite
direction (Membership Manager to Market Manager and on to the Client
apps). Again, this is a testimony to the versatility of the design.
Figure 5-28 Sequence diagram for the Terminate Tradesman use case (participants: Contractors Portal, Message Bus, Market Manager, Workflow Access, Project Access, Membership Manager, Workflow Access, Regulations Engine, Regulations Access, Members Access)
Figure 5-29 Call chains for the Pay Tradesman use case
Unlike the previous call chains, the payment is triggered by a scheduler or timer
that the customer already has in service. The scheduler is decoupled from the
actual components and has no knowledge of the system internals: All it does is
post a message to the bus. The actual payment is made by PaymentAccess
when updating the Payments store and accessing an external payment system, a
Resource to TradeMe.
Figure 5-30 Call chains for the Create Project use case
Figure 5-31 Call chains for the Close Project use case
WHAT’S NEXT?
This lengthy system design case study concludes the first part of this book. Having
the system design in hand is just the first ingredient of success. Next comes project
design. You should strike while the iron is hot: Always follow system design with
project design, ideally back-to-back, as a continuous design effort.
Part II
PROJECT DESIGN
6
MOTIVATION
Much as you design the software system, you must design the project to build the
system. This includes accurately calculating the planned duration and cost, devis-
ing several good execution options, scheduling resources, and even validating your
plan to ensure it is sensible and feasible. Project design requires understanding the
dependencies between services and activities, the critical path of integration, the
staff distribution, and the risks involved. All of these challenges stem from your
system design, and addressing them properly is an engineering task. As such, it is
up to you, the software architect, as the engineer in charge, to design the project.
You should think of project design as a continuation of the system design effort.
Combining system design and project design yields a nonlinear effect that drasti-
cally improves the likelihood of success for the project. It is also important to note
that project design is not part of project management. Instead, project design is to
project management what architecture is to programming.
The second part of this book is all about project design. The following chapters
present conventional ideas along with my original, battle-proven techniques and
methodologies, covering the core body of knowledge of modern software project
design. This chapter provides the background and essential motivation for pro-
ject design.
When you design a project you must provide management with several viable
options trading schedule, cost, and risk, allowing management and other decision
makers to choose up front the solution that best fits their needs and expectations.
Providing options in project design is the key to success. Finding a balanced solu-
tion and even an optimal solution is a highly engineered design task. I say it is
“engineered” not just because of the design and calculations involved, but because
engineering is all about tradeoffs and accommodating reality.
Adding to the challenge of project design is the reality that no single correct solu-
tion exists even for the same set of constraints, much as there are always several
possible design approaches for any system. Projects designed to meet an aggres-
sive schedule will cost more and be far more risky and complex than projects
designed to reduce cost and minimize risk. There is no “THE Project”; there are
only options. Your task is to narrow this spectrum of near-countless possibilities
to several good project design options, such as the following:
The following chapters will show you how to identify good project design options.
If you do not provide these options, you will have no one to blame but yourself for
conflicts with management. How often do you labor on the design of the system
and then present it to management, only to have managers ordain, “You have a
year and four developers.” Any correlation between that year, the four develop-
ers, and what it really takes to deliver the system is accidental—and so are your
chances of success. However, if you present the same architecture, accompanied
by three to four project design options, all of them doable, but reflecting different
tradeoffs of schedule, cost, and risk, a completely different dynamic will rule the
meeting. Now the discussion will revolve around which of these options to choose.
You must provide an environment in which managers can make good decisions.
The key is to provide them with only good options. Whichever option they do
choose will then be a good decision.
to represent all activities, and to recognize several options for building the system.
It allows the organization to determine whether it even wants to get the project
done. After all, if the true cost and duration will exceed the acceptable limits, why
start the work in the first place, only to have the project canceled once you run out
of money or time?
Once project design is in place, you eliminate the commonplace gambling with
costs, development death marches, wishful thinking about project success, and
horrendously expensive trials and errors. After work commences, a well-designed
project also lays the foundation for decision makers to evaluate and think through
the effect of a proposed change on the schedule and the budget.
ASSEMBLY INSTRUCTIONS
Project design involves much more than just proper decision making. Project design
also serves as the system assembly instructions. To use an analogy, would you buy
a well-designed IKEA furniture set without the assembly instructions booklet?
Regardless of how comfortable or convenient the item is, you would recoil at the
mere thought of trying to guess where each of the dozens of pins, bolts, screws,
and plates go, and in which order.
Your software system is significantly more complex than furniture, yet often archi-
tects presume developers and project managers can just go about assembling the
system, figuring it out as they go along. This ad hoc approach is clearly not the
most efficient way of assembling the system. As you will see in the next chapters,
project design changes the situation, since the only way to know how long it will
take and how much it will cost to deliver the system is to figure out first how you
will build it. Consequently, each project design option comes with its own set of
assembly instructions.
HIERARCHY OF NEEDS
In 1943, Abraham Maslow published a pivotal work on human behavior, known
as Maslow’s hierarchy of needs.1 Maslow ranked human needs based on their rel-
ative importance and suggested that only once a person has satisfied a lower-level
need could that person develop an interest in satisfying a higher-level need. This
hierarchical approach can describe another category of complex beings—software
projects. Figure 6-1 shows a software project’s hierarchy of needs in the shape of
a pyramid.
1. A. H. Maslow, “A Theory of Human Motivation,” Psychological Review 50, no. 4 (1943): 370–396.
The project needs can be classified into five levels: physical, safety, repeatability,
engineering, and technology.
1. Physical. This is the lowest level in the project pyramid of needs, dealing with
physical survival. Much as a person must have air, food, water, clothing, and
shelter, a project must have a workplace (even a virtual one) and a viable busi-
ness model. The project must have computers to write and test the code, as
well as people assigned to perform these tasks. The project must have the right
legal protection. The project must not infringe on existing external intellec-
tual property (IP), yet must also protect its own IP.
2. Safety. Once the physical needs are satisfied, the project must have adequate
funding (often in the form of allocated resources) and enough time to com-
plete the work. The work itself must be performed with acceptable risk—not
too safe (because low-risk projects are likely not worth doing) and not too
risky (because high-risk projects will likely fail). In short, the project must be
reasonably safe. Project design operates at this level in the pyramid of needs.
In a hierarchy of needs, higher-level needs serve the lower-level needs. For example,
according to Maslow, food is a lower-level need than employment, yet most people
work so that they can eat, as opposed to eat so that they can work. Similarly, the
technology serves the engineering needs (such as the architecture), and the engi-
neering needs serve the safety needs (those that project design provides). It also
means that chronologically you have to design the system first; only then can you
design the project to build it.
You can validate the pyramid by listing all necessary ingredients for success of
a typical software project. You then prioritize, sort, and finally group them into
categories of needs.
As an experiment in this process, consider the following two projects. The first has
a tightly coupled design, a high cost of maintenance, and a low level of reuse, and
is difficult to extend. However, there is adequate time to perform the work, and the
project is properly staffed. The second project has an amazing architecture that is
modular, extensible, and reusable; addresses all the requirements; and is future-
proof. However, the team is understaffed, and even if the people were available,
there is not enough time to safely develop the system. Ask yourself: Which project
would you like to be part of?
The answer is unequivocally the first project. Consequently, project design must
rank lower (i.e., be more foundational) in the pyramid of needs than architecture.
A classic reason for failure for many software projects is an inverted pyramid
of needs. Imagine Figure 6-1 turned on its head. The development team focuses
almost exclusively on technology, frameworks, libraries, and platforms; spends
next to nothing on architecture and design; and completely ignores the fundamen-
tal issues of time, cost, and risk. This makes the pyramid of needs unstable, and it
is small wonder that such projects fail. By investing in the safety level of the pyra-
mid using the tools of project design, you stratify the needs of the project, provide
the foundation that stabilizes the upper levels, and drive the project to success.
7
PROJECT DESIGN OVERVIEW
The following overview describes the basic methodology and techniques you apply
when designing a project. A good project design includes your staffing plan, scope
and effort estimations, the services’ construction and integration plan, the detailed
schedule of activities, cost calculation, viability and validation of the plan, and
setup for execution and tracking.
This chapter covers most of these concepts of project design, while leaving certain
details and one or two crucial concepts for later chapters. However, even though
it serves as a mere overview, this chapter contains all the essential elements of
success in designing and delivering software projects. It also provides the develop-
ment process motivation for the design activities, while the rest of the chapters are
more technical in nature.
DEFINING SUCCESS
Before you continue reading, you must understand that project design is about suc-
cess and what it takes to succeed. The software industry at large has had such a poor
track record that the industry has changed its very definition of success: Success today
is defined as anything that does not bankrupt the company right now. With such a
low bar, literally anything goes and nothing matters, from low quality to deceiving
numbers and frustrated customers. My definition of success is different, though it is
also a low bar in its own way. I define success as meeting your commitments.
If you call for a year for the project and $1 million in costs, I expect the project to
take one year, not two, and for the project to cost $1 million, not $3 million. In the
software industry, many people lack the skills and training it takes to meet even this
low bar for success. The ideas presented in this chapter are all about accomplish-
ing just that.
A higher bar is to deliver the project the fastest, least costly, and safest way. Such
a higher bar requires the techniques described in the following chapters. You can
raise the bar even further and call for having the system architecture remain good
for decades and be maintainable, reusable, extensible, and secure across its entire
long and prosperous life. That would inevitably require the design ideas of the first
part of this book. Since, in general, you need to walk before you can run, it is best
to start with the basic level of success and work your way up.
REPORTING SUCCESS
Part 1 of this book stated a universal design rule: Features are always and every-
where aspects of integration, not implementation. As such, there are no features
in any of the early services. At some point you will have integrated enough to start
seeing features. I call that point the system. The system is unlikely to appear at
the very end of the project since there may be some additional concluding activi-
ties such as system testing and deployment. The system typically appears toward
the end because it requires most of the services as well as the clients. When using
The Method, this means only once you have integrated inside the Managers, the
Engines, the ResourceAccess, and the Utilities can you support the behaviors that
the Clients require.
While the system is the product of the integration, not all the integration happens
inside the Managers. Some integration happens before the Managers are com-
plete (such as the Engines integrating ResourceAccess) and some integration hap-
pens after the Managers (such as between the Clients and the Managers). There
might also be explicit integration activities, such as developing a client of a service
against a simulator and then integrating the client with the real service.
The problem with the system appearing only toward the end of the project is
pushback from management. Most people tasked with managing software devel-
opment do not understand the design concepts in this book and simply want
features. They would never stop to think that if a feature can appear early and
quickly, then it does not add much value for the business or the customers because
the company or the team did not spend much effort on the feature. Usually, man-
agement uses features as the metric to gauge progress and success, and tends to
cancel sick projects that do not show progress. As such, the project faces a serious
risk: It could be perfectly on schedule but because the system only appears at the
end, if the project bases its progress report on features, it is asking to be canceled.
The solution is simple:
Most managers will recoil both at spending some three or four months on design
and at skipping design entirely. They may wish to accelerate the design effort by
having more architects participate. However, requirements analysis and archi-
tecture are contemplative, time-consuming activities. Assigning more architects
to these activities does not expedite them at all, but instead will make matters
worse. Architects are typically senior self-confident personnel, used to working
independently. Assigning multiple architects only results in them contesting with
each other, rather than in system and project design blueprints.
One way of resolving the multiple architects conflict is to appoint a design commit-
tee. Unfortunately, the surest way of killing anything is to appoint a committee to
oversee it. Another option is to carve up the system and assign each architect a spe-
cific area to design. With this option, the system is likely to end as a Chimera—a
mythological Greek beast that has the head of a lion, the wings of a dragon, the
front legs of an ox, and the hind legs of a goat. While each part of the Chimera
is well designed and even highly optimized, the Chimera is inferior at anything it
attempts: It does not fly as well as a dragon, run as fast as a lion, pull as much as an
ox, or climb as well as a goat. The Chimera lacks design integrity—and the same
is true when multiple architects design the system, each responsible for their part.
A single architect is absolutely crucial for design integrity. You can extend this
observation to the general rule that the only way to allow for design integrity is
to have a single architect own the design. The opposite is also true: If no single
person owns the design and can visualize it cover-to-cover, the system will not
have design integrity.
Additionally, with multiple architects no one owns the in-betweens, the cross-
subsystem or even cross-services design aspects. As a result, no one is accountable
for the system design as a whole. When no one is accountable for something, it
never gets done, or at best is done poorly.
With a single architect in charge, that architect is accountable for the system
design. Ultimately, being accountable is the only way to earn the respect and trust
of management. Respect always emerges out of accountability. When no one is
accountable, as is the case with a group of architects, management intuitively has
nothing but scorn for the architects and their design effort.
Caution A single architect in charge does not mean the architect’s work
is exempt from review by other professional architects. Being accountable
for the design does not imply working in isolation or avoiding constructive
criticism. The architect should seek out such reviews to verify that the de-
sign is adequate.
Junior Architects
Most software projects need only a single architect. This is true regardless of the
project size and is essential for success. However, large projects very easily saturate
the architect with various responsibilities, preventing the architect from focusing
on the key goal of designing the system and keeping the design from drifting
away during development. Additionally, the role of architect involves technical
leadership, requirements review, design review, code review for each service in the
system, design documents updates, discussion of feature requests from marketing,
and so on.
Management can address this overload by assigning a junior architect (or more
than one) to the project. The architect can offload many secondary tasks to the
junior architect, allowing the architect to focus on the design of the system and the
project at the beginning and on keeping the system true to its design throughout
the project. The architect and the junior architect are unlikely to compete because
there is no doubt who is in charge, and there are clearly delineated lines of respon-
sibilities. Having junior architects is also a great way of grooming and mentoring
the next generation of architects for the organization.
Most organizations and teams have these roles, but the job titles they use may be
different. I define these roles as follows:
• The project manager. The job of the project manager is to shield the team from
the organization. Most organizations, even small ones, create too much noise.
If that noise makes its way into the development team, it can paralyze the team.
A good project manager is like a firewall, blocking the noise, allowing only
sanctioned communication through. The project manager tracks progress and
reports status to management and other project managers, negotiates terms,
and deals with cross-organization constraints. Internally, the project manager
assigns work items to developers, schedules activities, and keeps the project on
schedule, on budget, and on quality. No one in the organization other than the
project manager should assign work activity or ask for status from developers.
• The product manager. The product manager should encapsulate the custom-
ers. Customers are also a constant source of noise. The product manager acts
as a proxy for the customers. For example, when the architect needs to clar-
ify the required behaviors, the architect should not chase customers; instead,
the product manager should provide the answers. The product manager also
resolves conflicts between customers (often expressed as mutually exclusive
requirements), negotiates requirements, defines priorities, and communicates
expectations about what is feasible and on what terms.
• The architect. The architect is the technical manager, acting as the design lead,
the process lead, and the technical lead of the project. The architect not only
designs the system, but also sees it through development. The architect needs to
work with the product manager to produce the system design and with the proj-
ect manager to produce the project design. While the collaboration with both
the product manager and the project manager is essential, the architect is held
responsible for both of these design efforts. As a process lead, the architect has
to ensure the team builds the system incrementally, following the system and the
project design with a relentless commitment to quality. As a technical lead, the
architect often has to decide on the best way of accomplishing technical tasks
(the what-to-do) while leaving the details (the how-to-do) to developers. This
requires continuous hands-on mentoring, training, and reviews.
Perhaps the most glaring omission from this definition of the core team is the
developers. Developers (and testers) are transient resources that come and go across
projects—a very important point that this chapter revisits as part of the discussion
of scheduling activities and resource assignment.
Unlike developers, the core team stays throughout the project since the project
needs all three roles from beginning to end. However, what these roles do in the
project changes over time. For example, the project manager shifts from nego-
tiating with stakeholders to providing status reports, and the product manager
shifts from gathering requirements to performing demos. The architect shifts from
designing the system and the project to providing ongoing technical and process
leadership, such as conducting design and code reviews at the service level and
resolving technical conflicts.
architecture is merely a means to an end: project design. Since the architect needs
to work with the product manager on the architecture and with the project man-
ager on the project design, the project requires the core team at the beginning of
the project.
Software projects are never constraint-free. All projects face some constraints on
time, scope, effort, resources, technology, legacy, business context, and so on.
These constraints can be explicit or implicit. It is vital to invest the time in both ver-
ifying the explicit constraints and discovering the implicit constraints. Designing
a system and project that violates a constraint is a recipe for failure. From my
experience, a software project should spend roughly between 15% and 25% of the
entire duration of the project in the front end, depending on constraints.
EDUCATED DECISIONS
It is pointless to approve a project without knowing its true schedule, cost, and
risk. After all, you would not buy a house without knowing how much it costs.
You would not buy a house that you can afford up front but whose upkeep and
taxes you cannot pay. In any walk of life, it is obvious that you commit time and
capital only after the scope is known. Many software projects recklessly proceed
with no idea of the real time and cost required.
1. https://en.wikipedia.org/wiki/Front_end_innovation
staffing a project before the commitment is made has a tendency to force the proj-
ect ahead regardless of affordability. If the right thing to do is to avoid doing the
project in the first place, the organization will only be wasting good money. A rush
to commit the resources will almost always be accompanied by a poor functional
design and no plan at all—hardly the ingredients of success.
The key to success is to make educated decisions, based on sound design and scope
calculations. Wishful thinking is not a strategy, and intuition is not knowledge,
especially when dealing with complex software systems.
Note The inability to make educated decisions about time and cost is
a constant source of frustration for business stakeholders when working
with software teams. The business people in charge simply want to know
the cost involved and when they can realize the value of the effort. Avoid-
ing these questions eventually creates tension, suspicion, and animosity
between the team and management. Business-side people are used to plan-
ning and budgeting. Their fair expectation of software professionals is that
they should have the expertise to do the same.
For example, consider a 10-man-year project—that is, a project where the sum of
effort across all activities is 10 man-years. Suppose management asks for the least
costly way of building the system. Such a project would have one person work-
ing for 10 years, but management is unlikely to be willing to wait 10 years. Now
suppose that management asks for the quickest possible way to build the system.
Imagine it is possible to build the same system by engaging 3650 people for 1 day
(or even 365 people for 10 days). Management is unlikely to hire so many people
for such short durations. Similarly, management will never ask for the safest way
of building the system (because anything worth doing requires risk, and safe proj-
ects are not worth doing) or knowingly go for the riskiest way of doing the project.
Once the desired option is identified, management must literally sign off on the
software development plan (SDP) document. This document now becomes your project’s life insurance policy
because, as long as you do not deviate from the plan’s parameters, there is no
reason to cancel your project. This does require proper tracking (as described in
Appendix A) and project management.
If no option is palatable, then you need to drive the right decision—in this case,
killing the project. A doomed project, a project that from inception did not receive
adequate time and resources, will do no one any good. The project will eventually
run out of time or money or both, and the organization will have wasted not just
the funds and time but the opportunity cost of devoting these resources to another
doable project. It is also detrimental to the careers of the core team members to
be on a project that never has a chance. Since you have only a few years to make
your mark and move ahead, every project must count and be a feather in your cap.
Spending a year or two on a sideways move that failed will limit your career pros-
pects. Killing such a project before development starts is beneficial for all involved.
to finish one service and move to the next. However, you should never see a devel-
oper working on more than one service at a time or more than one developer
working concurrently on the same service. Any other way of assigning services to
developers will result in failure. Examples of the poor assignment options include:
• Multiple developers per service. The motivation for assigning two (or more)
developers to one service is not a surplus of developers, but rather the desire to
complete the work sooner. However, two people cannot really work on the same
thing at the same time, so some subscheme must be used:
- Serialization. The developers could work serially so that only one of them
is working on the service at a time. This takes longer due to the context
switch overhead—that is, the need to figure out what happened with the ser-
vice since the current developer looked at it last. This defeats the purpose of
assigning the two developers in the first place.
- Parallelization. The developers could work in parallel and then integrate
their work. This scheme will take much longer than just having a single devel-
oper working on the service. For example, suppose a service estimated as one
month of effort is assigned to two developers who will work in parallel. One
might be tempted to assume that the work will be complete after two weeks,
but that is a false assumption. First, not all units of work can be split this
way. Second, the developers would have to allocate at least another week to
integrate their work. This integration is not at all guaranteed to succeed if the
developers worked in parallel and did not collaborate during development.
Even if the integration is possible, it would void all the testing effort that
went into each part due to the integration changes. Testing the service as a
whole also would require additional time. In all, the effort will take at least
a month (and likely more). Meanwhile, other developers who are working on
dependent services and expect the service to be ready after two weeks will
be further delayed.
• Multiple services per developer. The option of assigning two (or more) services
to a single developer is just as bad. Suppose two services, A and B, each esti-
mated as a month of work, are assigned to a single developer, with the devel-
oper expected to finish both after a single month. Since the sum of work is two
months, not only will the services be incomplete after one month, but finishing
them will take much longer. While the developer is working on the A service,
the developer is not working on the B service, causing the developers dependent
on the B service to demand that the developer work on the B service. The devel-
oper might switch to the B service, but then those dependent on the A service
would demand some attention. All this switching back and forth drastically
reduces the developer’s efficiency, prolonging the duration to much more than
two months. In the end, perhaps after three or four months, the A and B services
may be complete.
Either assigning more than one developer per service or assigning multiple services
per developer causes a mushroom cloud of delays to propagate throughout the
project, mostly due to delayed dependencies affecting other developers. This, in
turn, makes accurate estimations very difficult. The only option that has any sem-
blance of accountability and a chance of meeting the estimation is a 1:1 assignment
of services to developers.
Figure 7-1 The system’s design is the team’s design. (Images: Sapann Design/
Shutterstock)
The relationship between the services, their interactions and communication, dic-
tates the relationships and interactions between the developers. When using 1:1
assignment, the design of the system is the design of the team.
Next, consider Figure 7-2. While the number of services and their size have not
changed from Figure 7-1, no one could claim that Figure 7-2 shows a good design.
A good system design strives to reduce the number of interactions between the
modules to their bare minimum—the exact opposite of what happens in Figure 7-2.
A loosely coupled system design such as that in Figure 7-1 has minimized the
number of interactions to the point that removing one interaction makes the sys-
tem inoperable.
The design in Figure 7-2 is clearly tightly coupled, and it also describes the way the
team operates. Compare the teams from Figure 7-1 and Figure 7-2. Which team
would you rather join? The team in Figure 7-2 is a high-stress, fragile team. The
team members are likely territorial and resist change because every change has
ripple effects that disrupt their work and the work of everybody else. They spend
an inordinate amount of time in meetings to resolve their issues. In contrast, the
team in Figure 7-1 can address issues locally and contain them. Each team mem-
ber is almost independent from the others and does not need to spend much time
coordinating work. Simply put, the team in Figure 7-1 is far more efficient than
the team in Figure 7-2. As a result, the team with the better system design has far
better prospects of meeting an aggressive deadline.
This last observation is paramount: Most managers just pay lip service to sys-
tem design because the benefits of architecture (maintainability, extensibility, and
reusability) are down-the-road benefits. Future benefits do not help a manager
who is facing the harsh reality of scant resources and a tight schedule. If anything,
it behooves the manager to reduce the scope of work as much as possible to meet
the deadline. Since system design is supposedly not helping with the current objec-
tives, the manager will throw overboard any meaningful investment in design.
Sadly, by doing so, the manager loses all chance of meeting the commitments,
because the only way to meet an aggressive deadline is with a world-class design
that yields the most efficient team. When striving to get management support for
your design effort, show how design helps with the immediate objective. The long-
term benefits will flow out of that.
TASK CONTINUITY
When assigning services (or activities such as UI development), try to maintain
task continuity, a logical continuation between tasks assigned to each person.
Often, such task assignments follow the service dependency graph. If service
A depends on service B, then assign A to the developer of B. One advantage is
that the A developer who is already familiar with B needs less ramp-up time. An
important, yet often overlooked advantage of maintaining task continuity is that
the project and the developer’s win criteria are aligned. The developer is moti-
vated to do an adequate job on B to avoid suffering when it is time to do A. Perfect
task continuity is hardly ever possible, but it should be the goal.
Finally, take the developers’ personal technical proclivities into account when mak-
ing assignments. For example, it will likely not work well to have the security expert
design the UI, to have the database expert implement the business logic, or to have
junior developers implement utilities such as the message bus or diagnostics.
EFFORT ESTIMATIONS
Effort estimation is how you try to answer the question of how long something will
take. There are two types of estimations: individual activity estimation (estimating
the effort for an activity assigned to a resource) and overall project estimation. The
two types of estimations are unrelated, because the overall duration of the project is
not the sum of effort across all activities divided by the number of resources. This is due
to the inherent inefficiency in utilizing people, the internal dependencies between
activities, and any risk mitigation you may need to put in place.
Several factors account for the industry's poor track record with estimations:
1. Uncertainty in how long activities take, and even uncertainty in the list of
activities, is the primary reason for poor accuracy of estimations. Do not con-
fuse cause and effect: The uncertainty is the cause, and poor estimation accu-
racy is the result. You must proactively reduce the uncertainty, as described
later in this chapter.
2. Few people in software development are trained in simple and effective esti-
mation techniques. Most are left to rely on bias, guesswork, and intuition.
3. Many people overestimate or underestimate in an attempt to compensate for
the uncertainty, which results in far worse outcomes.
4. Most people tend to look at just the tip of the iceberg when listing activities.
Naturally, if you omit activities that are essential to success, your estimations
will be off. This is true both when omitting activities across the project and
when omitting internal phases inside activities. For example, estimators may
list just the coding activities or, inside coding activities, account for coding but
not design or testing.
CLASSIC MISTAKES
As just mentioned, people tend to overestimate and underestimate in an attempt
to compensate for uncertainty. Both of these are deadly when it comes to project
success.
Overestimation never works because of Parkinson's law.2 For example, if you give
a developer three weeks to perform a two-week activity, the developer will simply
not work on it for two weeks and then be idle for a week. Instead, the developer
will work on the activity for three weeks. Since the actual work consumed only
two of those three weeks, in the extra week the developer will engage in gold
plating—adding bells and whistles, aspects, and capabilities that no one needs
or wants, and that were not part of the design. This gold plating significantly
increases the complexity of the task, and the increased complexity drastically
reduces the probability of success. Consequently, the developer labors for four or
six weeks to finish the original task. Other developers in the project, who expect
to receive the code after three weeks, are now delayed, too. Furthermore, the team
now owns, perhaps for years and across multiple versions, a code module that is
needlessly more complex than what it should have been in the first place.
Underestimation is just as harmful. Suppose you give the developer only two days for that same two-week activity, expecting a quick-and-dirty result. Sadly, there is no quick-and-dirty with any intricate task. Instead, the two options
are quick-and-clean and dirty-and-slow. Because the developer is missing all the
best practices in software development, from testing to detailed design to docu-
mentation, the developer is now trying to perform the task in the worst possible
way. Consequently, the developer will not work on the activity for the nominal
two weeks it could have taken, assuming the work was performed correctly, but
will work on it for four or six (or more) weeks due to the low quality and increased
complexity. As with overestimation, other developers in the project who expected
the code after the scheduled two days are much delayed. Furthermore, the team
now has to own, perhaps for years and across multiple versions, a code module
that is done the worst possible way.
Probability of Success
While these conclusions may make common sense, what many miss is the magni-
tude of these classic mistakes. Figure 7-3 plots in a qualitative manner the prob-
ability of success as a function of the estimation. For example, consider a 1-year
project. With proper architecture and project design, the project’s normal estima-
tion is 1 year, indicated by point N in Figure 7-3. What would be the probability of
success if you give this project a day? A week? A month? Clearly, with sufficiently
aggressive estimations, the probability of success is zero. How about 6 months?
While the probability of a 1-year project completing in 6 months is extremely low,
it is not zero because maybe a miracle will happen. The probability of success if
you estimate at 11 months and 3 weeks is actually very high, and it is also fairly
high for 11 months. However, it is unlikely the project can complete in 9 months.
Therefore, to the left of the normal estimation is a tipping point where the prob-
ability of success drastically improves in a nonlinear way. Similarly, this 1-year
project could last 13 months, and even 14 months is reasonable. But if you give
this project 18 or 24 months, you will surely kill it because Parkinson’s law will
kick in: Work will expand to fill the allotted time, and the project will fail due
to the increased complexity. Therefore, another tipping point exists to the right
of the normal estimation, where the probability of success again collapses in a
nonlinear way.
Figure 7-3 The probability of success as a function of the estimated duration, with N marking the normal estimation
ESTIMATION TECHNIQUES
The poor track record with estimations in the software industry persists even
though a decent set of effective estimation techniques has been available for
decades and across multiple other industries. I have yet to see a team that has
practiced estimations correctly and was also off the mark with their project design
and commitments. Instead of trying to review all of these techniques, this section
highlights some of the ideas and techniques I have found over the years to be the
most simple and effective.
Estimations also must match the tracking resolution. If the project manager tracks
the project on a weekly basis, any estimation less than a week is pointless because
it is smaller than the measurement resolution. Doing so makes as much sense as
estimating the size of your house down to the micron when using a measuring tape
for the actual measurement.
Reduce Uncertainty
Uncertainty is the leading cause of missed estimations. It is important not to con-
fuse the unknown with the uncertain. For example, while the exact day of my
demise is unknown, it is far from uncertain, and a whole industry (life insurance)
is based on the ability to estimate that date. While the estimation may not be
precise when it comes to me specifically, the life insurance industry has sufficient
customers to make it accurate enough.
When asking people to estimate, you should help them overcome their fear of
estimations. Many may have had their poor estimations used against them in the
past. You may even encounter refusal to estimate in the form of “I don’t know”
or “Estimations never work.” Such attitudes may indicate fear of entrapment, or
trying to avoid the effort of estimating, or being ignorant and inexperienced in
estimation techniques, rather than a fundamental inability to estimate. A few simple practices reduce the uncertainty and improve the estimations:
• Ask first for the order of magnitude: Is the activity more like a day, a week, a
month, or a year? With the magnitude known, narrow it down using a factor of
2 to zoom in. For example, if the answer to the first question was a month as
the type of unit, ask if it is more like two weeks, one month, two months, or
four months. The first answer rules out eight months (since that is more like a
year as an order of magnitude), and it rules out one week (since a week was not
the order of magnitude given in the first place).
• Make an explicit effort to list the areas of uncertainty in the project and focus
on estimating them. Always break down large activities into smaller, more man-
ageable activities to greatly increase the accuracy of the estimations.
• Invest in an exploratory discovery effort that will give insight into the nature of
the problem and reduce the uncertainty. Review the history of the team or the
organization, and learn from your own history how long things have taken in
the past.
PERT Estimations
One estimation technique dealing specifically with high uncertainty is part of the
Program Evaluation and Review Technique (PERT).3 For every activity, you provide
three estimations: the most optimistic, the most pessimistic, and the most likely.
The final estimation is given by this formula:

E = (O + 4 * M + P) / 6

where:

• E is the PERT estimation for the activity.
• O is the most optimistic estimation.
• M is the most likely estimation.
• P is the most pessimistic estimation.

For example, with an optimistic estimation of 10, a most likely estimation of 25, and a pessimistic estimation of 90, the result is:

E = (10 + 4 * 25 + 90) / 6 = 33.3
3. https://en.wikipedia.org/wiki/Program_evaluation_and_review_technique
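Purely as an illustrative sketch (not from the book's accompanying files), the PERT formula is trivial to automate for a list of activities; the activity names and estimation triples below are hypothetical:

```python
def pert_estimate(optimistic: float, most_likely: float, pessimistic: float) -> float:
    """Return the PERT estimation E = (O + 4M + P) / 6."""
    return (optimistic + 4 * most_likely + pessimistic) / 6

# Hypothetical activities with (optimistic, most likely, pessimistic) estimations.
activities = {
    "Resource A": (10, 25, 90),
    "Manager B": (5, 10, 20),
}

for name, (o, m, p) in activities.items():
    print(f"{name}: {pert_estimate(o, m, p):.1f}")
# Resource A: 33.3
# Manager B: 10.8
```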
Overall Project Estimation

In addition to estimating individual activities, you should produce an overall estimation for the project as a whole and compare it against the result of the detailed project design; the two should broadly agree. For example,
if the detailed project design was 13 months and the overall project estimation was
11 months, then the detailed project design is valid. But if the overall estimation
was 18 months, then at least one of these numbers is wrong, and you must investi-
gate the source of the discrepancy. You can also utilize the overall project estima-
tion when dealing with a project with very few up-front constraints. Such a clean
canvas project has a great deal of unknowns, making it difficult to design. You can
use the overall project estimation to work backward to box in certain activities as
a way of initiating the project design process.
Historical Records
With overall project estimation, your track record and history matter the most.
With even a modest degree of repeatability (see Figure 6-1), it is unlikely that you
could deliver the project faster or slower than similar projects in the organiza-
tion’s past. The dominant factor in throughput and efficiency is the organization’s
nature, its own unique fingerprint of maturity, which is something that does not
change overnight or between projects. If it took your company a year to deliver
a similar project in the past, then it will take it a year in the future. Perhaps this
project could be done in six months somewhere else, but with your company it will
take a year. There is some good news here, though: Repeatability also means the
company likely will not take two or three years to complete the project.
Estimation Tools
A great yet little-known technique for overall project estimation is leveraging proj-
ect estimation tools. These tools typically assume some nonlinear relationship
exists between size and cost, such as a power function, and use a large number
of previously analyzed projects as their training data. Some tools even use Monte
Carlo simulations to narrow down the range of the variables based on your proj-
ect attributes or historical records. I have used such tools for decades, and they
produce accurate results.
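The following sketch illustrates the general idea only; the power-function coefficients and the uniform size range are arbitrary assumptions for demonstration, not the calibration of any real estimation tool:

```python
import random

def cost_model(size_kloc: float, a: float = 3.0, b: float = 1.12) -> float:
    """Illustrative power-function cost model: cost (man-months) = a * size^b."""
    return a * size_kloc ** b

def monte_carlo_cost(size_low: float, size_high: float, runs: int = 10_000):
    """Sample the uncertain size range; return 10th, 50th, and 90th percentile costs."""
    samples = sorted(cost_model(random.uniform(size_low, size_high)) for _ in range(runs))
    return samples[runs // 10], samples[runs // 2], samples[9 * runs // 10]

p10, p50, p90 = monte_carlo_cost(20, 40)   # system size believed to be 20-40 KLOC
print(f"Cost (man-months): P10={p10:.0f}  P50={p50:.0f}  P90={p90:.0f}")
```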
Broadband Estimation
The broadband estimation is my adaptation of the Wideband Delphi4 estimation
technique. The broadband estimation uses multiple individual estimations to iden-
tify the average of the overall project estimation, then adds a band of estimations
above and below it. You use the estimations outside the band to gain insight into
the nature of the project and refine the estimations, repeating this process until the
band and the project estimations converge.
To start any broadband estimation effort, first assemble a large group of project
stakeholders, ranging from developers to testers, managers, and even support peo-
ple—diversity of the group is key with the broadband technique. Strive for a mix
of newcomers and veterans, devil’s advocates, experts and generalists, creative
people, and worker bees. You want to tap into the group’s synergy of knowl-
edge, intelligence, experience, intuition, and risk assessment. A good group size
is between 12 and 30 people. Using fewer than 12 participants is possible, but the
statistical element may not be strong enough to produce good results. With more
than 30 participants, it is difficult to finish the estimation in a single meeting.
Begin the meeting by briefly describing the current state and phase of the proj-
ect, what you have already accomplished (such as architecture), and additional
contextual information (such as the system’s operational concepts) that may not
be known to stakeholders who were not part of the core team. Each participant
needs to estimate two numbers for the project: how long will it take in months
and how many people it will require. Have the estimators write these numbers,
along with their name, on a note. Collect the notes, enter them in a spreadsheet,
and calculate both the average and the standard deviation for each value. Now,
identify the estimations (both in time and people) that were at least one standard
deviation removed from the average—that is, those values outside the broadband
of consensus (hence the name of the technique). These are the outliers.
Instead of culling the outliers from the analysis (the common practice in most
statistical methods), solicit input from those who produced them—because they
may know something that the others do not. This is a great way of identifying the
uncertainties. Once the outliers have voiced their reasoning for the estimation and
all have heard it, you conduct another round of estimations. You repeat this pro-
cess until all estimations fall within one standard deviation, or the deviation is less
than your measurement resolution (such as one person or one month). Broadband
estimation typically converges this way by the third round.
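As a minimal sketch of the mechanics of a single round (the participants and their estimates are hypothetical), a spreadsheet or a few lines of code can compute the average, the standard deviation, and the outliers:

```python
from statistics import mean, stdev

def outliers(estimates):
    """Return the estimates lying more than one standard deviation from the average."""
    avg, sd = mean(estimates.values()), stdev(estimates.values())
    return {name: est for name, est in estimates.items() if abs(est - avg) > sd}

# Hypothetical first-round duration estimates, in months, by participant.
round_one = {"Dev 1": 9, "Dev 2": 10, "Tester 1": 11, "Support 1": 18,
             "Architect": 10, "Project Manager": 8}

print(f"Average: {mean(round_one.values()):.1f} months")
print("Ask these estimators for their reasoning:", outliers(round_one))
# Ask these estimators for their reasoning: {'Support 1': 18}
```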
A Word of Caution
Overall project estimation, whether done by using historical records, estimation
tools, or the broadband method, tends to be accurate, if not highly accurate. You
should compare the various overall estimations to ensure that you do, indeed, have a consistent overall estimation.
ACTIVITY ESTIMATIONS
You start the project design with the estimated duration of the individual activ-
ities in the project. Before you estimate individual activities, you must prepare a
meticulous list of all activities in the project, both coding and noncoding activities
alike. In a way, even that list of activities is an estimation of the actual set of activ-
ities, so the same rationale about reducing uncertainties holds true here. Avoid
the temptation to focus on the structural coding activities indicated by the system
architecture, and actively look below the waterline at the full extent of the iceberg.
Invest time in looking for activities, and ask other people to compile that list so
you could compare it with your own list. Have colleagues review, critique, and
challenge your list of activities. You may be surprised by what you actually missed.
CRITICAL PATH ANALYSIS

Designing the project requires the following inputs:

• The system architecture. You must have the decomposition of the system into
services and other building blocks such as Clients and Managers. While you
could design a project with even a bad architecture, that is certainly less than
ideal. A bad system design will keep changing, and with it, your project design
will change. It is crucial that the system architecture be valid, so that it holds
true over time.
• A list of all project activities. Your list must contain both coding and noncoding
activities. It is straightforward to derive the list of most coding activities by exam-
ining the architecture. The list of noncoding activities is obtained as discussed
previously and is also a product of the nature of the business. For example, a
banking software company will have compliance and regulatory activities.
• Activity effort estimation. Have an accurate estimation of the effort for each
activity in the list of activities. You should use multiple estimation techniques
to drive accuracy.
• Services dependency tree. Use the call chains to identify the dependencies
between the various services in the architecture.
• Activity dependencies. Beyond the dependencies between your services, you
must compile a list of how all activities depend on other activities, coding and
noncoding alike. Add explicit integration activities as needed.
• Planning assumptions. You must know the resources available for the proj-
ect or, more correctly, the staffing scenarios that your plan calls for. If you
have several such scenarios, then you will have a different project design for
each availability scenario. The planning assumptions will include which type of
resource is required at which phase of the project.
PROJECT NETWORK
You can graphically arrange the activities in the project into a network diagram.
The network diagram shows all activities in the project and their dependencies.
You first derive the activity dependencies from the way the call chains propagate
through the system. For each of the use cases you have validated, you should
have a call chain or sequence diagram showing how some interaction between
the system’s building blocks supports each use case. If one diagram has Client A
calling Manager A and a second diagram has Client A calling Manager B, then
Client A depends on both Manager A and Manager B. In this way, you system-
atically discover the dependencies between the components of the architecture.
Figure 7-4 shows the dependency chart of the code modules in a sample Method-
based architecture.
Figure 7-4 Dependency chart of the code modules in a sample Method-based architecture (Clients, Managers, Resources, and Utilities)
The dependency chart shown in Figure 7-4 has several problems. First, it is highly
structural and is missing all the nonstructural coding and noncoding activities.
Second, it is graphically bulky and with larger projects would become visually too
crowded and unmanageable. Third, you should avoid grouping activities together,
as is the case with the Utilities in the figure.
You should turn the diagram in Figure 7-4 into the detailed abstract chart shown
in Figure 7-5. That chart now contains all activities, coding and noncoding alike,
such as architecture and system testing. You may want to also add a side legend
identifying the activities for easy review.
Figure 7-5 The project network as a detailed abstract activity chart, with a legend identifying the activities:
1 Requirements
2 Architecture
3 Project Design
4 Test Plan
5 Test Harness
6 Logging
7 Security
8 Pub/Sub
9 Resource A
10 Resource B
11 Resource Access A
12 Resource Access B
13 Resource Access C
14 EngineA
15 EngineB
16 EngineC
17 ManagerA
18 ManagerB
19 Client App1
20 Client App2
21 System Testing
Activity Times
The effort estimation for an activity alone does not determine when that activity
will complete: Dependencies on other activities also come into play. Therefore, the
time to finish each activity is the product of the effort estimation for that activity
plus the time it takes to get to that activity in the project network. The time to get
to an activity, or the time it takes to be ready to start working on the activity, is the
maximum of time of all network paths leading to that activity. In a more formal
manner, you calculate the time for completing activity i in the project with this
recursive formula:

Ti = Ei + Max(Tj)

where:

• Ti is the time to complete activity i.
• Ei is the effort estimation of activity i.
• Tj is the completion time of each activity j on which activity i directly depends. For an activity with no dependencies, Ti is simply Ei.
The time for each of the preceding activities is resolved the same way. Using regres-
sion, you can start with the last activity in the project and find the completion time
for each activity in the network. For example, consider the activity network in
Figure 7-6.
Figure 7-6 A sample activity network
In the diagram in Figure 7-6, activity 5 is the last activity. Thus, the set of regression
expressions that define the time to finish activity 5 is:

T5 = E5 + Max(T3, T6)
T6 = E6
T3 = E3 + Max(T2, T4)
T4 = E4
T2 = E2 + T1
T1 = E1
Note that the time to finish activity 5 depends on the effort estimation of the pre-
vious activities as much as it depends on the network topology. For example, if all
the activities in Figure 7-6 are of equal duration, then:
T5 = E1 + E2 + E3 + E5
However, if all activities except activity 6 are estimated at 5 days, and activity 6
is estimated at 20 days, then:
T5 = E6 + E5
While you could manually calculate the activity times for small networks such
as Figure 7-6, this calculation quickly gets out of hand with large networks.
Computers excel at regression problems, so you should use tools (such as Microsoft
Project or a spreadsheet) to calculate activity times.
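As a minimal sketch (not the book's accompanying files), the recursive formula takes only a few lines of code; the network below reproduces the dependencies of Figure 7-6 with the second set of estimates (all activities at 5 days except activity 6 at 20 days):

```python
from functools import lru_cache

# Effort estimations (days) and dependencies, mirroring Figure 7-6's second scenario.
effort = {1: 5, 2: 5, 3: 5, 4: 5, 5: 5, 6: 20}
depends_on = {1: [], 2: [1], 3: [2, 4], 4: [], 5: [3, 6], 6: []}

@lru_cache(maxsize=None)
def completion_time(i):
    """Ti = Ei + Max(Tj) over the activities j on which activity i depends."""
    preds = depends_on[i]
    return effort[i] + (max(completion_time(j) for j in preds) if preds else 0)

def critical_path(last):
    """Walk back from the last activity, always following the longest predecessor."""
    path = [last]
    while depends_on[path[-1]]:
        path.append(max(depends_on[path[-1]], key=completion_time))
    return list(reversed(path))

print(completion_time(5))   # 25 days, i.e., E6 + E5
print(critical_path(5))     # [6, 5]
```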
Figure 7-7 A sample project network of 17 activities, with its critical path shown in bold
Based on the effort estimation for each activity and the dependencies, using the
formula given earlier and starting from activity 17, the longest path in the net-
work is shown in bold. That longest path in the network is called the critical path.
You should highlight the critical path in your network diagrams using a different
color or bold lines. Calculating the critical path is the only way to answer the
question of how long it will take to build the system.
Because the critical path is the longest path in the network, it also represents the
shortest possible project duration. Any delay on the critical path delays the entire
project and jeopardizes your commitments.
No project can ever be accelerated beyond its critical path. Put another way, you
must build the system along its critical path to build the system the quickest pos-
sible way. This is true in any project, regardless of technology, architecture, devel-
opment methodology, development process, management style, and team size.
In any project with multiple activities on which multiple people are working, you
will have a network of activities with a critical path. The critical path does not
care if you acknowledge it or not; it is just there. Without critical path analysis, the
likelihood of developers building the system along the critical path is nearly zero.
Working this way is likely to be substantially slower.
Note While the discussion so far has referred to the critical path of the proj-
ect as a singular path, a project can absolutely have multiple critical paths (all
of equal duration), including having all network paths as critical paths. Proj-
ects with multiple critical paths are risky because any delay on any of these
paths will delay the project.
ASSIGNING RESOURCES
During project design, the architect assigns abstract resources (such as Developer
1) to each of the project design options. Only after the decision makers have chosen
a particular project design option can the project manager assign actual resources.
Since any delay in the critical path will delay the project, the project manager
should always assign resources to the critical path first. You should take matters a
step further by always assigning your best resources to the critical path. By “best,”
I mean the most reliable and trustworthy developers, the ones who will not fail
to deliver. Avoid the classic mistake of first assigning developers to high-visibility
but noncritical activities, or to activities that the customer or management care
the most about. Assigning development resources first to noncritical activities does
nothing to accelerate the project. Slowing down the critical path absolutely slows
down the project.
Staffing Level
During project design, for each project design option the architect needs to find
out how many resources (such as developers) the project will require overall. The
architect discovers the required staffing level iteratively. Consider the network in
Figure 7-7, where the critical path is already identified, and assume each node is a
service. How many developers are required on the first day of the project? If you
were given just a single developer, that developer is by definition your best developer,
so the single developer goes to activity 1. If you are given two developers,
then you can assign the second developer to activity 2, even though that activity
is not required until much later. If you are given three developers, then the third
developer is at best idle, and at worst disrupting the developer working on activity
1. Therefore, the answer to the question of how many developers are required on
day 1 of the project is at most two developers.
Next, suppose activity 1 is complete. How many developers are required now?
The answer is at most six (activities 3, 4, 5, 6, 7, and 2 are available). However,
asking for six developers is less than ideal since by the time you have progressed
up the critical path to the level of activities 8 or 12, you need only three or even
two developers. Perhaps it is better to ask for just four developers instead of six
developers once activity 1 is complete. Utilizing only four as opposed to six devel-
opers has two significant advantages. First, you will reduce the cost of the proj-
ect. A project with four developers is 33% less expensive than a project with six
developers. Second, a team of four developers is far more efficient than a team of
six developers. The smaller team will have less communication overhead and less
temptation for interference from the idle hands.
Based on this criterion alone, a team of three or even two developers would be
better than a team of four developers. However, when examining the network
of Figure 7-7 it is likely impossible to build the system with just three developers
and keep the same duration. With so few developers, you will paint yourself into
a corner in which a developer on the critical path needs a noncritical activity that
is simply not ready yet (such as activity 15 needing activity 11). This promotes a
noncritical activity to a critical activity, in effect creating a new and longer critical
path. I call this situation subcritical staffing. When the project goes subcritical, it
will miss its deadline because the old critical path no longer applies.
The real question is not how many resources are required. The question to ask at
any point of the project is:
What is the lowest level of resources that allows the project to progress unimpeded
along the critical path?
Finding this lowest level of resources keeps the project critically staffed at all
points in time and delivers the project at the least cost and in the most efficient
way. Note that the critical level of staffing can and should change throughout the
life of the project.
Imagine a group of developers without project design. The likelihood of that group
constituting the lowest level of resources required to progress unimpeded along the
critical path is nearly zero. The only way to compensate for the unknown staffing
needs of the project is by using horrendously wasteful and inefficient overcapacity
staffing. As illustrated previously, working this way cannot be the fastest way of
completing the project—and now you see it also cannot be the least costly way
of building the system. My experience is that overcapacity can be more expensive
than the lowest cost level by many multiples.
Float-Based Assignment
Returning to the network in Figure 7-7, once you have concluded that you could
try to build the system with only four developers, you face a new challenge: Where
and when will you deploy these four developers? For example, with activity 1
complete, you could assign the developers to activities 3, 4, 5, 6 or 3, 5, 6, 7, or
3, 4, 6, 2, and so on. Even with a simple network, the combinatorial spectrum of
possibilities is staggering. Each of these options would have its own set of possible
downstream assignments.
Fortunately, you do not have to try any of these combinations. Examine activity 2 in
Figure 7-7. You can actually defer assigning resources to activity 2 until the day
that activity 16 (which is on the critical path) must start, minus the estimated
duration of activity 2. Activity 2 can “float” to the top (remain unassigned and
not start) until it bumps against activity 16. All noncritical activities have float,
which is the amount of time you could delay completing them without delaying the
project. Critical activities have no float (or more precisely, their float is zero) since
any delay in these activities would delay the project. When you assign resources to
the project, follow this rule:

Always assign resources to the available activities in order of increasing float, from the lowest float to the highest.
To figure out how to assign developers in the previous example once activity 1
is complete, calculate the float of all activities that are possible once activity 1 is
complete, and assign the four developers based on the float, from low to high. First,
assign a developer to the critical path, not because it is special but because it has
the lowest possible float. Now, suppose activity 2 has 60 days of float and activity 4
has 5 days of float. This means that if you defer getting to activity 4 by more than
5 days, you will derail the project. By contrast, you could defer getting to activity
2 by at most 60 days, so you assign the next developer to activity 4. During the
intervening time while activity 2 remains unassigned, you are in effect consuming
the activity’s float. Perhaps by the time the float of activity 2 has become 15 days,
you will be finally able to assign a developer to this activity.
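Purely as an illustrative sketch, float can be computed with a forward pass (earliest finish) and a backward pass (latest finish); the four activities and estimates below are hypothetical and only loosely echo the example above:

```python
from functools import lru_cache

# Hypothetical mini-network: effort in days and dependencies per activity.
effort = {"1": 10, "2": 5, "16": 10, "17": 5}
depends_on = {"1": [], "2": [], "16": ["1", "2"], "17": ["16"]}
successors = {a: [s for s, preds in depends_on.items() if a in preds] for a in effort}

@lru_cache(maxsize=None)
def earliest_finish(act):
    preds = depends_on[act]
    return effort[act] + (max(earliest_finish(p) for p in preds) if preds else 0)

@lru_cache(maxsize=None)
def latest_finish(act):
    succ = successors[act]
    if not succ:                                    # project end: no room to slip
        return max(earliest_finish(a) for a in effort)
    return min(latest_finish(s) - effort[s] for s in succ)

floats = {a: latest_finish(a) - earliest_finish(a) for a in effort}
print(floats)                                 # critical activities have zero float
print(sorted(floats, key=floats.get))         # assign resources from low float to high
```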
CLASSIC PITFALL
As observed by Tom DeMarco,a most organizations incentivize their managers
to do the wrong thing when it comes to project staffing, even when starting with
the best of intentions. Managers can correctly assign developers to the project
only after project design, which is possible only after the architecture is com-
plete. These design activities, while short in nature, conclude the fuzzy front
end of the project, which itself may take months of scoping the work, prototyp-
ing, evaluating technologies, interviewing customers, analyzing requirements,
and more. There is no point in hiring developers until the project manager can
assign them based on the plan, because otherwise they will have nothing to do.
However, empty offices and desks, for months on end, reflect poorly on the
manager, making it seem as if the manager is just slacking. The manager fears
that when (not if) the project is late (as software projects are known to be), the
manager will get the blame because the manager did not hire the developers
at the beginning of the project. To avoid this liability, as soon as the fuzzy
front end starts, the manager will hire developers to avoid empty offices. The
developers still have nothing to do, so they will play games, read blogs, and
take long lunch breaks. Unfortunately, this behavior reflects even worse on
the manager than the empty offices, because now the perception is that the
manager does not know how to delegate and manage, and the organization
has to pay for it, too.
Again, the manager fears that if the project is late, the manager will be the
one left holding the bag. As soon as the front end starts, the manager will staff
the project and assign feature A to the first developer, feature B to the second
developer, and so on, even though the project lacks a sound architecture or
critical path analysis. When several weeks or months later the architect pro-
duces the architecture and the project design, they will be irrelevant, since the
developers have been working on a completely different system and project. The
project will grossly miss the schedule and blow through any set budget, not just
because of the lack of architecture and critical path analysis, but also because
what took place instead was both functional decomposition of the system and
functional decomposition of the team.
Meanwhile, the manager will be pleading for more time and resources from top management. When the proj-
ect is late (again as most projects are in the software industry), the manager
looks no worse than any other manager in the organization.
Doing the right thing is a lot easier the second time around, when you have
already proven that you know how to deliver on time and on budget. The orga-
nization may never understand how it worked (or why the way other managers
always try it does not work), but cannot argue with results. The first time going
this route, and without a track record of success, you will face a struggle. The
best action is to confront this pitfall head on and make resolving it part of your
project design, as described in Chapter 11.
The nature of this process is iterative both because initially the lowest level of staff-
ing is unknown and because using float-based assignment changes the floats of the
activities. Start by attempting to staff the project with some resource level, such as
six resources, and then assign these resources based on float. Every time a resource
is scheduled to finish an activity, you scan the network for the nearest available
activities, choosing the activity with the lowest float as the next assignment for that
resource. If you successfully staff the project, try again, this time with a reduced
staffing level such as five or even four resources. At some point, you will have an
excess of activities compared with the available resources. If those unassigned activ-
ities have high enough float, you could defer assigning resources to them until some
resources become available. While these activities are unassigned, you will be con-
suming their float. If the activities become critical, then you cannot build the project
with that staffing level, and you must settle for a higher level of resources.
For example, if you were to assign the network depicted in Figure 7-7 to a single
developer, the actual network diagram would be a long string, not Figure 7-7.
The dependency on the single resource drastically changes the network diagram.
Therefore, the network diagram is actually not just a network of activities, but
first and foremost a network of dependencies. If you have unlimited resources
and very elastic staffing, then you can rely only on the dependencies between the
activities. Once you start consuming float, you must add the dependencies on the
resources to the network. The key observation here is that the network reflects the
dependencies on resources just as much as the dependencies between activities.
The actual way of assigning resources to the project network is therefore a product
of multiple variables. When you assign resources, you must take the following into account:
• Planning assumptions
• Critical path
• Floats
• Available resources
• Constraints
These will always result in several project design options, even for straightforward
projects.
SCHEDULING ACTIVITIES
Together, the project network, the critical path, and the float analysis allow you to
calculate the duration of the project as well as when each activity should start with
respect to the project beginning. However, the information in the network is based
on workdays, not on calendar dates. You need to convert the information in the
network to calendar dates by scheduling the activities. This is a task that you can
easily perform by using a tool (such as Microsoft Project). Define all activities in
the tool, then add dependencies as predecessors, and assign the resources accord-
ing to your plan. Once you select a start date for the project, the tool will schedule
all activities. The output may also include a Gantt chart, but that is incidental to
the core piece of information you can now glean from the tool: the planned start
and completion dates for each activity in the project.
Caution Gantt charts in isolation are detrimental because they may give
management the illusion of planning and control. A Gantt chart is merely
one view of the project network, and it does not encompass the full project
design.
STAFFING DISTRIBUTION
The required staffing for your project is not constant with time. At the beginning,
you need only the core team. Once management selects a project design option and
approves the project, you can add resources such as developers and testers.
Not all resources are needed all at once due to the dependencies and the critical
path. Much the same way, not all resources are retired uniformly. The core team
is required throughout, but developers should not be needed through the last day
of the project. Ideally, you should phase in developers at the beginning of the
project as more and more activities become possible, and phase out the developers
toward the end of the project.
This approach of phasing in and phasing out resources has two significant advan-
tages. First, it avoids the feast-or-famine cycles experienced by many software
projects. Even if you have the required average level of staffing for the project,
you could be understaffed in one part of the project and overstaffed in another
part. These cycles of idleness or intense overtime are demoralizing and very inef-
ficient. Second (and more importantly), phasing resources offers the possibility
of realizing economy of scale. If you have several projects in the organization,
then you could arrange them such that developers are always phasing out of one
project while phasing into another. Working this way can yield an increase in productivity
of hundreds of percent, the classic "doing much more with less."
Figure 7-8 shows the expected staffing distribution over the life of a well-designed project. The project starts with only the core team until the SDP review. If the project is terminated at that point, the staffing goes to zero and
the core team is available for other projects. If the project is approved, an initial
ramp-up in staffing occurs in which developers and other resources are working
on the lowest-level activities in the project that enable other activities. When those
activities become available, the project can absorb additional staff. At some point
you have phased in all the resources the project ever needs, reaching peak staffing.
For a while, the project is fully staffed. The system tends to appear at the end of
this phase. Now the project can phase out resources, and those left are working
on the most dependent activities. The project concludes with the level of staffing
required for system testing and release.
Figure 7-8 Expected staffing distribution over the life of the project: core team only until the SDP review, ramp-up to peak staffing, the system appearing near the end of the fully staffed phase, then phase-out
Figure 7-9 shows a staffing distribution chart that demonstrates the behavior of
Figure 7-8. You produce a chart such as Figure 7-9 by first staffing the project,
then listing all the dates of interest (unique dates when activities start and end)
in chronological order. You then count how many resources are required for each
category of resources in each time period between dates of interest. Do not forget
to include in the staffing distribution resources that do not have specific activities
but are nonetheless required, such as the core team, quality control, and develop-
ers between coding activities. This sort of stacked bar chart is trivial to produce in
a spreadsheet. The files accompanying this book contain several example projects
and templates for these charts.
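A spreadsheet is all you need, but as an illustrative sketch the same counting can be expressed in a few lines of code; the assignments below are hypothetical:

```python
from datetime import date

# Hypothetical scheduled assignments: (resource category, start date, end date).
assignments = [
    ("Core team", date(2024, 1, 1),  date(2024, 7, 15)),
    ("Developer", date(2024, 3, 1),  date(2024, 5, 15)),
    ("Developer", date(2024, 4, 1),  date(2024, 7, 1)),
    ("Tester",    date(2024, 6, 1),  date(2024, 7, 15)),
]

# Dates of interest: every unique start or end date, in chronological order.
dates = sorted({d for _, start, end in assignments for d in (start, end)})

# Staffing level per category in each period between consecutive dates of interest.
for period_start, period_end in zip(dates, dates[1:]):
    counts = {}
    for category, start, end in assignments:
        if start <= period_start and end >= period_end:
            counts[category] = counts.get(category, 0) + 1
    print(f"{period_start} - {period_end}: {counts}")
```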
Since the dates of interest may not be regularly spaced, the bars in the staffing
distribution chart may vary in time resolution. However, in most decent-size proj-
ects with enough activities, the overall shape of the chart should follow that of
Figure 7-8. By examining the staffing distribution chart, you get a quick and valu-
able feedback on the quality of your project design.
Figure 7-9 A sample staffing distribution chart, showing the staffing level of each resource category (developers, testers, architect, product manager, and project manager) over time
Staffing Mistakes
Several common project staffing mistakes may be evident in the staffing distribu-
tion chart. If the chart looks rectangular, it implies constant staffing—a practice
against which I have already cautioned.
A staffing distribution with a huge peak in the middle of the chart (as shown in
Figure 7-10) is also a red flag: Such a peak always indicates waste.
Figure 7-10 A staffing distribution with a peak
Consider the effort expended in hiring people and training them on the domain,
architecture, and technology when you use them for only a short period of time.
A peak is usually caused by not consuming enough float in the project, resulting in
a spike in resource demand. If the project were to trade some float for resources,
the curve would be smoother. Figure 7-11 depicts a sample project with a peak in
staffing.
Figure 7-11 A sample project with a peak in staffing
A flat line in the staffing distribution chart (as shown in Figure 7-12) is yet another
classic mistake. The flat line indicates the absence of the high plateau of Figure 7-8.
The project is likely subcritical and is missing the resources to staff the noncritical
activities of the original plan.
Figure 7-12 A flat staffing distribution
Figure 7-13 shows the staffing distribution for a sample subcritical project. This
project goes subcritical at a level of 11 or 12 resources. It is not just missing the
plateau, but has a valley instead.
Figure 7-13 Staffing distribution of a sample subcritical project
Erratic staffing distributions (as in Figure 7-14) are yet another distress signal.
Projects that are designed with this kind of elasticity in mind are due for a disappointment
(see Figure 7-15) because staffing can never be that elastic. Most
projects cannot conjure people out of thin air, have them be instantly productive,
and then dispose of them a moment later. In addition, when people constantly
come and go from a project, training (or retraining) them is very expensive.
It is difficult to hold people accountable or retain their knowledge under such
circumstances.
Figure 7-14 An erratic staffing distribution
Figure 7-15 A sample project designed around unrealistically elastic staffing
Figure 7-16 illustrates another staffing distribution to avoid, the high ramp-up
coming into the project. While this figure does not include any numbers, the chart
clearly indicates wishful thinking. No team can instantly go from zero to peak
staffing and have everyone add value and deliver high-quality, production-worthy
code. Even if the project initially has that much parallel work, and even if you have
the resources, the network downstream throttles how many resources the project
can actually absorb beyond that, and the required staffing fizzles out.
Figure 7-16 A staffing distribution with a high initial ramp-up
Figure 7-17 demonstrates such a project. This plan expects instantaneously to get
to 11 people, and shortly afterward deflates to around six people until the end of
Figure 7-17 A sample project that expects an instant ramp-up to peak staffing
the project. It is improbable that any team can ramp up this way, and the available
resources are used inefficiently due to the oversized team.
As just mentioned, the two root causes of incorrect staffing are assuming too elas-
tic staffing and not consuming float when assigning resources. When considering
staffing elasticity, you have to know your team and have a good grasp on what is
feasible as far as availability and efficiency. The degree of staffing elasticity also
depends on the nature of the organization and the quality of the system and proj-
ect design. The better the designs, the more quickly developers can come to terms
with the new system and activities. Consuming float is easy to do in most projects
and likely to reduce both the volatility in the staffing and the absolute level of the
required staffing. Being more realistic about staffing elasticity and consuming float
often eliminate the peaks, the ups-and-downs, and the high ramp-ups.
PROJECT COST
Plotting the staffing distribution chart for each project design option is a great
validation aid in reflecting on the option and seeing if it makes sense. With project
design, if something does not feel right, more often than not, something is indeed
wrong.
Drawing the staffing distribution chart offers another distinct benefit: It is how
you figure out the cost of the project. Unlike physical construction projects, soft-
ware projects do not have a cost of goods or raw materials. The cost of software
is overwhelmingly in labor. This labor includes all team members, from the core
team to the developers and testers. Labor cost is simply the staffing level multiplied
by time:

Cost = Staffing * Time

Multiplying staffing by time is actually the area under the staffing distribution
chart. To calculate the cost, you need to calculate that area.
The staffing distribution chart is a discrete model of the project that has vertical
bars (the staffing level) in each time period between dates of interest. You calculate
the area under the staffing distribution chart by multiplying the height of each
vertical bar (the number of people) by the duration of the time period between its
dates of interest (Figure 7-18). You then sum the results of these multiplications.
Figure 7-18 Calculating the area under the staffing distribution chart: each bar contributes its staffing level Si multiplied by the length of its time period (Ti − Ti−1)
The formula for the calculation of the area under the staffing chart is:

Cost = Σ Si * (Ti − Ti−1), summed over i = 1 to n

where:

• Si is the staffing level during the i-th time period.
• Ti and Ti−1 are the dates of interest bounding the i-th time period.
• n is the number of time periods between dates of interest.
Finding the area under the staffing distribution chart is the only way to answer the
question how much the project will cost.
If you use a spreadsheet to produce the staffing distribution chart, you just need to
add another column with a running sum to calculate the area under the chart (in
essence, a numerical integration). The support files accompanying this book contain
several examples of this calculation.
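As a minimal sketch of that numerical integration (the staffing levels and period lengths below are hypothetical):

```python
# Hypothetical staffing distribution: (period length in months, staffing level).
periods = [(1.0, 2), (2.0, 6), (3.0, 8), (2.0, 5), (1.0, 3)]

# The cost is the area under the staffing distribution chart.
cost = sum(staffing * duration for duration, staffing in periods)
print(f"Project cost: {cost:.0f} man-months")   # 51 man-months
```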
Since cost is defined as staffing multiplied by time, the units of cost should be
effort and time, such as man-month or man-year. It is better to use these units as
opposed to currency to neutralize differences in salary, local currencies, and budgets.
It then becomes possible to objectively compare the cost of different project
design options.
Given the architecture, the initial work breakdown, and the effort estimation, it is
a matter of a few hours to a day at the most to answer the questions of how long it
will take and how much it will cost to build the system. Sadly, most software proj-
ects are running blind. This is as sensible as playing poker without ever looking at
the cards—except, instead of chips, you have your project, your career prospects,
or even the company’s future on the line.
PROJECT EFFICIENCY
Once the project cost is known, you can calculate the project efficiency. The
efficiency of a project is the ratio between the sum of effort across all activities
(assuming perfect utilization of people) and the actual project cost. For example,
if the sum of effort across all activities is 10 man-months (assuming 30 workdays
in a month), and the project cost is 50 man-months (of regular workdays), then the
project efficiency is 20%.
The project efficiency is a great indicator of the quality and sanity of the project’s
design. The expected efficiency of a well-designed system, along with a properly
designed and staffed project, ranges between 15% and 25%.
These efficiency rates may seem appallingly low, but higher efficiency is actually
a strong indicator of an unrealistic project plan. No process in nature can ever
even approach 100% efficiency. No project is free from constraints, and these
constraints prevent you from leveraging your resources in the most efficient way.
By the time you add the cost of the core team, the testers, the Build and DevOps,
and all the other resources associated with your project, the portion of the effort
devoted to just writing code is greatly diminished. Projects with high efficiency
such as 40% are simply impossible to build.
Even 25% efficiency is on the high side and is predicated on having a correct
system architecture that will provide the project with the most efficient team (see
Figure 7-1) and a correct project design that uses the smallest level of resources
and assigns them based on floats. Additional factors required for delivering on
high efficiency expectations include a small, experienced team whose members
are accustomed to working together, and a project manager who is committed to
quality and can handle the complexity of the project.
Efficiency also correlates with staffing elasticity. If staffing were truly elastic (i.e.,
you could always get resources just when you need them and let them go at the
precise moment when you no longer need them), the efficiency would be high. Of
course, staffing is never that elastic, so sometimes resources will be idle while still
assigned to the project, driving the efficiency down. This is especially the case
when utilizing resources outside the critical path. If a single person is working on
all critical activities, that person is actually at peak efficiency because the person
works on activities back-to-back, and the cost of that effort approaches the sum of
the cost of the critical activities. With noncritical activities, there is always float.
Since staffing is never truly elastic, the resources outside the critical path can never
be utilized at very high efficiency.
If the project design option has a high expected efficiency, you must investigate
the root cause. Perhaps you assumed too liberal and elastic staffing or the project
network is too critical. After all, if most network paths are either critical or near-
critical (most activities have low float), then you would get a high efficiency ratio.
However, such a project is obviously at high risk of not meeting its commitments.
You can use efficiency as yet another broad project estimation technique. Suppose
you know that historically your projects were 20% efficient. Once you have your
individual activity breakdown and their estimations, simply multiply the sum of
effort (assuming perfect utilization) across all activities by 5 to produce a rough
overall project cost.
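A minimal sketch of both uses of efficiency, with hypothetical numbers that echo the 20% example above:

```python
# Hypothetical per-activity estimations (man-months) and project cost (man-months).
activity_effort = [2.0, 1.5, 2.0, 1.0, 2.0, 1.5]
project_cost = 50.0                       # area under the staffing distribution chart

efficiency = sum(activity_effort) / project_cost
print(f"Efficiency: {efficiency:.0%}")    # 20%

# Using a known historical efficiency as a broad overall estimation technique:
historical_efficiency = 0.20
rough_cost = sum(activity_effort) / historical_efficiency
print(f"Rough overall cost: {rough_cost:.0f} man-months")   # 50
```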
EARNED VALUE PLANNING

Earned value planning quantifies planned progress over time: as each activity completes, the project earns that activity's share of the total estimated effort. The planned earned value at time t is:

EV(t) = Σ Ei (over the activities completed by time t) / Σ Ei (over all N activities)

where:

• Ei is the estimated duration of activity i.
• N is the total number of activities in the project.

In other words, the earned value at time t is the sum of the estimated durations of all
activities completed by time t divided by the sum of the estimated durations
of all activities.
Table 7-1 Sample activities, their estimated durations, and their earned value

Activity            Duration (days)    Earned Value
Front End           40                 20%
Access Service      30                 15%
UI                  40                 20%
Manager Service     20                 10%
Utility Service     40                 20%
System Testing      30                 15%
The sum of estimated duration across all activities in Table 7-1 is 200 days. The
UI activity, for example, is estimated at 40 days. Since 40 is 20% of 200, you could
state that by completing the UI activity, you have earned 20% toward the comple-
tion of the project. From your scheduling of activities you also know when the UI
activity is scheduled to complete, so you can actually calculate how you plan to
earn value as a function of time (Table 7-2).
Table 7-2 Planned earned value as a function of time

Activity            Completion Time    Earned Value    Cumulative Earned Value
Start               0                  0%              0%
Front End           t1                 20%             20%
Access Service      t2                 15%             35%
UI                  t3                 20%             55%
Manager Service     t4                 10%             65%
Utility Service     t5                 20%             85%
System Testing      t6                 15%             100%
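As a minimal sketch using the Table 7-1 numbers, the cumulative planned earned value column follows directly from the estimations:

```python
# Activities and estimated durations in days (Table 7-1), in planned completion order.
plan = [("Front End", 40), ("Access Service", 30), ("UI", 40),
        ("Manager Service", 20), ("Utility Service", 40), ("System Testing", 30)]

total = sum(duration for _, duration in plan)        # 200 days
earned = 0
for name, duration in plan:
    earned += duration
    print(f"{name:16} +{duration / total:5.1%}   cumulative {earned / total:6.1%}")
```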
Such a chart of planned progress is shown in Figure 7-19. By the time the project
reaches the planned completion date, it should have earned 100% of the value.
The key observation in Figure 7-19 is that the pitch of the planned earned value
curve represents the throughput of the team. If you were to assign exactly the
same project to a better team, they would meet the same 100% of earned value
sooner, so their line would be steeper.
Figure 7-19 Planned earned value over time, reaching 100% at the planned completion date
CLASSIC MISTAKES
The realization that you can gauge the expected throughput of the team from the
earned value chart enables you to quickly discern mistakes in your project plan.
For example, consider the planned earned value chart in Figure 7-20. No team in
the world could ever deliver on such a plan. For much of the project, the expected
throughput was shallow. What kind of miracle of productivity would deliver the
rocket launch of earned value toward the end of the project?
Figure 7-20 An unrealistic planned earned value chart
Such unrealistic, overly optimistic plans are usually the result of back-scheduling.
The plan may even start with the best of intentions, progressing along the critical
path. Unfortunately, you find that someone has already committed the project on
a specific date with no regard for a project design or the actual capabilities of the
team. You then take the remaining activities and cram them against the deadline,
basically back-scheduling from that. Only by plotting the planned earned value
will you be able to call attention to the impracticality of this plan and try to avert
failure. Figure 7-21 depicts a project with such behavior.
[Figure 7-21: Planned earned value for a back-scheduled project, plotted from 9/16 through 6/16]
In much the same way, you can detect unrealistically pessimistic plans such as that
shown in Figure 7-22. This project starts out well, but then productivity is expected to
suddenly diminish—or more likely, the project was given much more time than was
required. The project in Figure 7-22 will fail because it allows for gold plating and
complexity to raise their heads. You can even extrapolate from the healthy part of the
curve when the project should have finished (somewhere above the knee in the curve).
A properly staffed and well-designed project always results in a shallow
S curve for the earned value chart, as shown in Figure 7-23.
[Figure 7-22: An unrealistically pessimistic planned earned value curve]
[Figure 7-23: A shallow S planned earned value curve, annotated with the SDP review, construction, end of system testing, and the planned completion date]
The shape of the planned earned value curve is related to the planned staffing dis-
tribution. At the beginning of the project, only the core team is available, so not
much measurable value is added at the front end, and the pitch of the earned value
curve is almost flat. After the SDP review, the project can start adding people.
As you increase the size of the team, you will also increase its throughput, so the
earned value curve gets steeper and steeper. At some point you reach peak staffing.
For a while the team size is mostly fixed, so there is a straight line at maximum
throughput in the center of the curve. Once you start phasing out resources, the
earned value curve levels off until the project completes. Figure 7-24 shows a sam-
ple shallow S curve.
Every process that involves change can be modeled using a logistic function.
For example, the temperature in a room rises and falls according to a logistic
function, as does your body weight, the market share of a company, radio-
active decay, the risk of burning your skin as a function of distance from a
flame, statistical distributions, population growth, effectiveness of design, the
intelligence of neural networks, and pretty much everything else. The logistic
function is the single most important function known to mankind because it
enables us to quantify and model the world—a world that is highly dynamic.
a. https://en.wikipedia.org/wiki/Logistic_function
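For reference, the standard form of the logistic function (as defined in the article cited above) is

f(x) = \frac{L}{1 + e^{-k(x - x_0)}}

where L is the curve's maximum value, x_0 is the midpoint of the curve, and k sets its steepness.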
[Figure 7-24: A sample shallow S planned earned value curve, plotted from 07/01 through 11/12]
The earned value curve is a simple and easy way to answer the question: “Does
the plan make sense?” If the planned earned value is a straight line, or it exhibits
the issues of Figure 7-20 or Figure 7-22, the project is in danger. If it looks like a
shallow S, then at least you have hope that the plan is sound and sensible.
The architect designs the project as a continuous design effort following the sys-
tem design. This process is identical to that used in every other engineering disci-
pline: The design of the project is part of the engineering effort and is never left for
the construction workers and foremen to figure out on-site or on the factory floor.
The architect is not responsible for managing and tracking the project. Instead,
the project manager assigns the actual developers to the project and tracks their
progress against the plan. When things change during execution, both the project
manager and the architect need to close the loop together and redesign the project.
The realization that the architect should design the project is part of the maturity
of the role of the architect. The demand for architects emerged in the late
1990s in response to the increased cost of ownership and complexity of software
systems. Architects are now required to design systems that enable maintainability,
reusability, extensibility, feasibility, scalability, throughput, availability, respon-
siveness, performance, and security. All of these are design attributes, and the way
to address them is not via technology or keywords, but with correct design.
However, that list of design attributes is incomplete. This chapter started with the
definition of success, and to succeed you must add to that list schedule, cost, and
risk. These are design attributes as much as the others, and you provide them by
designing the project.
8
NETWORK AND FLOAT
The project network acts as a logical representation of the project for planning
purposes. The technique for analyzing the network is called the critical path
method, although it has as much to do with the noncritical activities as it does
with the critical ones. Critical path analysis is admirably suited for complex proj-
ects, ranging from physical construction to software systems, and it has a decades-
long proven track record of success. By performing this analysis, you find the
project duration and determine where and when to assign your resources.
Network diagrams are often deliberately not shown to scale so that you can focus
purely on dependencies and general topology of the network. Avoiding scale in
most cases also simplifies the design of the project. Attempts to keep the network
diagram to scale will impose a serious burden when estimations change, when you
add or remove activities, or when you reschedule activities.
There are two possible representations of a project network diagram: a node dia-
gram and an arrow diagram (Figure 8-1).
Figure 8-1 A node diagram (left) and the equivalent arrow diagram (right)
With an arrow diagram, all activities must have a start event and a completion
event. It is also good practice to add an overall start and completion event for the
project as a whole.
Dummy Activities
Suppose in the network of Figure 8-1, activity 4 also depends on activity 1. If activ-
ity 2 already depends on activity 1, the arrow diagram has a problem, because you
cannot split the arrow of activity 1. The solution is to introduce a dummy activity
between the completion event of activity 1 and the start event of 4 (shown as a
dashed arrow in Figure 8-2). The dummy activity is an activity of zero duration
whose sole purpose is to express the dependency on its tail node.
[Figure 8-2: The arrow diagram of Figure 8-1 with a dummy activity (dashed arrow) expressing the dependency of activity 4 on activity 1]
Nearly everyone requires a bit of practice to correctly draw and read an arrow
diagram, whereas people intuitively draw node diagrams and understand them,
giving node diagrams what appears to be a clear advantage. Node diagrams at
first seem to have no need for a dummy activity because you can just add another
dependency arrow (such as another arrow between activities 1 and 4 on the left
side of Figure 8-1). For these simplistic reasons, the vast majority of tools for
drawing network diagrams use node diagrams.
Figure 8-3 Repeated dependencies in node versus arrow diagrams [Adopted and
modified from James M. Antill and Ronald W. Woodhead, Critical Path in Construction
Practice, 4th ed. (Wiley, 1990).]
Figure 8-3 depicts two identical networks, both comprising six activities: 1, 2,
3, 4, 5, 6. Activities 4, 5, and 6 all depend on activities 1, 2, and 3. With an
arrow diagram the network is straightforward and easy to understand, while the
corresponding node diagram is an entangled cat’s cradle. You can clean up the
node diagram by introducing a dummy node of zero duration, but that may get
confused with a milestone.
As it turns out, the situation in Figure 8-3 is very common in well-designed soft-
ware systems in which you have repeated dependencies across the layers of the
architecture. For example, activities 1, 2, and 3 could be ResourceAccess services,
and activities 4, 5 and 6 could be some Managers and Engines, each using all
three ResourceAccess services. With node diagrams, it is difficult to figure out
what is going on even in a simple project network like that shown in Figure 8-3.
By the time you add Resources, Clients, and Utilities, the diagram becomes utterly
incomprehensible.
Consequently, you should avoid node diagrams and use arrow diagrams. The
initial arrow diagram learning curve is more than offset by the benefits of having
a concise, clear, clutter-free model of your project. The lack of widely available
tool support for arrow diagrams, which forces you to draw your arrow diagram
manually, is not necessarily a bad thing. Drawing the network by hand is valuable
because in the process you review and verify activity dependencies, and it may
even unveil additional insights about the project.
In the 1960s, NASA used the critical path method as the principal planning tool
for catching up and winning the race to the moon.d The critical path method
gained in reputation after the role it played in salvaging the much delayed
Sydney Opera House project,e and in ensuring the rapid construction of the
World Trade Center in New York City (then the tallest buildings in the world),
both completed in 1973.
a. https://en.wikipedia.org/wiki/Critical_path_method#history
b. https://en.wikipedia.org/wiki/Program_evaluation_and_review_technique#history
c. James E. Kelley and Morgan R. Walker, “Critical Path Planning and Scheduling,” Proceedings of
the Eastern Joint Computer Conference, 1959.
d. https://ntrs.nasa.gov/search.jsp?R=19760036633
e. James M. Antill and Ronald W. Woodhead, Critical Path in Construction Practice, 4th ed. (Wiley,
1990).
FLOATS
Activities on the critical path must complete as soon as planned to avoid delaying
the project. Noncritical activities may be delayed without slipping the schedule; in
other words, they can float until they must begin. A project without any float,
where all network paths are critical, could in theory meet its commitments, but
in practice any misstep anywhere will cause a delay. From a design perspective,
floats are the project’s safety margins. When designing a project, you always want
to reserve enough float in the network. The development team can then consume
this float to compensate for unforeseen delays in noncritical activities. Low-float
projects are at high risk of delays. Anything more than a minor delay on a low-
float activity will cause that activity to become critical and derail the project.
The discussion of floats so far was somewhat simplistic because there are actually
several types of floats. This chapter discusses two types: total float and free float.
TOTAL FLOAT
An activity’s total float is by how much time you can delay the completion of that
activity without delaying the project as a whole. When the completion of an activ-
ity is delayed by an amount less than its total float, its downstream activities may
be delayed as well, but the completion of the project is not delayed. This means
that total float is an aspect of a chain of activities, not just particular activities.
Consider the network in the top part of Figure 8-4, which shows the critical path
in bold lines and a noncritical path or chain of activities above it.
[Figure 8-4: Total float — a noncritical chain above the critical path (top), and the same chain after an upstream delay consumes the shared total float (bottom)]
For the purpose of this discussion, Figure 8-4 is drawn to scale so that the length of
each line corresponds to each activity’s duration. The noncritical activities all have
the same amount of total float, indicated by the red line at the end of the activity’s
arrow. Imagine that the start of the first noncritical activity in the top half of the
figure is delayed or that the activity takes longer than its estimation. While that
activity executes, the delay in completing the upstream activity consumes the total
float of the downstream activities (shown in the bottom half of the figure).
All noncritical activities have some total float, and all activities on the same non-
critical chain share some of the total float. If the activities are also scheduled to
start as soon as possible, then all activities on the same chain will have the same
amount of total float. Consuming the total float somewhere further up a chain will
drain it from the downstream activities, making them more critical and riskier.
Note As you will see later in this chapter, the total float of each activ-
ity in the network is a key project design consideration. In the rest of this
book, references to just "float" always refer to total float.
FREE FLOAT
An activity’s free float is by how much time you can delay the completion of that
activity without disturbing any other activity in the project. When the completion
of an activity is delayed by an amount less or equal to its free float, the down-
stream activities are not affected at all, and of course the project as a whole is not
delayed. Consider Figure 8-5.
[Figure 8-5: Free float — the first activity on a noncritical chain delayed within its free float (bottom), leaving downstream activities unaffected]
Again, for the purpose of this discussion, Figure 8-5 is drawn to scale. Suppose
the first activity in the noncritical chain in the top part of the figure has some free
float, indicated by the dotted red line at the end of the activity’s arrow. Imagine
that the activity is delayed by an amount of time less than (or equal to) its free
float. You can see that the downstream activities are unaware of that delay (bot-
tom part of diagram).
Interestingly, while any noncritical activity always has some total float, an activity
may or may not have free float. If you schedule your noncritical activities to start
as soon as possible, back to back, then even though these activities are noncritical,
their free float is zero because any delay will disrupt the other noncritical activities
on the chain. However, the last activity on a noncritical chain that connects back
to the critical path always has some free float (or it would be a critical activity, too).
Free float has little use during project design, but it can prove very useful during
project execution. When an activity is delayed or exceeds its effort estimation, the
free float of the delayed activity enables the project manager to know how much
time is available before other activities in the project will be affected, if at all. If
the delay is less than the delayed activity’s free float, nothing really needs to be
done. If the delay is greater than the free float (but less than the total float), the
project manager can subtract the free float from the delay and accurately gauge
the degree by which the delay will interfere with downstream activities and take
appropriate actions.
CALCULATING FLOATS
The floats in the project network are a function of the activity durations, their
dependencies, and any delays you may introduce. None of these have to do with
actual calendar dates when you schedule these activities. You can calculate the
floats even if the actual start date of the project is as yet undecided.
In most decent-size networks, such float calculations, if done manually, are error
prone, get quickly out of hand, and are invalidated by any change to the network.
The good news is that these are purely mechanical calculations, and you should use
tools for calculating the floats.1 With the total float values at hand, you can record
them on the project network as shown in Figure 8-6. This figure shows a sample
project network in which the numbers in black are each activity’s ID, and the num-
bers in blue below the arrows are the total float for the noncritical activities.
[Figure 8-6: A sample project network annotated with activity IDs (black) and total float values (blue)]
1. You can use Microsoft Project to calculate the floats for each activity by inserting the Total Slack
and the Free Slack columns, which correspond to the total and free float. To learn how to man-
ually calculate floats, see James M. Antill and Ronald W. Woodhead, Critical Path in Construction
Practice, 4th ed. (Wiley, 1990).
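As an illustration of how mechanical these calculations are, the following is a minimal sketch (not tied to any particular tool) of the forward and backward passes over a small, hypothetical five-activity node-style network, computing each activity's total float and free float:

```python
# Critical path method floats via forward/backward passes over a small node diagram.
# durations: activity -> estimated duration; deps: activity -> set of predecessors.
durations = {"1": 5, "2": 10, "3": 15, "4": 20, "5": 5}            # hypothetical activities
deps = {"1": set(), "2": {"1"}, "3": {"1"}, "4": {"2", "3"}, "5": {"3"}}

order = []                                   # topological order (simple repeated scan)
while len(order) < len(durations):
    for a in durations:
        if a not in order and deps[a] <= set(order):
            order.append(a)

early_start, early_finish = {}, {}
for a in order:                              # forward pass: earliest start/finish
    early_start[a] = max((early_finish[p] for p in deps[a]), default=0)
    early_finish[a] = early_start[a] + durations[a]

project_end = max(early_finish.values())
successors = {a: {b for b in durations if a in deps[b]} for a in durations}

late_finish, late_start = {}, {}
for a in reversed(order):                    # backward pass: latest start/finish
    late_finish[a] = min((late_start[s] for s in successors[a]), default=project_end)
    late_start[a] = late_finish[a] - durations[a]

for a in order:
    total_float = late_finish[a] - early_finish[a]
    free_float = min((early_start[s] for s in successors[a]), default=project_end) - early_finish[a]
    print(f"Activity {a}: total float {total_float}, free float {free_float}")
```

In this sample network, activities 1, 3, and 4 come out with zero float (the critical path), activity 2 has 5 days of total and free float, and activity 5 has 15 days of each.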
While only total float is required for project design, you can also record free float
in the network diagram. The project manager will find this information invaluable
during project execution.
VISUALIZING FLOATS
Capturing the information about the floats on the network diagram, as shown in
Figure 8-6, is not ideal. Human beings process alphanumeric data slowly and have
a hard time relating to such data. It is difficult to examine complex networks (or
even simple ones, like that in Figure 8-6) and assess at a glance the criticality
of the network. The criticality of the network indicates where the risky areas are
and how close the project is to an all-critical network. Total floats are better visu-
alized by color-coding the arrows and nodes—for example, using red for low float
values, yellow for medium float values, and green for high float values. You can
partition the three float value ranges in several ways (a small classification sketch follows the list):
• Relative criticality. Relative criticality divides the maximum value of the float
of all activities in the network into three equal parts. For example, if the maxi-
mum float is 45 days, then red would be 1 to 15 days, yellow would be 16 to 30
days, and green would be 31 to 45 days of float. This technique works well if the
maximum float is a large number (such as greater than 30 days) and the floats
are uniformly distributed up to that number.
• Exponential criticality. Relative criticality assumes that the risk for delay is
somewhat equally spread across the range of float. In reality, an activity with 5
days of float is much more likely to derail the project than an activity with 10
days of float, even though both may be classified as red by relative criticality. To
address this issue, the exponential criticality divides the range of the maximum
float into three unequal, exponentially smaller ranges. I recommend making the
divisions at 1/9 and 1/3 of the range: These divisions are reasonably sized but
more aggressive than those produced by 1/4 and 1/2, and the divisors are pro-
portional to the number of colors. For example, if the maximum float is 45 days,
then red would be 1 to 5 days, yellow would be 6 to 15 days, and green would
be 16 to 45 days of float. As with relative criticality, the exponential criticality
works well if the maximum total float is a large number (such as greater than 30
days) and the floats are uniformly distributed up to that number.
• Absolute criticality. The absolute criticality classification is independent of both
the value of the maximum float and how uniformly the floats are distributed
along the range. The absolute criticality sets an absolute float range for each
color classification. For example, red activities would be those with 1 to 9 days
of float, yellow would be 10 to 26 days of float, and green activities would be 27
days of float (or more). Absolute criticality classification is straightforward and
works well in most projects. The downside is that it may require customizing
the ranges to the project at hand to reflect the risk. For example, 10 days may
be green in a 2-month project but red in a year-long project.
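As a rough sketch of the three classifications, using the example thresholds above and hypothetical float values:

```python
# Classify activity floats into red/yellow/green under the three schemes discussed above.
def relative(float_days, max_float):
    # Three equal bands of the maximum float.
    if float_days <= max_float / 3:
        return "red"
    if float_days <= 2 * max_float / 3:
        return "yellow"
    return "green"

def exponential(float_days, max_float):
    # Bands at 1/9 and 1/3 of the maximum float.
    if float_days <= max_float / 9:
        return "red"
    if float_days <= max_float / 3:
        return "yellow"
    return "green"

def absolute(float_days, red_max=9, yellow_max=26):
    # Fixed ranges, independent of the maximum float (here 1-9, 10-26, and 27+ days).
    if float_days <= red_max:
        return "red"
    if float_days <= yellow_max:
        return "yellow"
    return "green"

floats = [5, 10, 25, 30, 45]            # hypothetical total float values (days)
max_float = max(floats)
for f in floats:
    print(f, relative(f, max_float), exponential(f, max_float), absolute(f))
```

With a maximum float of 45 days, this reproduces the ranges given above: relative criticality breaks at 15 and 30 days, exponential at 5 and 15 days, and absolute at the fixed 9- and 26-day thresholds.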
Figure 8-7 shows the same network as in Figure 8-6 with color coding for absolute
criticality classification using the absolute float ranges just suggested. The critical
activities in black have no float.
[Figure 8-7: The network of Figure 8-6 color-coded by absolute criticality]
Compare the ease with which you can interpret the visual information in Figure 8-7
versus the same textual information in Figure 8-6. You can immediately see that
the second part of the project is risky.
FLOATS-BASED SCHEDULING
As stated in Chapter 7, the safest and most efficient way to assign resources to
activities is based on float—or, given the definition of this chapter, total float.
This is the safest method because you address the riskier activities first, and it
is the most efficient because you maximize the percentage of time for which the
resources are utilized.
Consider the snapshot of a scheduling chart shown in Figure 8-8. Here the length
of each colored bar represents the time-scaled duration of that activity, and the
right or left position is aligned with the schedule.
[Figure 8-8: A scheduling chart snapshot — activities 1 through 4, each staffed by its own developer]
The figure has four activities: 1, 2, 3, 4. All activities are ready to start on the same
date. Due to downstream activities (not shown), activity 1 is critical, while activi-
ties 2, 3, and 4 have various levels of total float indicated by their color coding: 2
is red (low float), 3 is yellow (medium float), and 4 is green (high float). Suppose
these are all development activities that all developers can perform equally well,
and that there are no task continuity issues. When staffing this project, you first
must assign a developer to the critical activity 1. If you have a second developer,
you should assign that developer to activity 2, which has the lowest float of all
other activities. This way, you can utilize as many as four developers, working on
each of the activities as soon as possible.
Alternatively, you can staff the project with only two developers (see Figure 8-9).
As before, the first developer works on activity 1. The second developer starts
with activity 2 as soon as possible because there is no point in postponing a near-
critical activity and making it critical.
Once activity 2 is complete, the second developer moves to the remaining activ-
ity with the lowest float, activity 3. This requires rescheduling 3 further down
the timeline, until the second developer is available after completing activity 2.
This is only possible by consuming (reducing) the float of activity 3.
[Figure 8-9: The same activities staffed with only two developers — the second developer works activities 2, 3, and 4 in float order]
This form of staffing trades float for resources and, in effect, for cost. When
assigning resources, you use floats in two ways: You assign resources to available
activities based on floats, low to high, and if needed, you consume activities' float
to staff the project with a smaller level of resources without delaying the project.
Note Pushing activity 3 down the timeline until the developer who
worked on activity 2 is available is identical to making activity 3 dependent
on activity 2. This is a good way of changing the network to reflect a de-
pendency on a resource. Recall from Chapter 7 that the network diagram is
not merely a network of activities but a network of dependencies, and that
resource dependencies are dependencies.
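A minimal sketch of float-based assignment follows; the activities, durations, and floats are hypothetical, and dependencies are ignored for brevity:

```python
# Float-based assignment: ready activities sorted by total float (low to high),
# each picked up by whichever developer becomes free first.
activities = [("1", 10, 0), ("2", 8, 3), ("3", 6, 12), ("4", 5, 25)]  # (name, duration, total float)
developers = {"Dev A": 0, "Dev B": 0}            # developer -> day on which they become free

for name, duration, total_float in sorted(activities, key=lambda a: a[2]):
    dev = min(developers, key=developers.get)    # the developer who is available soonest
    start = developers[dev]
    developers[dev] = start + duration           # pushing an activity down the timeline consumes its float
    print(f"{dev} starts activity {name} (float {total_float}) on day {start}")
```

The critical activity is staffed first, and each near-critical activity is picked up before the ones with more float, exactly because the float is the safety margin you can afford to trade away.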
• Assure quality. Most teams incorrectly refer to their quality control and testing
activities as quality assurance (QA). True QA has little to do with testing. It
typically involves a single, senior expert who answers the question: What will it
take to assure quality? The answer must include how to orient the entire devel-
opment process to assure quality, how to prevent problems from ever happen-
ing, and how to track the root causes of problems and to fix them. The presence
of a QA person is a sign of organizational maturity and is almost always indic-
ative of commitment to quality, of understanding that quality does not happen
on its own, and of acknowledging that the organization must actively pursue
quality. The QA person is sometimes responsible for designing the process and
authoring procedures for key phases. Since quality leads to productivity, proper
QA always accelerates the schedule, and it sets organizations that practice QA
apart from the rest of the industry.
• Employ test engineers. Test engineers are not testers, but rather full-fledged
software engineers who design and write code whose objective is to break the
system’s code. Test engineers, in general, are a higher caliber of engineer than
regular software engineers, because writing test engineering code often involves
more difficult tasks: developing fake communication channels; designing and
developing regression testing; and designing test rigs, simulators, automation,
and more. Test engineers are intimately familiar with the architecture and inner
workings of the system, and they take advantage of that knowledge to try to
break the system at every turn. Having such an “anti-system” system ready to
tear your product apart does wonders for quality because you can discover prob-
lems as soon as they occur, isolate root causes, avoid ripple effects of changes,
eliminate superimposition of defects masking other defects, and considerably
shorten the cycle time of fixing problems. Having a constant, defect-free code-
base accelerates the schedule like nothing else ever does.
• Add software testers. In most teams, the developers outnumber the testers. In
projects with too few testers, the one or two testers cannot afford to scale up
with the team, and they are frequently reduced to performing testing that has
little added value. Such testing is repetitive, does not vary with the team size
or the system’s growing complexity, and often treats the system as a black box.
This does not mean that good testing does not take place, but rather that the
bulk of the testing is shifted onto developers. Changing the ratio of testers to
developers, such as 1:1 or even 2:1 (in favor of testers), allows the developers to
spend less time testing and more time adding direct value to the project.
• Invest in infrastructure. All software systems require common utilities in the
form of security, message queues and message bus, hosting, event publishing,
logging, instrumentation, diagnostics, and profiling, as well as regression test-
ing and test automation. Modern software systems need configuration manage-
ment, deployment scripts, a build process, daily builds, and smoke tests (often
lumped under DevOps). Instead of having every developer write his or her own
unique infrastructure, you should invest in building (and maintaining) a frame-
work for the entire team that accomplishes most or all of the items listed here.
This focuses the developers on business-related coding tasks, provides great
economy of scale, makes it easier to onboard new developers, reduces stress and
friction, and decreases the time it takes to develop the system.
• Improve development skills. Today’s software environments are characterized
by a very high rate of change. This rate of change exceeds many developers’
ability to keep up with the latest language, tools, frameworks, cloud platforms,
and other innovations. Even the best developers are perpetually coming to terms
with technology, and they spend an inordinate amount of time stumbling, figur-
ing things out in a nonstructured, haphazard way. Even worse, some developers
are so overwhelmed that they resort to copy-and-paste of code from the web,
without any real understanding of the short- or long-term implications (includ-
ing legal ones) of their actions. To ameliorate this problem, you should dedicate
the time and resources to train developers on the technology, methodology, and
tools at hand. Having competent developers will accelerate the time it takes to
develop any software.
• Improve the process. Most development environments suffer from a deficient
process. They go through the motions for the sake of doing the process, but lack
any real understanding or appreciation of the reasoning behind the activities.
There are no real benefits from these hollow activities, and they often make
things worse, in a cargo-cult culture1 manner. Volumes have been written about
software development processes. Educate yourself on the battle-proven best
practices and devise an improvement plan that will address the quality, sched-
ule, and budget issues. Sort the best practices in the improvement plan by effect
and ease of introduction, and proactively address the reasons why they were
absent in the first place. Write standard operating procedures, have the team
(and yourself) follow them, and enforce them if necessary. Over time, this will
make projects more repeatable and able
to deliver on the set schedule.
• Adopt and employ standards. A comprehensive coding standard addresses
naming conventions and style, coding practices, project settings and structure,
framework-specific guidelines, your own guidelines, your team’s dos and don’ts,
and known pitfalls. The standard helps to enforce development best practices
and to avoid mistakes, elevating novices to the level of veterans. It makes the
code uniform and eases any issues created when one developer works on anoth-
er’s code. By complying with the standard, developers increase the chances of
success and decrease the time it would otherwise take to develop the system.
• Provide access to external experts. Most teams will not have world-class experts
as members. The team’s job is to understand the business and deliver the sys-
tem, not to be very good in security, hosting, UX, cloud, AI, BI, Big Data, or
database architecture. Reinventing the wheel is very time-consuming and is
never as good as accessing readily available, proven knowledge (recall the 2%
problem from Chapter 2). It is far better and faster to defer to external experts.
Use these experts at specific places as required and avoid costly mistakes.
1. https://en.wikipedia.org/wiki/Cargo_cult
• Engage in peer reviews. The best debugger is the human eye. Developers often
detect problems in each other’s code much faster than it takes to diagnose and
eliminate the problems once the code is part of the system. This is also true
when it comes to defects in the requirements or in the design and test plan of
each of the services in the system. The team should review all of these to ensure
the highest-quality codebase.
These software engineering best practices will accelerate the project as a whole, irre-
spective of specific activities or the project network itself. They are effective in any
project, in any environment, and with any technology. While it might appear costly
to improve the project this way, it can very well end up costing less. The reduction in
the time it takes to develop the system pays for the cost of the improvements.
SCHEDULE COMPRESSION
The issue with the items in the previous list of schedule acceleration techniques
is that none is a quick fix for the schedule; all of them take time to be effective.
However, you can do two things to immediately accelerate the schedule—either
work with better resources or find ways of working in parallel. By employing these
techniques you will compress the schedule of the project. Such schedule compres-
sion does not mean doing the same work faster. Schedule compression means
accomplishing the same objectives faster, often by doing more work to finish the
task or the project sooner. You can use these two compression techniques in com-
bination with each other or in isolation, on parts of the project, on the project as a
whole, or on individual activities. Both compression techniques end up increasing
the direct cost (defined later) of the project while reducing the schedule.
WORKING WITH BETTER RESOURCES
Better resources do not necessarily code faster. Often, junior developers code much faster than senior developers.
Senior developers spend as little of their time as possible coding, instead spend-
ing the bulk of their time designing the code module, the interactions, and the
approaches they intend to use for testing. Senior developers write testing rigs,
simulators, and emulators for the components they are working on and for the ser-
vices they consume. They document their work, they contemplate the implications
of each coding decision, and they look at maintainability and extensibility of their
services, as well as other aspects such as security. Therefore, while per unit of time
such senior developers code more slowly than junior developers do, they complete
the task more quickly. As you might suspect, senior developers are in high demand
and command higher compensation than do junior developers. You should assign
these better resources only to critical activities since leveraging them outside the
critical path will not alter the schedule.
WORKING IN PARALLEL
In general, whenever you take a sequential set of activities and find ways of per-
forming these activities in parallel, you accelerate the schedule. There are two
possible ways of working in parallel. The first is by extracting internal phases of
an activity and moving them elsewhere in the project. The second way is by remov-
ing dependencies between activities so that you could work in parallel on these
activities (assigning multiple people to the same activity at the same time does not
work, as explained in Chapter 7).
Splitting Activities
Instead of performing the internal phases of an activity sequentially, you can split
up the activity. You schedule some of the less-dependent phases in parallel to
other activities in the project, either before or after the activity. Good candidates
for internal phases to extract upstream in the project (i.e., prior to the rest of the
activity) include the detailed design, documentation, emulators, service test plan,
service test harness, API design, UI design, and so on. Candidates for internal
phases to move downstream in the project include integration with other services,
unit testing, and repeated documentation. Splitting an activity reduces the time it
occupies on the critical path and shortens the project.
Removing Dependencies
Instead of working sequentially on dependent activities, you can look for ways to
reduce or even eliminate dependencies between activities and work on the activ-
ities in parallel. If the project has activity A that depends on activity B, which in
turn depends on activity C, the duration of the project would be the sum of the
durations of these three activities. However, if you could remove the dependency
between A and B, then you could work on A in parallel to B and C and compress
the schedule accordingly.
• Contract design. By having a separate design activity for a service contract, you
can provide the interface or the contract to its consumers and then start work-
ing on those before the service they depend upon is completed. Providing the
contract may not remove the dependency completely, but it could enable some
level of parallel work. The same goes for the design of UI, messages, APIs, or
protocols between subsystems or even systems.
• Emulators development. Given the contract design, you could write a simple
service that emulates the real service. Such implementation should be very sim-
ple (always returning the same results and without errors) and could further
remove the dependencies (see the sketch after this list).
• Simulators development. Instead of a mere emulator, you could develop a com-
plete simulator to a service or services. The simulator could maintain state,
inject errors, and be indistinguishable from the real service. Sometimes writ-
ing a good simulator can be more difficult than constructing the real service.
However, the simulator does remove the dependency between the service and
its clients, allowing a high degree of parallel work.
• Repeated integration and testing. Even with a great simulator for a service,
a client developed against only that simulator should be a cause for concern.
Once the real service is completed, you must repeat the integration and testing
between that service and all clients that were developed against the simulator.
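For example, a minimal emulator sketch against a hypothetical contract (the interface name and canned value are illustrative, not from the book):

```python
from abc import ABC, abstractmethod

class IInventoryAccess(ABC):
    """Hypothetical contract shared by the real service and its emulator."""
    @abstractmethod
    def get_stock_level(self, product_id: str) -> int: ...

class InventoryAccessEmulator(IInventoryAccess):
    """Trivial emulator: always succeeds and returns the same canned result,
    so client work can proceed before the real service exists."""
    def get_stock_level(self, product_id: str) -> int:
        return 42    # canned value, no errors, no state

# A client can be developed and unit-tested against the emulator:
inventory = InventoryAccessEmulator()
print(inventory.get_stock_level("SKU-1001"))   # always 42
```

Once the real service is available, the clients are simply repointed at it and the integration and testing are repeated, as noted above.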
Consider the chart in Figure 9-1, which exhibits three pulses. In the original plan,
all three were done sequentially due to the dependencies between the outputs of
each pulse as the inputs to the next one. If you can find some way of removing
those dependencies, you can work on one or two of the pulses in parallel to the
other one, significantly compressing the schedule.
[Figure 9-1: A staffing distribution chart with three pulses — core team and other resources plotted against date]
The team at hand may be incapable of parallel work for a variety of reasons (lack
of an architect, lack of senior developers, or an inadequate team size), forcing you
to resort to expensive high-grade, external talent. Even if you can afford the total
cost of the project, the parallel work will increase the cash flow rate and may make
the project unaffordable. In short, parallel work is not free.
That said, parallel work will reduce the overall time to market. When deciding on
pursuing a compressed parallel option, carefully weigh the incurred risks and cost
of parallel execution with the expected reduction in the schedule.
[Figure 9-2: A misleading time–cost curve in which cost decreases continuously as the project is given more time]
No amount of money can deliver a 10-man-year project in a month (or, for that matter, in a year).
There is a natural limit to all compression efforts. Similarly, the time–cost curve of
Figure 9-2 indicates that given more time, the cost of the project goes down, while
(as discussed in Chapter 7) giving projects more time than is required actually
drives up their cost.
While the time–cost curve of Figure 9-2 is incorrect, it is possible to discuss points
on the time–cost curve that are present across all projects. These points are the result
of a few classic planning assumptions. Figure 9-3 shows the actual time–cost curve.
[Plot: direct cost versus time, marking full compression (minimum duration), the compression region, the normal solution, and the uneconomical region]
Figure 9-3 Actual time–cost curve [Adopted and modified from James M. Antill and
Ronald W. Woodhead, Critical Path in Construction Practice, 4th ed. (Wiley, 1990).]
Compressed Solutions
You can compress the normal solution by using some or all of the compression
techniques described earlier in this chapter. While all of the resulting compressed
solutions are of shorter duration, they also cost more, likely in nonlinear ways.
Obviously, you should focus your compression effort only on activities on the critical
path, because compressing noncritical activities does nothing for the schedule. Each
compressed solution is to the left of the normal solution on the time–cost curve.
DISCRETE MODELING
The actual time–cost curve shown in Figure 9-3 offers infinite points between
the normal solution and the minimum duration solution. Yet no one has the time
to design an infinite number of project solutions, nor is there any need to do so.
Instead, the architect and the project manager must provide management with
one or two practical points in between the normal solution and the minimum
duration solution. These options represent some reasonable trade of time for cost
from which management can choose and are always the result of some network
compression. As a result, the curve you actually produce during project design is
a discrete model, as shown in Figure 9-4. While the time–cost curve of Figure 9-4
has far fewer points than Figure 9-3, it has enough information to discern correctly
how the project behaves.
Figure 9-4 Discrete time–cost curve [Adopted and modified from James M. Antill and
Ronald W. Woodhead, Critical Path in Construction Practice, 4th ed. (Wiley, 1990).]
Suppose schedule is the utmost priority, and the manager does not mind spending
whatever it takes to meet the commitment. The manager may think that it is possi-
ble to throw money and people on the project to push the team toward a deadline,
even though no amount of money can deliver below minimum duration.
It is also just as common to find managers with a constrained budget but more
amenable schedule. Such a manager may attempt to cut the cost by subcritically
staffing the project or by not giving the project the required resources. Doing so
pushes the project right of the normal solution into the uneconomical zone, again
causing it to cost much more.
PROJECT FEASIBILITY
The time–cost curve shows a paramount aspect of the project: feasibility. Project
design solutions of time and cost representing points at or above the curve are
doable. For example, consider the point A in Figure 9-5. The A solution calls for T2
time and C1 cost. While A is a feasible solution, it is suboptimal. If T2 is an accept-
able deadline, then the project could also be delivered for C2 cost, the value of the
time–cost curve at the time of T2. Because A is above the curve, it follows that C2
< C1. Conversely, if the cost of C1 is acceptable, then for the same cost it is also
possible to deliver the project in a time of T1, the value of the time–cost curve at
the cost of C1. Since A is to the right of the curve, it follows that T1 < T2.
The points on the time–cost curve simply represent the most optimal trade of
time for cost. The time–cost curve is optimal because it is always better to deliver
the project faster (for the same cost) or at lower cost (for the same deadline). You
could do worse, but not better than the time–cost curve. This also implies that
points under the time–cost curve are impossible. For example, consider the point
B in Figure 9-6. The B solution calls for T3 time and C4 cost. However, to deliver
the project at time of T3 would require at least the cost of C3. Since B is below the
time–cost curve, it follows that C3 > C4. If C4 is all you can afford, then the proj-
ect would require at least the time of T4. Since B is left of the time–cost curve, it
follows that T4 > T3.
[Figure 9-5: Point A above the time–cost curve — feasible but suboptimal, with C2 < C1 at time T2 and T1 < T2 at cost C1]
[Figure 9-6: Point B below the time–cost curve — infeasible, since time T3 requires at least cost C3 and cost C4 requires at least time T4]
[Figure: The time–cost curve separating feasible solutions (on or above the curve) from the death zone (below it)]
I cannot stress enough how important it is not to design a project in the death
zone. Projects in the death zone have failed before anyone writes the first line of
code. The key to success is neither architecture nor technology, but rather avoiding
picking a project in the death zone.
For each iterative attempt at a normal solution, you progressively trade more float
for resources. This trade will naturally increase the risk of the project due to the
reduced float. It also means that the true normal solution already has a consider-
able level of risk. However, the lowest staffing level required for the true normal
solution is often good enough risk-wise, because sufficient float remains to meet
the project’s commitments.
[Figure 9-8: Iterative normal attempts converging on the normal solution on the direct cost–time curve]
Accommodating Reality
When looking for the staffing level of the normal solution, you should make minor
accommodations for reality. For example, in a year-long project, if you could avoid
hiring another resource by extending the schedule by a week, then you should
probably take that trade. It is also a good idea to find ways of simplifying the
project execution or reducing its integration risk in exchange for a slight extension
of the duration or a small increase in cost. You should always prefer these accom-
modations for reality. As a result, the intermediate normal attempts may not be
aligned exactly vertically atop each other (as in Figure 9-8), but rather may drift a
little to the right or to the left.
DIRECT COST
The project’s direct cost comprises activities that add direct measurable value to the
project. These are the same explicit project activities shown in the project’s planned
earned value chart. As explained in Chapter 7, the planned earned value (and hence
the direct cost) varies over the project’s lifetime, resulting in a shallow S curve.
The direct cost of a software project typically includes the following items:
The direct cost curve of the project looks like Figure 9-3.
INDIRECT COST
The project’s indirect cost comprises activities that add indirect immeasurable
value to the project. Such activities are typically ongoing, and are not shown in the
earned value charts or the project plan.
The indirect cost of a software project typically includes the following items:
• The core team (i.e., architect, project manager, product manager) after the SDP
review
• Ongoing configuration management, daily build and daily test, or DevOps in
general
• Vacations and holidays
• Committed resources between assignments
The indirect cost of most projects is largely proportional to the duration of the proj-
ect. The longer the project takes, the higher the indirect cost. If you were to plot the
indirect cost of the project over time, you should get roughly a straight line.
It is wrong to think of indirect cost as needless overhead. The project will fail
without a dedicated architect and a project manager, yet after the SDP review, they
have no explicit activities in the plan.
Given the defi nitions of direct and indirect costs, Figure 9-9 shows the two ele-
ments of cost and the resulting total cost of the project.
The indirect cost is shown as a straight line, and the direct cost curve is the
same curve shown in the previous figures. Both the direct and indirect curves in
Figure 9-9 are the product of discrete solutions, and the total cost curve is the sum
of the direct and indirect costs at each of these points.
Figure 9-9 Project direct, indirect, and total cost curves [Adopted and modified from
James M. Antill and Ronald W. Woodhead, Critical Path in Construction Practice, 4th ed.
(Wiley, 1990).]
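To make the arithmetic concrete, here is a minimal sketch with hypothetical numbers: a handful of discrete direct-cost solutions plus a constant indirect cost rate, showing how the minimum total cost can land left of the normal solution:

```python
# Total cost = direct cost + indirect cost, evaluated over a few discrete solutions.
# All numbers are hypothetical; the 12-month option plays the role of the normal solution.
direct_cost = {9: 70, 10: 62, 11: 58, 12: 56, 14: 60}   # duration (months) -> direct cost (man-months)
indirect_cost_per_month = 2.5                            # e.g., core team plus ongoing DevOps

total_cost = {t: c + indirect_cost_per_month * t for t, c in direct_cost.items()}
best = min(total_cost, key=total_cost.get)

for t in sorted(total_cost):
    print(f"{t:2d} months: direct {direct_cost[t]}, total {total_cost[t]:.1f}")
print(f"Minimum total cost at {best} months, left of the 12-month normal solution")
```

In this sketch the normal solution minimizes direct cost at 12 months, yet the indirect cost pulls the minimum total cost to the compressed 11-month option, which is exactly the behavior discussed below.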
In the early 1960s, General Electric developed the GE-225 computer, the world’s
first commercial transistor-based computer.a The GE-225 project was a hotspot
of innovation. It introduced the world's first time-sharing operating system
(which influenced the design of all modern operating systems), direct memory
access, and the programming language BASIC.
In 1960, General Electric published a paper detailing its insights on the relation-
ship between time and cost from the GE-225 project.b That paper contained
the first actual time–cost curve (similar to Figure 9-4) and the breakdown into
direct and indirect costs (as shown in Figure 9-9). These ideas were quickly
adopted as-is by the physical construction industry.c James Kelley, the co-author
of the first paper on the critical path method, consulted on the GE-225 project.
Incidentally, the architect of the GE-225 was Arnold Spielbergd (the father of
the movie director Steven Spielberg). In 2006, the IEEE society recognized
Spielberg with the prestigious Computer Pioneer Award.
a. https://en.wikipedia.org/wiki/GE-200_series
b. Børge M. Christensen, “GE 225 and CPM for Precise Project Planning” (General Electric Company
Computer Department, December 1960).
c. James O’Brien, Scheduling Handbook (McGraw-Hill, 1969); and James M. Antill and Ronald W.
Woodhead, Critical Path in Construction Practice, 2nd ed. (Wiley, 1970).
d. https://www.ge.com/reports/jurassic-hardware-steven-spielbergs-father-was-a-computing-pioneer/
As an example, consider Figure 9-10. On the direct cost curve the normal solution
is clearly the point of minimum cost. However, on the total cost curve, the point of
minimum cost is the first compressed solution to the left of the normal solution. In
this case, compressing the project has actually reduced the cost of the project. This
makes the point of minimum total cost of the project the optimal project design
option from a time–cost perspective because it completes the project faster than
normal and at a lower total cost.
Figure 9-10 A high indirect cost shifts minimum total cost left of the normal solution
In Figure 9-10, the shift to the left of the point of minimum total cost is accentu-
ated because of the discrete nature of the charts. That said, there is always a shift
to the left, even with a continuous chart such as Figure 9-11, which has a lower
slope of the indirect cost line than that shown in Figure 9-10.
Because you will only ever develop a small set of project design solutions, the
time–cost curve you build will always be a discrete model of the project. With
the solutions you have (and the level of indirect cost), the normal solution may
indeed be the point of minimum total cost, as in Figure 9-9. However, that out-
come is misleading and is simply an artifact of missing some unknown design
solution slightly left of normal.
[Figure 9-11: A continuous total cost curve with a lower indirect cost slope — the minimum total cost still falls left of the normal solution]
Figure 9-12 is the same as Figure 9-9 except that it adds that unknown point
immediately to the left of the normal solution to illustrate the shift to the left of
the point of minimum total cost.
The problem with this situation is that you have no idea how to make that solu-
tion: You do not know what combination of resources and compression yields this
point. While such a solution always exists in theory, in practice for most projects
you can equate the total cost of the normal solution with the minimum total cost
of the project. The difference between the total cost of the normal solution and
the point of true minimum total cost often does not justify the effort required to
find that exact point.
[Figure 9-12: Total, direct, and indirect cost curves with an added solution just left of the normal solution, marking the true minimum total cost]
In practice, you derive one cost element by subtracting the other from the total
cost. For each project design solution, you first staff the project, then draw the planned earned value chart and the planned
staffing distribution chart. Next, you calculate the area under the staffing distri-
bution chart for the total cost, and you also sum up the effort across all direct cost
activities (the ones you show on the earned value chart). The indirect cost is simply
the difference between the two.
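A minimal sketch of that bookkeeping, with hypothetical staffing and activity figures:

```python
# Deriving the indirect cost as total cost minus direct cost (hypothetical figures).
staffing_per_month = [2, 2, 4, 6, 6, 6, 4, 2]   # planned staffing distribution (headcount per month)
direct_activity_effort = [2, 4, 6, 6, 4, 2]     # man-months of the explicit (earned value) activities

total_cost = sum(staffing_per_month)             # area under the staffing chart: 32 man-months
direct_cost = sum(direct_activity_effort)        # effort across all direct cost activities: 24
indirect_cost = total_cost - direct_cost         # everything else: 8 man-months

print(f"Total {total_cost}, direct {direct_cost}, indirect {indirect_cost} man-months")
```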
Graphically, Figure 9-13 shows a typical breakdown of the cost elements under the
staffing distribution chart (also refer to Figure 7-8). In the front end of the project,
only the core team is engaged, and much of that work involves indirect cost. The
rest of the effort expended by the core team does have some direct value, such
as designing the system and the project. However, past the SDP review, the core
team turns into pure indirect cost. After the SDP review, the project has additional
ongoing indirect costs such as DevOps, daily build, and daily tests. The rest of the
staffing is a direct cost, such as developers building the system.
[Figure 9-13: Typical breakdown of direct and indirect cost elements under the staffing distribution chart]
Unfortunately, many managers are simply unaware that shorter projects will cost
less, which leads to a classic mistake. When faced with a tight budget, the manager
will try to reduce the cost by throttling the resources (i.e., either the quality or the
quantity of the resources). This will make the project longer, so that it ends up
costing much more.
FIXED COST
Software projects have yet another element of cost that is fixed with time. Fixed
cost might include computer hardware and software licenses. The fixed cost of the
project is expressed as a constant shift up of the indirect cost line (Figure 9-14).
[Figure 9-14: The fixed cost as a constant upward shift of the indirect cost line]
Because the fixed cost merely shifts the total time–cost curve up, it adds nothing to
the decision-making process, as it affects all options almost equally (it may change
slightly with the team size). In most decent-size software projects, the fixed cost
will be approximately 1–2% of the total cost, so it is typically negligible.
NETWORK COMPRESSION
Compressing the project will change the project network. Compression should be
an iterative process in which you constantly look for the best next step. You start
compressing the project from its normal solution. The normal solution should
respond well to compression because it is at the minimum of the time–cost curve.
As noted, initially the compression may even end up paying for itself. In addition,
immediately to the left of the normal solution, the time–cost curve is at its flattest.
This means that your first one or two compression points will provide the best
return on investment (ROI) of the compression cost. However, as you compress
the project further, you will start climbing up the time–cost curve, eventually
experiencing diminishing returns on the cost of compression. The project will
offer less and less reduction in schedule while incurring ever higher cost, as if the
project resists more compression. When compressing the project as a whole, you
should attempt to compound the effect by compressing a previously compressed
solution, not just trying a new compression technique on the baseline normal
solution.
COMPRESSION FLOW
You should avoid compressing activities that will not respond well to compression
regardless of the cost spent on them (such as architecture) or activities that are
already fully compressed. Since even individual activities have their own time–cost
curve, initially an activity may be easy to compress, but subsequent compression
will require additional cost to climb the activity’s own time–cost curve. At some
point the activity will be impossible to compress any further. For this reason, it
is better, in general, to compress other activities than to repeatedly compress the
same activity.
Ideally, you should compress only activities on the critical path. There is hardly
ever any point in compressing activities outside the critical path because doing
so will just drive the cost up without shortening the schedule. At the same time,
you should not blindly compress all activities on the critical path. The best can-
didates for compression are activities that offer the best ROI for the compression.
Compression of these activities will yield the most reduction in schedule for the
least additional cost. The duration of the activity also matters, because all com-
pression techniques are disruptive and will increase the risk and complexity of the
project. It is better to incur these effects on a large critical activity and gain the
most reduction in schedule. It is also generally advisable to split large activities
into smaller ones—a nice side effect of compressing a large activity.
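As a rough sketch of picking the next compression target by return on investment (the candidate activities and numbers are hypothetical):

```python
# Choose the next compression target on the critical path by ROI:
# the most days saved per unit of extra direct cost.
critical_candidates = [
    ("Manager A", 5, 15),          # (activity, days saved if compressed, extra cost)
    ("Engine B", 10, 12),
    ("ResourceAccess C", 3, 2),
]

def compression_roi(candidate):
    _, days_saved, extra_cost = candidate
    return days_saved / extra_cost

best = max(critical_candidates, key=compression_roi)
print(f"Compress {best[0]} first: {best[1]} days saved for {best[2]} units of extra cost")
```

In practice you would also weigh the activity's duration and the disruption each technique causes, as discussed above, not just the raw ratio.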
As you compress the critical path, you will shorten it. As a result, another path
may now be the longest in the project network; that is, a new critical path emerges.
You should constantly evaluate the project network to detect the emergence of the
new critical path and compress that path instead of the old critical path. If multiple
critical paths arise, you must find ways of compressing these concurrently and by
identical amounts. For example, if an activity or a set of activities caps all critical
paths, then the next compression iteration would target them.
You can keep repeatedly compressing the project until one of the following con-
ditions is met:
• You have met the desired deadline, so there is no point in designing even more
expensive and shorter projects.
• The calculated cost of the project exceeds the budget set for the project.
• The compressed project network is so complex that it is unlikely any project
manager or team could deliver on it.
• The duration of the compressed solution is more than 30% (or even 25%)
shorter than that of the normal solution. As noted earlier, there is a natural
limit to how much in practice you can compress any project.
• The compressed solutions are too risky or risk is decreasing slightly because you
are past the point of maximum risk. This requires the ability to quantify the
risk of the project design solutions (discussed in the next chapter).
• You have run out of ideas or options for compressing the project any further.
There is nothing more to compress.
• Too many critical paths have emerged or all network paths have become critical.
• You can find ways of compressing activities only outside the critical path. The
compressed solution is at the same duration as the previous one but is more
expensive. You have reached the full compression point of the project.
You do not need many project design solutions in terms of time and cost. Often, it takes only two or three points left of the normal solu-
tion to understand how the project behaves. The more complex or expensive the
project, the more you should invest in understanding the project, because even
minute mistakes have drastic implications.
10
RISK
As demonstrated in Chapter 9, every project always has several design options that
offer different combinations of time and cost. Some of these options will likely
be more aggressive or riskier than other options. In essence, each project design
option is a point in a three-dimensional space whose axes are time, cost, and risk.
Decision makers should be able to take the risk into account when choosing a proj-
ect design option—in fact, they must be able to do so. When you design a project,
you must be able to quantify the risk of the options.
Most people recognize the risk axis but tend to ignore it since they cannot measure
or quantify it. This invariably leads to poor results caused by applying a two-di-
mensional model (time and cost) to a three-dimensional problem (time, cost, and
risk). This chapter explores how to measure risk objectively and easily using a few
modeling techniques. You will see how risk interacts with time and cost, how to
reduce the risk of the project, and how to find the optimal design point for the
project.
CHOOSING OPTIONS
The ultimate objective of risk modeling is to weigh project design options in light
of risk as well as time and cost so as to evaluate the feasibility of these options. In
general, risk is the best criterion for choosing between options.
For example, consider two options for the same project: The first option calls
for 12 months and 6 developers, and the second option calls for 18 months and
4 developers. If this is all that you know about the two options, most people will
choose the first option since both options end up costing the same (6 man-years)
and the first option delivers much faster (provided you have the cash flow to afford
it). Now suppose you know the first option has only a 15% chance of success
and the second option has a 70% chance of success. Which option would you
choose? As an even more extreme example, suppose the second option calls for
24 months and 6 developers with the same 70% chance of success. Although the
second option now costs twice as much and takes twice as long, most people will
intuitively choose that option. This is a simple demonstration that often people
choose an option based on risk, rather than based on time and cost.
PROSPECT THEORY
In 1979, the psychologists Daniel Kahneman and Amos Tversky developed
prospect theory,a one of the most important concepts in behavioral psychol-
ogy on decision making. Kahneman and Tversky discovered that people make
decisions based on the risk involved as opposed to the expected gain. Given a
measurable, identical loss or gain, most people suffer disproportionately more
from the loss than they would enjoy from the same gain. As a result, people seek
to reduce risk as opposed to maximizing gains, even when it would logically
be better to take the risk. This observation went against conventional wisdom
that held that people act rationally to maximize their gains based on expected
value. Prospect theory underscores the importance of adding risk to time and
cost in the decision-making process. In 2002, Daniel Kahneman won the Nobel
Memorial Prize in Economics for his work developing prospect theory.
a. Daniel Kahneman and Amos Tversky, “Prospect Theory: An Analysis of Decision under Risk,”
Econometrica, 47, no. 2 (March 1979): 263–292.
TIME–RISK CURVE
As you compress the project, the shorter project design solutions carry with them
an increased level of risk, and the rate of increase is likely nonlinear. This is why
the dashed line in Figure 10-1 curves up toward the vertical risk axis and relaxes
downward with time. However, this intuitive dashed line is wrong. In reality, a
time–risk curve is a logistic function of some kind, the solid line in Figure 10-1.
[Figure 10-1: the intuitive nonlinear risk curve (dashed) versus the logistic function risk curve (solid), plotted as risk over time]
The logistic function is a superior model because it more closely captures the general
behavior of risk in complex systems. For example, if I were to plot the risk of me
burning dinner tonight due to compressing the normal preparation time, the risk
curve would look like the solid line in Figure 10-1. Each compression technique—
such as setting the oven temperature too high, placing the tray too close to the heat-
ing element, choosing easier-to-cook but more flammable food, not preheating the
oven, and so on—increases the risk of burning dinner. As shown by the solid line,
the risk of a burnt dinner due to the cumulative compression at some point is almost
maximized and even flattens out, because dinner is certain to burn. Similarly, if I
decide not to even enter the kitchen, then the risk would drop precipitously. If the
risk was dictated by the dashed line, I would always have some chance of not burn-
ing dinner since I could always keep increasing the risk by compressing it further.
Note that the logistic function has a tipping point where the risk drastically increases
(the analog to the decision to enter the kitchen). The dashed line, by contrast,
keeps increasing gradually and does not have a noticeable tipping point.
[Figure 10-2: direct cost and risk curves over time, with the normal solution marked]
The vertical dashed line in Figure 10-2 indicates the duration of the normal solu-
tion as well as the minimum direct cost solution for the project. Note that the nor-
mal solution usually trades some amount of float to reduce staffing. The reduction
in float manifests in an elevated level of risk.
To the left of the normal solution are the shorter, compressed solutions. The com-
pressed solutions are also riskier, so the risk curve increases to the left of the nor-
mal solution. The risk rises and then levels off (as is the case with the ideal logistic
function). However, unlike the ideal behavior, the actual risk curve gets maximized
before the point of minimum duration and even drops a bit, giving it a concave
shape. While such behavior is counterintuitive, it occurs because in general, shorter
projects are somewhat safer, a phenomenon I call the da Vinci effect. When investi-
gating the tensile strength of wires, Leonardo da Vinci found that shorter wires are
stronger than longer wires (because the probability of a defect is proportional
to the length of the wire).1 By analogy, the same is true for projects. To illustrate the
point, consider two possible ways of delivering a 10-man-year project: 1 person for
10 years or 3650 people for 1 day. Assuming both are viable projects (that the people
are available, that you have the time, and so on), the 1-day project is much safer than
the 10-year project. The likelihood of something bad happening in a single day is
1. William B. Parsons, Engineers and Engineering in the Renaissance (Cambridge, MA: MIT Press,
1939); Jay R. Lund and Joseph P. Byrne, Leonardo da Vinci’s Tensile Strength Tests: Implications
for the Discovery of Engineering Mechanics (Department of Civil and Environmental Engineering,
University of California, Davis, July 2000).
open for debate, but it is a near certainty with 10 years. I provide a more quantified
explanation for this behavior later in this chapter.
To the right of the normal solution, the risk goes down, at least initially. For exam-
ple, giving an extra week to a one-year project will reduce the risk of not meeting
that commitment. However, if you keep giving the project more time, at some point
Parkinson’s law will take effect and drastically increase the risk. So, to the right of
the normal solution, the risk curve goes down, becomes minimized at some value
greater than zero, and then starts climbing again, giving it a convex shape.
RISK MODELING
This chapter presents my techniques for modeling and quantifying risk. These
models complement each other in how they measure the risk. You often need more
than one model to help you choose between options—no model is ever perfect.
However, each of the risk models should yield comparable results.
Risk values are always relative. For example, jumping off a fast-moving train is
risky. However, if that train is about to go over a cliff, jumping is the most sensible
thing to do. Risk has no absolute value, so you can evaluate it only in comparison
with other alternatives. You should therefore talk about a “riskier” project as
opposed to a “risky” project. Similarly, nothing is really safe. The only safe way of
doing any project is not doing it. You should therefore talk about a “safer” project
rather than a “safe” project.
NORMALIZING RISK
The whole point of evaluating risk is to be able to compare options and projects,
which requires comparing numbers. The first decision I made when creating the
models was to normalize risk to the numerical range of 0 to 1.
A risk value of 0 does not mean that the project is risk-free. A risk value of 0 means
that you have minimized the risk of the project. Similarly, a risk value of 1 does not
mean that the project is guaranteed to fail, but simply that you have maximized
the risk of the project.
The risk value also does not indicate a probability of success. With probability, a
value of 1 means a certainty, and a value of 0 means an impossibility. A project with
a risk value of 1 can still deliver, and a project with a risk value of 0 can still fail.
Both of these options are valid project design options for building the same sys-
tem. The only information available in Figure 10-3 is the color-coded floats of
the two networks. Now, ask yourself: With which project would you rather be
involved? Everyone to whom I have shown these two charts preferred the greener
option on the right-hand side of Figure 10-3. What is interesting is that no one has
ever asked what the difference in duration and cost between these two options
was. Even when I volunteered that the greener option was both 30% longer and
more expensive, that information did not affect the preference. No one chose the
low-float, high-stress, and high-risk project shown on the left in Figure 10-3.
Design Risk
Your project faces multiple types of risk. There is staffing risk (Will the project
actually get the level of staffing it requires?). There is duration risk (Will the project be
allowed the duration it requires?). There is technological risk (Will the technology
be able to deliver?). There are human factors (Is the team technically competent,
and can they work together?). There is always an execution risk (Can the project
manager correctly execute the project plan?).
These types of risk are independent of the kind of risk you assess using floats. Any
project design solution always assumes that the organization or the team will have
what it takes to deliver on the planned schedule and cost and that the project will
receive the required time and resources. The remaining type of risk pertains to
how well the project will handle the unforeseen. I call this kind of risk design risk.
Design risk assesses the project’s sensitivity to schedule slips of activities and to
your ability to meet your commitments. Design risk therefore quantifies the fra-
gility of the project or the degree to which the project resembles a house of cards.
Using floats to measure risk is actually quantifying that design risk.
CRITICALITY RISK
The criticality risk model attempts to quantify the intuitive impression of risk
when you evaluate the options of Figure 10-3. For this risk model you classify
activities in the project into four risk categories, from most to least risk:
• Critical activities. The critical activities are obviously the riskiest activities because
any delay with a critical activity always causes schedule and cost overruns.
• High-risk activities. Low-float, near-critical activities are also risky because any
delay in them is likely to cause schedule and cost overruns.
• Medium-risk activities. Activities with a medium level of float have a medium level
of risk and can sustain some delays.
• Low-risk activities. Activities with high float are the least risky and can sustain
even large delays without derailing the project.
You should exclude activities of zero duration (such as milestones and dummies)
from this analysis because they add nothing to the risk of the project. Moreover,
unlike real activities, they are simply artifacts of the project network.
Chapter 8 showed how to use color coding to classify activities based on their
float. You can use the same technique for evaluating the sensitivity or fragility of
activities by color coding the four risk categories. With the color coding in place,
assign a weight to the criticality of each activity. The weight acts as a risk factor.
You are, of course, at liberty to choose any weights that signify the difference in
risk. One possible allocation of weights is shown in Table 10-1.
Table 10-1 Possible allocation of criticality weights
Activity Color          Weight
Green (high float)      1
Yellow (medium float)   2
Red (low float)         3
Black (critical)        4
The criticality risk is the weighted average of the activities' criticality, normalized by the maximum (critical) weight:

Risk = (WC*NC + WR*NR + WY*NY + WG*NG) / (WC*N)

where NC, NR, NY, and NG are the numbers of critical, red, yellow, and green activities; N is the total number of real activities in the network; and WC, WR, WY, and WG are the corresponding weights. Substituting the weights from Table 10-1, the criticality risk formula is:

Risk = (4*NC + 3*NR + 2*NY + 1*NG) / (4*N)

[Figure 10-4: a sample network diagram with color-coded floats]
The maximum value of the criticality risk is 1.0; it occurs when all activities in the network are critical, so that NC equals N while NR, NY, and NG are zero:

Risk = (WC*N + WR*0 + WY*0 + WG*0) / (WC*N) = WC/WC = 1.0
The minimum value of the criticality risk is WG over WC; it occurs when all activ-
ities in the network are green. In such a network, NC, NR, and NY are zero, and NG
equals N:
Risk = (WC*0 + WR*0 + WY*0 + WG*N) / (WC*N) = WG/WC
Using the weights from Table 10-1, the minimum value of risk is 0.25. The critical-
ity risk, therefore, can never be zero: A weighted average such as this will always
have a minimum value greater than zero as long as the weights themselves are
greater than zero. This is not necessarily a bad thing, as the project risk should
never be zero. The formula implies that the lowest range of risk values is unattainable,
which is reasonable since anything worth doing requires taking some risk.
Choosing Weights
As long as you can rationalize your choice of weights, the criticality risk model
will likely work. For example, the set of weights [21, 22, 23, 24] is a poor choice
because 21 is only 12.5% smaller than 24; thus, this set does not emphasize the risk
of the green versus the critical activities. Furthermore, the minimum risk using
these weights (WG/WC) is 0.88, which is obviously too high. I find the weight set [1,
2, 3, 4] to be as good as any other sensible choice.
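To make the arithmetic concrete, here is a minimal Python sketch of the criticality risk calculation with the [1, 2, 3, 4] weights; it is not part of the book's support files, and the activity counts in the example calls are hypothetical.

# Criticality risk: weighted average of activity criticality, normalized to 0..1.
def criticality_risk(n_critical, n_red, n_yellow, n_green, weights=(4, 3, 2, 1)):
    # weights are ordered (critical, red, yellow, green); counts exclude
    # zero-duration milestones and dummy activities.
    w_c, w_r, w_y, w_g = weights
    n = n_critical + n_red + n_yellow + n_green
    if n == 0:
        raise ValueError("the network has no real activities")
    return (w_c * n_critical + w_r * n_red +
            w_y * n_yellow + w_g * n_green) / (w_c * n)

# Hypothetical network: 6 critical, 3 red, 4 yellow, and 3 green activities.
print(round(criticality_risk(6, 3, 4, 3), 2))   # 0.69
print(criticality_risk(0, 0, 0, 10))            # all green    -> 0.25 (minimum)
print(criticality_risk(10, 0, 0, 0))            # all critical -> 1.0 (maximum)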
FIBONACCI RISK
The Fibonacci series is a sequence of numbers in which every item in the series
equals the sum of the previous two, with the exception that the first two values
are defined as 1. As the series progresses, the ratio between consecutive items
converges to a constant known as ϕ (approximately 1.618):

Fibi ≈ ϕ*Fibi-1
Since ancient times, ϕ has been known as the golden ratio. It is observed through-
out nature and human enterprises alike. Two famous (and quite disparate) exam-
ples based on the golden ratio are the way the invertebrate nautilus’s shell spirals
and the way markets retrace their former price levels.
Notice that the weights in Table 10-1 are similar to the beginning values of
the Fibonacci series. As an alternative to Table 10-1, you can choose any four
consecutive members from the Fibonacci series (such as [89, 144, 233, 377]) as
weights. Regardless of your choice, when you use them to evaluate the network in
Figure 10-4, the risk will always be 0.64 because the weights maintain the ratio of
ϕ. If WG is the weight of the green activities, the other weights are:
WY = ϕ*WG
WR = ϕ²*WG
WC = ϕ³*WG
Since WG appears in all elements of the numerator and the denominator, the equa-
tion can be simplified:
Risk = (ϕ³*NC + ϕ²*NR + ϕ*NY + NG) / (ϕ³*N)
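The claim that any four consecutive Fibonacci numbers produce essentially the same risk value can be checked numerically. Below is a small Python sketch using hypothetical activity counts rather than the Figure 10-4 network; the values agree to within rounding because consecutive Fibonacci ratios only approximate ϕ.

PHI = (1 + 5 ** 0.5) / 2   # the golden ratio, ~1.618

def weighted_risk(counts, weights):
    # counts and weights are both ordered (critical, red, yellow, green).
    n = sum(counts)
    return sum(w * c for w, c in zip(weights, counts)) / (weights[0] * n)

counts = (6, 3, 4, 3)                      # hypothetical network
fib = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377]

for i in (2, 6, 10):                       # three different quadruples
    w = tuple(reversed(fib[i:i + 4]))      # e.g. (377, 233, 144, 89)
    print(w, round(weighted_risk(counts, w), 3))

# Exact powers of the golden ratio give the limiting value:
print(round(weighted_risk(counts, (PHI ** 3, PHI ** 2, PHI, 1)), 3))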
ACTIVITY RISK
The criticality risk model uses broad risk categories. For example, if you define
float greater than 25 days as green, then two activities—one with 30 days of float
and the other with 60 days of float—will be placed in the same green bin and will
have the same risk value. To better account for the risk contribution of each indi-
vidual activity, I created the activity risk model. This model is far more granular
than the criticality risk model.
The activity risk formula is:

Risk = 1 − (F1 + ... + Fi + ... + FN) / (M*N) = 1 − (∑Fi) / (M*N)

where Fi is the float of activity i, N is the number of activities in the network, and M is the maximum float of any activity in the network.
As with the criticality risk, you should exclude activities of zero duration (mile-
stones and dummies) from this analysis.
Applying the activity risk formula to the network in Figure 10-4 yields:
Risk = 1 − (30 + 30 + 30 + 30 + 10 + 10 + 5 + 5 + 5 + 5) / (30*16) = 0.67
The maximum value of the activity risk approaches 1.0 when a single activity carries the maximum float M and every other activity is critical (zero float):

Risk ≈ 1 − F1 / (M*N) = 1 − M / (M*N) = 1 − 1/N ≈ 1 − 0 = 1.0
The minimum value of the activity risk is 0 when all activities in the network have
the same level of float, M:
Risk = 1 − (∑M) / (M*N) = 1 − (M*N) / (M*N) = 1 − 1 = 0
While activity risk can in theory reach zero, in practice it is unlikely that you will
encounter such a project because all projects always have some non-zero amount
of risk.
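Here is a matching Python sketch of the activity risk calculation (again, not from the book's support files). The float list reproduces the Figure 10-4 numbers quoted above: ten noncritical floats plus six critical activities with zero float, since the denominator in the text is 30*16.

def activity_risk(floats):
    # floats: the float (in days) of every real activity, including the zero
    # floats of the critical activities; milestones and dummies are excluded.
    n = len(floats)
    m = max(floats)
    if m == 0:
        raise ValueError("the network needs at least one noncritical activity")
    return 1 - sum(floats) / (m * n)

figure_10_4_floats = [30, 30, 30, 30, 10, 10, 5, 5, 5, 5] + [0] * 6
print(round(activity_risk(figure_10_4_floats), 2))   # 0.67

print(activity_risk([20] * 8))   # identical floats -> minimum risk of 0.0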
Calculation Pitfall
The activity risk model works well only when the floats of the projects are more
or less uniformly spread between the smallest float and the largest float in the
network. An outlier float value that is significantly higher than all other floats
will skew the calculation, producing an incorrectly high risk value. For exam-
ple, consider a one-year project that has a single week-long activity that can
take place anywhere between the beginning and the end of the project. Such an
activity will have almost a year’s worth of float, as illustrated in the network in
Figure 10-5.
Figure 10-5 shows the critical path (bold black) and many activities with some
color-coded level of float (Fi) below. The activity shown above the critical path
itself is short but has an enormous amount of float M.
Since M is much larger than any other Fi, the activity risk formula yields a number
approaching 1:
M >> Fi

Risk = 1 − (∑Fi) / (M*N) ≈ 1 − (Fi*N) / (M*N) = 1 − Fi/M ≈ 1 − 0 = 1.0
The next chapter demonstrates this situation and provides an easy and effective
way of detecting and adjusting the float outliers.
The activity risk model also produces an incorrectly low risk value when the
project does not have many activities and the floats of the noncritical activities
are all similar or even identical. However, except for these rare,
somewhat contrived examples, the activity risk model measures the risk correctly.
Note When the activity risk and the criticality risk differ greatly, you
should determine the root cause. Perhaps the calibration of the criticality
risk was incorrect or the activity risk is skewed because the floats are not
spread uniformly. If nothing stands out, you may want to use the Fibonacci
risk model as the arbitrating risk model.
Figure 10-6 depicts two networks, with the bottom diagram being the compressed
version of the top diagram. The compressed solution has fewer critical activities,
a shorter critical path, and more noncritical activities in parallel. When measuring
the risk of such compressed projects, the presence of more activities with float
and fewer critical activities will decrease the risk value produced by both the crit-
icality and activity risk models.
EXECUTION RISK
While the design risk of a highly parallel project may be lower than the design
risk of a less compressed solution, such a project is more challenging to execute
because of the additional dependencies and the increased number of activities that
need to be scheduled and tracked. Such a project will have demanding scheduling
constraints and require a larger team. In essence, a highly compressed project
has converted design risk into execution risk. You should measure the execution
risk as well as the design risk. A good proxy for the expected execution risk is
the complexity of the network. Chapter 12 discusses how to quantify execution
complexity.
RISK DECOMPRESSION
While compressing the project is likely to increase the risk, the opposite is also true
(up to a point): By relaxing the project, you can decrease its risk. I call this tech-
nique risk decompression. You deliberately design the project for a later delivery
date by introducing float along the critical path. Risk decompression is the best
way to reduce the project’s fragility, its sensitivity to the unforeseen.
You should decompress the project when the available solutions are too risky.
Other reasons for decompressing the project include concerns about the present
prospects based on a poor past track record, facing too many unknowns, or a vol-
atile environment that keeps changing its priorities and resources.
At the same time, you should not over-decompress. Using the risk models, you can
measure the effect of the decompression and stop when you reach your decom-
pression target (discussed later in this section). Excessive decompression will have
diminishing returns when all activities have high float. Any additional decompres-
sion beyond this point will not reduce the design risk, but will increase the overall
overestimation risk and waste time.
You can decompress any project design solution, although you typically decom-
press only the normal solution. Decompression pushes the project a bit into the
uneconomical zone (see Figure 10-2), increasing the project’s time and cost. When
you decompress a project design solution, you still design it with the original staff-
ing. Do not be tempted to consume the additional decompression float and reduce
the staff—that defeats the purpose of risk decompression in the first place.
HOW TO DECOMPRESS
A straightforward way of decompressing the project is to push the last activity or
the last event in the project down the timeline. This adds float to all prior activities
in the network. In the case of the network depicted in Figure 10-4, decompressing
activity 16 by 10 days results in a criticality risk of 0.47 and an activity risk of
0.52. Decompressing activity 16 by 30 days results in a criticality risk of 0.3 and
an activity risk of 0.36.
DECOMPRESSION TARGET
When decompressing a project, you should strive to decompress until the risk
drops to 0.5. Figure 10-7 demonstrates this point on the ideal risk curve using a
logistic function with asymptotes at 1 and 0.
[Figure 10-7: the decompression target at risk 0.5 on the ideal risk curve over time]
When the project has a very short duration, the value of risk is almost 1.0, and
the risk is maximized. At that point the risk curve is almost flat. Initially, adding
time to the project does not reduce the risk by much. With more time, at some
point the risk curve starts descending, and the more time you give the project, the
steeper the curve gets. However, with even more time, the risk curve starts leveling
off, offering less reduction in risk for additional time. The point at which the risk
curve is the steepest is the point with the best return on the decompression—that
is, the most reduction in risk for the least amount of decompression. This point
defines the risk decompression target. Since the logistic function in Figure 10-7 is a
symmetric curve between 0 and 1, the tipping point is at a risk value of exactly 0.5.
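As a numerical illustration (with made-up curve parameters, not data from the book), the following Python sketch evaluates a symmetric logistic risk curve and locates its steepest point, which falls where the risk equals 0.5:

import math

def logistic_risk(t, t0=10.0, k=1.2):
    # Ideal risk curve: close to 1 for very short durations, close to 0 for
    # very long ones; t0 is the tipping-point duration, k controls steepness.
    return 1.0 / (1.0 + math.exp(k * (t - t0)))

def steepest_duration(t_min=0.0, t_max=20.0, steps=2000):
    # Numerically find the duration where |d(risk)/dt| is largest.
    best_t, best_slope, dt = t_min, 0.0, (t_max - t_min) / steps
    for i in range(steps):
        t = t_min + i * dt
        slope = abs(logistic_risk(t + dt) - logistic_risk(t)) / dt
        if slope > best_slope:
            best_t, best_slope = t, slope
    return best_t

t_star = steepest_duration()
print(round(t_star, 1), round(logistic_risk(t_star), 2))   # ~10.0 and ~0.5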
To determine how the decompression target relates to cost, compare the actual
risk curve with the direct cost curve (Figure 10-8). The actual risk curve is con-
fined to a narrower range than the ideal risk curve and never approaches either
0 or 1, although it behaves similarly to a logistic function between its maximum
and minimum values. As discussed at the beginning of this chapter, the steepest
point of the risk curve (where concave becomes convex) is at minimum direct cost,
which coincides with the decompression target (Figure 10-8).
[Figure 10-8: the actual risk and direct cost curves over time, with the minimum direct cost point and the normal solution marked]
Since the risk keeps descending to the right of 0.5, you can think of 0.5 as a min-
imum decompression target. Again, you should monitor the behavior of the risk
curve and not over-decompress.
If the minimum direct cost point of the project is also the best point risk-wise, this
makes it the optimal design point for the project, offering the least direct cost at
the best risk. This point is neither too risky nor too safe, benefiting as much as
possible from adding time to the project.
Note In theory, the minimum direct cost point of the project coincides
with the normal solution and the steepest point on the risk curve. In prac-
tice, that is rarely the case because the model is fundamentally a discrete
model, and you usually make concessions for reality. Your normal solution
may be close to the minimum point of direct cost, but not exactly at it. This
means you often have to decompress the normal solution to the tipping
point of the risk curve.
RISK METRICS
To end this chapter, here are a few easy-to-remember metrics and rules of thumb.
As is the case with every design metric, you should use them as guidelines. A
violation of the metrics is a red flag, and you should always investigate its cause.
• Keep risk between 0.3 and 0.75. Your project should never have extreme risk
values. Obviously, a risk value of 0 or 1.0 is nonsensical. The risk should not be
too low: Since the criticality risk model cannot go below 0.25, you can round
the lower possible limit of 0.25 up to 0.3 as the lower bound for any project.
When compressing the project, long before the risk gets to 1.0 (a fully critical
project), you should stop compressing. Even a risk value of 0.9 or 0.85 is still
high. If the bottom quarter of 0 to 0.25 is disallowed, then for symmetry’s sake
you should avoid the top quarter of risk values between 0.75 and 1.0.
• Decompress to 0.5. The ideal decompression target is a risk of 0.5, as it targets
the tipping point in the risk curve.
• Do not over-decompress. As discussed, decompression beyond the decompres-
sion target has diminishing returns, and over-decompression increases the risk.
• Keep normal solutions under 0.7. While elevated risk may be the price you pay
for a compressed solution, it is inadvisable for a normal solution. Returning to
the symmetry argument, if risk of 0.3 is the lower bound for all solutions, then
risk of 0.7 is the upper bound for a normal solution. You should always decom-
press high-risk normal solutions.
You should make both risk modeling and risk metrics part of your project design.
Constantly measure the risk to see where you are and where you are heading.
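If you automate your project design calculations, you can encode these guidelines as simple checks. The following Python sketch is one illustrative way to do so; the thresholds are the ones listed above, but the function itself is not from the book.

def risk_red_flags(risk, is_normal_solution=False):
    # Return red-flag messages for a risk value between 0 and 1.
    flags = []
    if risk < 0.3:
        flags.append("risk below 0.3: too safe; likely over-decompressed")
    if risk > 0.75:
        flags.append("risk above 0.75: too risky; decompress or redesign")
    if is_normal_solution and risk > 0.7:
        flags.append("normal solution above 0.7: decompress toward 0.5")
    return flags

for value in (0.25, 0.5, 0.72, 0.9):
    print(value, risk_red_flags(value, is_normal_solution=True))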
11
PROJECT DESIGN IN ACTION
The difficulty facing many project design novices is not the specific design tech-
niques and concepts, but rather the end-to-end flow of the design process. It is
also easy to get mired in the details and to lose sight of the objective of the design
effort. Without experience, you may be stumped when you encounter the first snag
or situation that does not behave as prescribed. It is impractical to try to cover all
possible contingencies and responses. Instead, it is better to master the thought
process involved in project design.
This chapter demonstrates the thought process and the mindset via a comprehen-
sive walkthrough of the design effort. The emphasis is on the systematic examina-
tion of the steps and iterations. You will see observations and rules of thumb, how
to alternate between project design options, how to home in on what makes sense,
and how to evaluate tradeoffs. As this chapter evolves, it demonstrates ideas from
the previous chapters as well as the synergy gained by combining project design
techniques. It also covers additional aspects of project design such as planning
assumptions, complexity reduction, staffing and scheduling, accommodating con-
straints, compression, and risk and planning. As such, the objective of this chapter
is teaching project design flow and techniques, as opposed to providing a real-life
example.
THE MISSION
Your mission is to design a project to build a typical business system. This system
was designed using The Method, but that fact is immaterial in this chapter. In gen-
eral, the input to the project design effort should include the following ingredients:
• The static architecture. You use the static architecture to create the initial list
of coding activities.
• Call chains or sequence diagrams. You produce the call chains or sequence dia-
grams by examining the use cases and how they propagate through the system.
These provide the rough cut of structural activity dependencies.
• List of activities. You list all activities, coding and noncoding alike.
• Duration estimation. For each activity, you accurately estimate the duration
(and resources) involved (or work with others to do so).
• Planning assumptions. You capture the assumptions you have about staffing,
availability, ramp-up time, technology, quality, and so on. You typically will
have several such sets of assumptions, with each set resulting in a different project
design solution.
• Some constraints. You write down all the explicitly known constraints. You
should also include possible or likely constraints, and plan accordingly. You will
see multiple examples in this chapter for handling constraints.
[Figure 11-1: the static architecture of the system]
While the system in Figure 11-1 was inspired by a real system, the merits of this
particular architecture are irrelevant in this chapter. When designing the project,
you should avoid turning the project design effort into a system design review.
Even poor architectures should have adequate project design to maximize the
chance of meeting your commitments.
[Figures 11-2 and 11-3: the call chains of the system's two use cases]
Dependency Chart
You should examine the call chains and lay out a first draft of the dependencies
between components in the architecture. You start with all the arrows connect-
ing components, regardless of transport or connectivity, and consider each as
a dependency. You should account for any dependency exactly once. However,
typically the call chain diagrams do not show the full picture because they often
omit repeated implicit dependencies. In this case, all components of the architec-
ture (except the Resources) depend on Logging, and the Clients and Managers
depend on the Security component. Armed with that additional information,
you can draw the dependency chart shown in Figure 11-4.
[Figure 11-4: the initial dependency chart]
As you can see, even with a simple system having only two use cases, the depen-
dency chart is cluttered and hard to analyze. A simple technique you can leverage
to reduce the complexity is to eliminate dependencies that duplicate inherited
dependencies. Inherited dependencies are due to transitive dependencies1—those
dependencies that an activity implicitly inherits by depending on other activities.
In Figure 11-4, Client A depends on Manager A and Security; Manager A
also depends on Security. This means you can omit the dependency between
Client A and Security. Using inherited dependencies, you can reduce
Figure 11-4 to Figure 11-5.
1. https://en.wikipedia.org/wiki/Transitive_dependency
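Removing such duplicated dependencies is the standard transitive reduction of the dependency graph. The following minimal Python sketch (not the book's tooling) applies it to the Client A fragment described above:

def reachable(graph, start, target, skip_edge):
    # Is target reachable from start without traversing skip_edge?
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        for nxt in graph.get(node, ()):
            if (node, nxt) != skip_edge:
                stack.append(nxt)
    return False

def remove_inherited(graph):
    # Drop every dependency whose target is still reachable without it.
    reduced = {node: set(deps) for node, deps in graph.items()}
    for node, deps in graph.items():
        for dep in deps:
            if reachable(reduced, node, dep, skip_edge=(node, dep)):
                reduced[node].discard(dep)
    return reduced

deps = {"Client A": {"Manager A", "Security"}, "Manager A": {"Security"}}
print(remove_inherited(deps))   # Client A keeps only the Manager A dependency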
[Figure 11-5: the dependency chart after removing inherited dependencies]
LIST OF ACTIVITIES
While Figure 11-5 is certainly simpler than Figure 11-4, it is still inadequate
because it is highly structural in nature, showing only the coding activities. You
must compile a comprehensive list of all activities in the project. In this case, the
list of noncoding activities includes additional work on requirements, architecture
(such as technology verification or a demo service), project design, test plan, test
harness, and system testing. Table 11-1 lists all activities in the project, their dura-
tion estimation, and their dependencies on preceding activities.
ID  Activity            Duration (days)  Depends On
1   Requirements        15
2   Architecture        20               1
3   Project Design      20               2
4   Test Plan           30               3
5   Test Harness        35               4
6   Logging             15               3
7   Security            20               3
8   Pub/Sub             5                3
9   Resource A          20               3
10  Resource B          15               3
11  ResourceAccess A    10               6,9
12  ResourceAccess B    5                6,10
13  ResourceAccess C    15               6
14  Engine A            20               12,13
15  Engine B            25               12,13
16  Engine C            15               6
17  Manager A           20               7,8,11,14,15
18  Manager B           25               7,8,15,16
20  Client App2         35               17
NETWORK DIAGRAM
With the list of activities and dependencies at hand, you can draw the project net-
work as an arrow diagram. Figure 11-6 shows the initial network diagram. The
numbers in this figure correspond to the activity IDs in Table 11-1. The bold lines
and numbers indicate the critical path.
[Figure 11-6: the initial network diagram; activity IDs from Table 11-1, critical path in bold]
About Milestones
As defined in Chapter 8, a milestone is an event in the project denoting the com-
pletion of a significant part of the project, including major integration achieve-
ments. Even at this early stage in the project design, you should designate the
event completing Project Design (activity 3) as the SDP review milestone, M0.
In this case, M0 is the completion of the front end (short for fuzzy front end) of the
project, comprising requirements, architecture, and project design. This makes
the SDP review an explicit part of the plan. You can have milestones on or off
the critical path, and they can be public or private. Public milestones demonstrate
progress for management and customers, while private milestones are internal
hurdles for the team. If a milestone is outside the critical path, it is a good idea
to keep it private since it could move as a result of a delay somewhere upstream
from it. On the critical path, milestones can be both private and public, and they
correlate directly with meeting the commitments of the project in terms of both
time and cost. Another use for milestones is to force a dependency even if the
call chains do not specify such a dependency. The SDP review is such a mile-
stone—none of the construction activities should start before the SDP review.
Such forced-dependency milestones also simplify the network, and you will see
another example shortly.
Initial Duration
You can construct the network of activities listed in Table 11-1 in a project plan-
ning tool, which gives you a first look at project duration. Doing so gives a duration
of 9.0 months for this project. However, without resource assignment, it is not yet
possible to determine the cost of the project.
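The duration figure comes from what is essentially a forward pass (the critical path method) over the dependency network. The following minimal Python sketch (not the book's Microsoft Project files) shows the idea using the format of Table 11-1; the activity subset and the resulting number of days are illustrative only and do not reproduce the 9.0-month figure.

# id: (name, duration in days, list of predecessor ids)
activities = {
    1: ("Requirements", 15, []),
    2: ("Architecture", 20, [1]),
    3: ("Project Design", 20, [2]),
    6: ("Logging", 15, [3]),
    9: ("Resource A", 20, [3]),
    11: ("ResourceAccess A", 10, [6, 9]),
}

def forward_pass(acts):
    # Earliest finish of every activity; the project duration is the maximum.
    finish = {}
    def earliest_finish(aid):
        if aid not in finish:
            _, duration, preds = acts[aid]
            finish[aid] = duration + max(
                (earliest_finish(p) for p in preds), default=0)
        return finish[aid]
    for aid in acts:
        earliest_finish(aid)
    return finish

finishes = forward_pass(activities)
print(finishes)                   # earliest finish (working days) per activity
print(max(finishes.values()))     # project duration in working days: 85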
Note The support files accompanying this book contain Microsoft Proj-
ect files and associated Excel spreadsheets for each permutation and itera-
tion discussed in this chapter. The following text contains explanation and
summary information only.
PLANNING ASSUMPTIONS
To proceed with the design, you itemize the planning assumptions, especially the
planned staffing requirements, in a list such as the following:
This list is, in fact, the list of resources you need to complete the project. Also note
the structure of the list: “one X for Y.” If you cannot state the required staffing this
way, you probably do not understand your own staffing requirements, or you are
missing a key planning assumption.
You should explicitly make two additional planning assumptions about the devel-
opers regarding testing time and idle time. First, in this example project, develop-
ers will produce such high-quality work that they will not be needed during system
testing. Second, developers between activities are considered a direct cost. Strictly
speaking, idle time should be accounted for as an indirect cost because it is not
associated with project activities, yet the project must pay for it. However, many
project managers strive to assign idle developers some activities in support of other
development activities, even if that means more than one developer is temporarily
assigned per service. Under this planning assumption, you still account for devel-
opers between activities as direct cost.
Project Phases
Each activity in a project always belongs to a phase, or type of activity. Typical
phases include the front end, design, infrastructure, services, UI, testing, and
deployment, among others. A phase may contain any number of activities, and
the activities in a phase can overlap on the timeline. What is less obvious is that
the phases are not sequential and can themselves overlap or even start and stop.
The easiest way of laying out phases is to structure the planning assumptions list
into a role/phase table; Table 11-2 provides an example.
Architect X X X X
Project Manager X X X X
Product Manager X X X X
DevOps X X X
Developers X X
Testers X X
In much the same way, you could add other roles that are required for the duration
of an entire phase, such as UX (user experience) or security experts. However, you
should not include roles that are required only for specific activities, such as the
test engineer.
Table 11-2 is a crude form of a staffing distribution view. The relationship between
roles and phases is essential when building the staffing distribution chart because
you must account for the use of all the resources, regardless of whether they are
assigned to specific project activities. For example, in Table 11-2, an architect is
required throughout the project. In turn, in the staffing distribution chart, you
would show the architect across the duration of the project. In this way, you can
account for all resources necessary to produce the correct staffing distribution
chart and cost calculation.
Note Rarely will someone hand you the planning assumptions on a silver
platter, as is the case in this chapter. Some form of discovery, back-and-forth,
and negotiation always takes place at the front end of the project as you try
to distill your specific planning assumptions. You can even reverse this flow:
Start with your take on the planning assumptions and staffing distributions,
captured as suggested here, and then ask for feedback and comments.
Following the process outlined in Chapter 7, Figure 11-8 shows the corresponding
project staffing distribution chart. This plan uses as many as four developers and
two database architects, uses one test engineer, and does not consume any float.
The calculated project total cost is 58.3 man-months.
Recall from Chapter 9 that finding the normal solution is an iterative process (see
Figure 9-8) simply because the lowest level of staffing is not known at the begin-
ning of the design effort. Therefore, this first set of results is not yet the normal
solution. In the next iteration you should accommodate reality, consume float to
decrease staffing volatility, address any obvious design flaws, and reduce complex-
ity if possible.
[Chart: planned earned value versus date]
[Figure 11-8: the project staffing distribution chart]
The last problem with the solution so far is the integration pressure on the Manager
services. From Table 11-1 and the network diagram of Figure 11-6, you can see
that the Managers (activities 17 and 18) are expected to integrate with four or five
other services. Ideally, you should integrate only one or two services at a time.
Integrating more than two services concurrently will likely result in a nonlinear
increase in complexity because any issues across services will be superimposed on
each other. The problem is further compounded because the integrations occur
toward the end of the project, when you have little runway left to fix issues.
Completing the infrastructure first reduces the complexity in the network (decreases
the number of dependencies and crossing lines) and alleviates the integration pres-
sure at the Managers. Overriding the original dependencies in this way typically
reduces the initial staffing demand because none of the other services can start until
M1 is complete. It also reduces the staffing volatility and usually results in a smoother
staffing distribution and a gradual ramp-up at the beginning of the project.
[Figure: the infrastructure activities Logging (6), Security (7), and Pub/Sub (8) between milestones M0 and M1]
Developing the infrastructure first changes the initial staffing to three developers
(one per service) until after M1, at which point the project can absorb a fourth
developer (note that you are still working on a staffing plan with unlimited resources).
Repeating the prior steps, the infrastructure-first plan extends the schedule by
3% to 9.2 months and incurs 2% of additional total cost, to 59 man-months. In
exchange for the negligible additional cost and schedule, the project gains early
access to key services and a simpler, more realistic plan. Going forward, this new
project becomes the baseline for the next iteration.
LIMITED RESOURCES
The resources you ask for may not always be available when you need them, so it
is prudent to plan for fewer resources (at least initially) to mitigate that risk. How
will the project behave if three developers are unavailable at the beginning of the
project? If no developers at all are available, the architect can develop the infra-
structure, or the project can engage subcontractors: The infrastructure services do
not require domain knowledge, so they are good candidates for such external and
readily available resources. If only one developer is available at the beginning, then
that single developer can do all infrastructure components serially. Perhaps only a
single developer is available initially, and then a second developer can join in after
the first activity is complete.
[Figure: staffing distribution chart for the limited-resources plan]
Extending the critical path by limiting the resources also increases the float of the
noncritical activities that span that section of the network. Compared with unlim-
ited resources, the float of the Test Plan (activity 4 in Figure 11-6) and Test
Harness (activity 5 in Figure 11-6) is increased by 30%, the float of Resource A
(activity 9 in Figure 11-6) is increased by 50%, and the float of Resource B (activ-
ity 10 in Figure 11-6) is increased by 100%. This is noteworthy because a seem-
ingly minute change in resource availability has increased the float dramatically.
Be aware that this knife can cut both ways: Sometimes a seemingly innocuous
change can cause the floats to collapse and derail the project.
Figure 11-11 Staffing distribution with three developers and one test engineer
Note the use of the test engineer along with the three developers. Figure 11-11 is
the best staffing distribution so far, looking very much like the expected pattern
from Chapter 7 (see Figure 7-8).
Figure 11-12 shows the shallow S curve of the planned earned value. The figure
shows a fairly smooth shallow S curve. If anything, the shallow S is almost too
shallow. You will see the meaning of that later on.
Figure 11-12 Planned earned value with three developers and one test engineer
Figure 11-13 shows the corresponding network diagram, using the absolute crit-
icality float color-coding scheme described in Chapter 8. This example project
uses 9 days as the upper limit for red activities and 26 days as the upper limit for
the yellow activities. The activity IDs appear above the arrows in black, and the
float values are shown below the line in the arrow’s color. The test engineer’s
activities—that is, the Test Plan (activity 4) and the Test Harness (activity
5)—have a very high float of 65 days. Note the M0 milestone terminating the front
end and the M1 milestone at the end of the infrastructure. The diagram also shows
the phasing in of the resources between M0 and M1 to build the infrastructure
(activities 6, 7, and 8).
Figure 11-13 Network diagram with three developers and one test engineer
Caution Every software project should have a test engineer. Test engi-
neers are so crucial to success that you should consider letting go of devel-
opers before you give up on professional test engineering.
With only two developers available, the previous solution (and specifically the old
critical path) no longer applies. You must therefore redraw
the network diagram to reflect the dependency on the two developers.
Recall from Chapter 7 that resource dependencies are dependencies and that the
project network is a dependency network, not just an activity network. You there-
fore add the dependency on the resources to the network. You actually have some
flexibility in designing the network: As long as the natural dependency between
the activities is satisfied, the actual order of the activities can vary. To create the new
network, you assign the two resources, as always, based on float. Each developer
takes on the next lowest-float activity available after finishing with the current
activity. At the same time, you add a dependency between the developer’s next
activity and the current one to reflect the dependency on the developer. Figure 11-14
shows the subcritical network diagram for the example project.
Given that only two developers are performing most of the work, the subcritical
network diagram looks like two long strings. One string of activities is the long
critical path; the other string is the second developer back-filling on the side. This
long critical path increases the risk to the project because the project now has more
critical activities. In general, subcritical projects are always high-risk projects.
In the extreme case of having only a single developer, all activities in the project
are critical, the network diagram is one long string, and the risk is 1.0. The dura-
tion of the project equates to the sum of all activities, but, due to the maximum
risk, even that duration is likely to be exceeded.
In the extreme case of only a single developer doing all the work, the planned
earned value is a straight line. In general, a lack of curvature in the planned earned
value chart is a telltale sign for a subcritical project. Even the somewhat ane-
mic shallow S curve in Figure 11-12 indicates the project is close to becoming
subcritical.
[Figure: planned earned value for the subcritical solution]
• This solution complies with the definition of the normal solution by utiliz-
ing the lowest level of resources that allows the project to progress unimpeded
along the critical path.
• This solution works around limitations of access to experts such as database
architects while not compromising on a key resource, the test engineer.
• This solution does not expect all the developers to start working at once.
• Both the staffing distribution chart and the planned earned value chart exhibit
acceptable behavior.
The front end of this solution encompasses, as expected, 25% of the duration
of the project, and the project has an acceptable efficiency of 23%. Recall from
Chapter 7 that the efficiency number should not exceed 25% for most projects.
The rest of the chapter uses Iteration 5 as the normal solution and as the baseline
for the other iterations. Table 11-3 summarizes the various project metrics of the
normal solution.
Peak staffing 9
Efficiency 23%
NETWORK COMPRESSION
With the normal solution in place, you can try to compress the project and see
how well certain compression techniques work. There is no single correct way
of compressing a project. You will have to make assumptions about availability,
complexity, and cost. Chapter 9 discussed a variety of compression techniques. In
general, the best strategy is to start with the easier ways of compressing the proj-
ect. For demonstration purposes, this chapter shows how to compress the project
using several techniques. Your specific case will be different. You may choose to
apply only a few of the techniques and the ideas discussed here, weighing carefully
the implications of each compressed solution.
Ideally you would assign such a resource only on the critical path, but that is not
always possible (recall the discussion of task continuity from Chapter 7). The normal
baseline solution assigns two developers to the critical path, and your goal is to replace
one of them with the top resource. To identify which one, you should consider both
the number of activities and the number of days spent on the critical path per person.
Table 11-4 lists the two developers in the normal solution who touch the critical
path, the number of critical activities versus noncritical activities each has, and
the total duration on the critical path and off the critical path. Clearly, it is best to
replace Developer 2 with the top developer.
Developer      Noncritical Activities  Duration off Critical Path (days)  Critical Activities  Duration on Critical Path (days)
Developer 1    4                       85                                 2                    35
Developer 2    1                       5                                  4                    95
Next, you need to revisit Table 11-1 (the duration estimations for each activity),
identify the activities for which Developer 2 is responsible, and adjust their duration
downward by 30% (the expected productivity gain with the better resource) using
5-day resolution. With the new activity durations, repeat the project duration and
cost analysis while accounting for the additional 80% cost markup of the top developer replacing Developer 2.
Figure 11-16 shows the critical path on the network diagram before and after com-
pressing with the top developer.
The new project duration is 9.5 months, only 4% shorter than the duration of the
original normal solution. The difference is so small because a new critical path
has emerged, and that new path holds the project back. Such a minuscule reduction
in duration is a fairly common result. Even if that single resource is vastly more
productive than the other team members, and even if all activities assigned to that
top resource are done much faster, the durations of the activities assigned to reg-
ular team members are unaffected by the top resource, and those activities simply
stifle the compression.
In terms of cost, the compressed project cost is unchanged, despite having a top
resource that costs 80% more. This, too, is expected because of the indirect cost.
Most software projects have a high indirect cost. Reducing the duration even by a
little tends to pay for the cost of compression, at least with the initial compression
attempts, because the minimum of the total cost curve is to the left of the normal
solution (see Figure 9-10).
Low-Hanging Fruit
The best candidates for parallel work in most well-designed systems are the infra-
structure and the Client designs because both are independent of the business logic.
Earlier, you saw this independence play out in Iteration 2, which pushed the infra-
structure to start immediately after the SDP review. To enable parallel work with
the Clients, you split the Clients into separate design and development activities.
Such Client-related design activities typically include the UX design, the UI design,
and the API or SDK design (for external system interactions). Splitting the Clients
also supports better separation of Client designs from the back-end system because
the Clients should provide the best experience for the consumers of the services, not
merely reflect the underlying system. You can now move the infrastructure develop-
ment and the Client design activities to be parallel to the front end.
This move has two downsides, however. The lesser downside is the higher initial
burn rate, which increases simply because you need developers as well as the core
team at the beginning. The larger downside is that starting the work before the
organization is committed to the project tends to make the organization decide to
proceed, even if the smart thing to do is to cancel the project. It is human nature
to disregard the sunk cost or to have an anchoring bias2 attached to shining UI
mockups.
I recommend moving the infrastructure and the Client designs to the front end
only if the project is guaranteed to proceed and the purpose of the SDP review is
solely to select which option to pursue (and to sign off on the project). You could
mitigate the risk of biasing the SDP decision by moving only the infrastructure
development to be in parallel to the front end, proceeding with the Client designs
after the SDP review. Finally, make sure that the Client design activities are not
misconstrued as significant progress by those who equate UI artifacts with prog-
ress. You should combine the work in the front end with project tracking (see
Appendix A) to ensure decision makers correctly interpret the status of the project.
2. https://en.wikipedia.org/wiki/Anchoring
There is no set formula for this kind of parallel work. You could do it for a few
key activities or for most activities. You could perform the additional activities
up-front or on-the-go. Very quickly you will realize that eliminating all dependen-
cies between coding activities is practically impossible because there are diminish-
ing returns on compression when all paths are near-critical. You will be climbing
the direct cost curve of the project, which, near the minimum duration point (see
Figure 9-3), is characterized by a steep slope, requiring even more cost for less and
less reduction in schedule.
Table 11-5 lists the revised set of activities, their duration, and their dependencies
for this compression iteration.
Note that Logging (activity 6), and therefore the rest of the infrastructure activi-
ties, along with the new Client design activities (activities 24 and 25), can start at
the beginning of the project. Note also that the actual Client development activ-
ities (activities 19 and 20) are shorter now and depend on the completion of the
respective Client design activities.
ID  Activity            Duration (days)  Depends On
1   Requirements        15
2   Architecture        20               1
3   Project Design      20               2
4   Test Plan           30               22
5   Test Harness        35               4
6   Logging             10
7   Security            15               6
8   Pub/Sub             5                6
9   Resource A          20               22
10  Resource B          15               22
11  ResourceAccess A    10               9,23
12  ResourceAccess B    5                10,23
13  ResourceAccess C    10               22,23
14  Engine A            15               12,13
15  Engine B            20               12,13
16  Engine C            10               22,23
17  Manager A           15               14,15,11
18  Manager B           20               15,16
22  M0                  0                3
23  M1                  0                7,8
Several potential issues arise when compressing this project by moving the infra-
structure and the Client design activities to the front end. The first challenge is
cost. The duration of the front end now exceeds the duration of the infrastructure
and the Client design activities, even when done serially by the same resources.
Therefore, starting that work simultaneously with the front end is wasteful because
the developers will be idle toward the end. It is more economical to defer the start
of the infrastructure and the Client designs until they become critical. This will
increase the risk of the project, but reduce the cost while still compressing the
project.
In this iteration, you can reduce the cost further by using the same two develop-
ers to develop the infrastructure fi rst; after completing the infrastructure, they
follow with the Client design activities. Since resource dependencies are depen-
dencies, you make the Client design activities depend on the completion of the
infrastructure (M1). To maximize the compression, the two developers used in the
front end proceed to other project activities (the Resources) immediately after the
SDP review (M0). In this specific case, to calculate the floats correctly, you also
make the SDP review dependent on the completion of the Client design activities.
This removes the dependency on the Client design activities from the Clients
themselves and allows the Clients to inherit the dependency from the SDP review
instead. Again, you can afford to override the dependencies of the network in this
case only because the front end is longer than the infrastructure and the Client
design activities combined. Table 11-6 shows the revised dependencies of the net-
work (changes noted in red).
Table 11-6 Revised dependencies with infrastructure and Client designs first
ID  Activity  Duration (days)  Depends On
1 Requirements 15
… … … …
22 M0 0 3,24,25
23 M1 0 7,8
The other challenge with splitting activities is the increased complexity of the
Clients as a whole. You could compensate for that complexity by assigning the
Client design activities and development to the same two top developers from the
previous compression iteration. This compounds the effect of compression with
top resources. However, since the Clients and the project are now more complex
and demanding, you should further compensate for that by assuming there is no
30% reduction in the time it takes to build the Clients (but the developers still cost
80% more). These compensations are already reflected in the duration estimation
of activities 19 and 20 in Table 11-5 and Table 11-6.
The result of this compression iteration is a cost increase of 6% from the previ-
ous solution to 62.6 man-months and a schedule reduction of 8% to 7.8 months.
Figure 11-17 shows the resulting network diagram.
Figure 11-17 Network diagram for infrastructure and Client designs first
Note Figure 11-17 has a long chain of activities [10, 12, 15, 18, 19] that
has a mere 5 to 10 days of float. Given the length of the project, you should
consider long chains with only 5 (or even 10) days of float as critical paths
when calculating risk.
Compressing the Clients becomes possible when you develop simulators (see
Chapter 9) for the Manager services on which they depend and move the devel-
opment of the Clients somewhere upstream in the network, in parallel to other
activities. Since no simulator is ever a perfect replacement for the real service,
you also need to add explicit integration activities between the Clients and the
Managers, once the Managers are complete. This in effect splits each Client
development into two activities: The first is a development activity against the
simulators, and the second is an integration activity against the Managers. As
such, the Clients development may not be compressed, but the overall project
duration is shortened.
You could mimic this approach by developing simulators for the Engines and
ResourceAccess services on which the Managers depend, which enables develop-
ment of the Managers earlier in the project. However, in a well-designed system
and project, this would usually be far more difficult. Although simulating the
underlying services would require many more simulators and make the project
network very complex, the real issue is timing. The development of these simula-
tors would have to take place more or less concurrently with the development of
the very services they are supposed to simulate, so the actual compression you can
realize from this approach is limited. You should consider simulators for the inner
services only as a last resort.
In this example project, the best approach is to simulate the Managers only. You
can compound the previous compression iteration (infrastructure and Client designs
at the front end) by compressing it with simulators. A few new planning assump-
tions apply when compressing this iteration:
• Dependencies. The simulators could start after the front end, and they also require the infrastructure; both constraints are captured by a dependency on M0 (activity 22).
Table 11-7 lists the activities and the changes to dependencies, while using the pre-
vious iteration as the baseline solution and incorporating its planning assumptions
(changes noted in red).
ID   Activity             Duration (days)   Depends On
1    Requirements         15
…    …                    …                 …
17   ManagerA             15                …
18   ManagerB             20                …
…    …                    …                 …
26   ManagerA Simulator   15                22
27   ManagerB Simulator   20                22
29   Client App2          20                26
Figure 11-18 shows the resulting staffing distribution chart. You can clearly see the
sharp jump in the developers after the front end and the near-constant utilization
of the resources. The average staffing in this solution is 8.9 people, with peak staff-
ing of 11 people. Compared with the previous compression iteration, the simula-
tors solution results in a 9% reduction in duration to 7.1 months, but increases the
total cost by only 1% to 63.5 man-months. This small cost increase is due to the
reduction in the indirect cost and the increased efficiency and expected throughput
of the team when working in parallel.
[Figure 11-18: Staffing distribution chart for the simulators solution, showing staffing by role over time]
Figure 11-19 shows the simulators solution network diagram. You can see the high
float for the simulators (activities 26 and 27) and Client development (activities
28 and 29). Also observe that virtually all other network paths are critical or
near critical and there is high integration pressure toward the end of the project.
This solution is fragile to the unforeseen, and the network complexity drastically
increases the execution risk.
[Figure 11-19: Network diagram for the simulators solution]
THROUGHPUT ANALYSIS
It is important to recognize how compression affects the expected throughput of
the team compared with the normal solution. As explained in Chapter 7, the pitch
of the shallow S curve represents the throughput of the team. Figure 11-20 plots
the shallow S curves of the planned earned value for the normal solution and each
of the compressed solutions on the same scale.
As expected, the compressed solutions have a steeper shallow S since they com-
plete sooner. You can quantify the difference in the required throughput by replac-
ing each curve with its respective linear regression trend line and examining the
equation of the line (see Figure 11-21).
The trend lines are straight lines, so the coefficient of the x term is the pitch of the
line and, therefore, the expected throughput of the team. In the case of the nor-
mal solution, the team is expected to operate at 39 units of productivity, while the
simulators solution calls for 59 units of productivity (0.0039 versus 0.0059, scaled
to integers). The exact nature of these units of productivity is immaterial. What
is important is the difference between the two solutions: The simulators solution
expects a 51% increase in the throughput of the team (59 – 39 = 20, which is 51%
of 39). It is unlikely that any team, even by increasing its size, could increase its
throughput by such a large factor.
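As a rough sketch of this analysis, the snippet below fits a straight trend line to two planned earned value curves and compares their slopes. The earned value points are made-up placeholders, not the project's actual data; only the technique of comparing regression slopes is the point here.

# Compare required team throughput by fitting linear trend lines to planned
# earned value. The data points are hypothetical placeholders.
import numpy as np

normal_days = np.array([0, 50, 100, 150, 200, 250, 300])       # ~9.9-month schedule
normal_ev   = np.array([0.02, 0.10, 0.25, 0.45, 0.65, 0.85, 1.00])

sim_days = np.array([0, 40, 80, 120, 160, 200, 215])            # ~7.1-month schedule
sim_ev   = np.array([0.02, 0.12, 0.32, 0.55, 0.78, 0.96, 1.00])

slope_normal, _ = np.polyfit(normal_days, normal_ev, 1)         # pitch of the trend line
slope_sim, _    = np.polyfit(sim_days, sim_ev, 1)

increase = (slope_sim - slope_normal) / slope_normal
print(f"normal pitch: {slope_normal:.4f}, simulators pitch: {slope_sim:.4f}")
print(f"required throughput increase: {increase:.0%}")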
[Figure 11-20: Planned earned value curves for the normal and compressed solutions]
[Trend line equations from the figure: y = 0.0039x − 160.72 (normal solution); y = 0.0059x − 242.8 (simulators solution)]
Figure 11-21 Earned value trend lines for the project solutions
EFFICIENCY ANALYSIS
The efficiency of each project design solution is a fairly easy number to calculate—
and a very telling one. Recall from Chapter 7 that the efficiency number indicates
both the expected efficiency of the team and how realistic the design assump-
tions are regarding constraints, staffing elasticity, and criticality of the project.
Figure 11-22 shows the project solutions efficiency chart for the example project.
[Figure 11-22: Efficiency of the project design solutions versus duration]
Observe in Figure 11-22 that peak efficiency is at the normal solution, resulting from the lowest level of resource utilization without any compression cost. As you
compress the project, efficiency declines. While the simulators solution is on par
with the normal solution, I consider it unrealistic since the project is much more
complex and its feasibility is in question (as indicated by the throughput analysis).
The subcritical solution is awful when it comes to efficiency due to the poor ratio
of direct cost to the indirect cost. In short, the normal solution is the most efficient.
Table 11-8 Duration, total cost, and cost elements for the various options
Design Option   Duration (months)   Total Cost (man-months)   Direct Cost (man-months)   Indirect Cost (man-months)
With these cost numbers, you can produce the project time–cost curves shown in
Figure 11-23. Note that the direct cost curve is a bit flat due to the scaling of the
chart. The indirect cost is almost a perfect straight line.
[Figure 11-23: Project time–cost curves: total, direct, and indirect cost versus duration]
[Figure 11-24: Time–cost correlation models: total cost y = 0.90x² − 16.32x + 134.25 (R² = 0.99); indirect cost y = 4.45x − 4.01 (R² = 0.98); direct cost y = 0.65x² − 15.60x + 112.64 (R² = 0.99)]
Figure 11-24 provides the equations for how cost changes with time in the example project. For the direct and indirect costs, the equations are:

Direct cost = 0.65t² − 15.60t + 112.64
Indirect cost = 4.45t − 4.01

where t is measured in months. While you also have a correlation model for the total cost, that model is produced by a statistical calculation, so it is not a perfect sum of the direct and indirect costs. You produce the correct model for the total cost by simply adding the equations of the direct and indirect models:

Total cost = 0.65t² − 11.15t + 108.63
[Figure 11-25: Modeled time–cost curves for the example project]
Identifying the death zone allows you to intelligently answer quick questions and avoid committing to impossible projects. For example, suppose management asks if you could build the example project in 9 months with 4 people. According to the total cost model of the project, a 9-month project costs more than 60 man-months and therefore requires an average of 7 people (more than 60 / 9 ≈ 6.7 people).
[Figure 11-26: Feasible solutions and the death zone]
Assuming the same ratio of average-to-peak staffing as the normal solution (68%),
a solution that delivers at 9 months peaks at 10 people. Any fewer than 10 people
causes the project at times to go subcritical. The combination of 4 people and 9
months (even when utilized at 100% efficiency 100% of the time) is 36 man-months
of cost. That particular time–cost coordinate is not even visible in Figure 11-26
because it is so deep within the death zone. You should present these findings to
management and ask if they want to commit under these terms.
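A minimal sketch of that quick check appears below. It assumes the total cost trend line from Figure 11-24 and the 68% average-to-peak staffing ratio quoted above; the function and variable names are only illustrative.

# Quick death-zone check for a proposed duration and headcount.
def total_cost(t):
    # Total cost trend line from Figure 11-24 (man-months; t in months).
    return 0.90 * t**2 - 16.32 * t + 134.25

def check(duration_months, people):
    required = total_cost(duration_months)
    offered = duration_months * people            # best case: 100% utilization
    avg_staff = required / duration_months
    peak_staff = avg_staff / 0.68                 # average-to-peak ratio of the normal solution
    verdict = "death zone" if offered < required else "feasible"
    print(f"{duration_months} months needs ~{required:.0f} man-months "
          f"(avg {avg_staff:.1f}, peak ~{peak_staff:.0f} people); "
          f"offered {offered:.0f} man-months -> {verdict}")

check(9, 4)   # management's ask: 36 man-months, deep inside the death zone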
Figure 11-27 plots the risk levels of the project design options along with the direct
cost curve. The figure offers both good and bad news regarding risk. The good news
is that criticality risk and activity risk track closely together in this project. It is
always a good sign when different models concur on the numbers, giving credence
to the values. The bad news is that all project design solutions so far are high-risk
options; even worse, they are all similar in value. This means the risk is elevated
and uniform regardless of the solution. Another problem is that Figure 11-27 con-
tains the subcritical point. The subcritical solution is definitely a solution to avoid,
and you should remove it from this and any subsequent analysis.
In general, you should avoid basing your modeling on poor design options. To
address the high risk, you should decompress the project.
[Figure 11-27: Risk of the project design options and the direct cost curve]
RISK DECOMPRESSION
Since in this example project all project design solutions are of high risk, you
should decompress the normal solution and inject float along the critical path
until the risk drops to an acceptable level. Decompressing a project is an iterative
process because you do not know up front by how much to decompress or how
well the project will respond to decompression.
Somewhat arbitrarily, the first iteration decompresses the project by 3.5 months,
from the 9.9 months of the normal solution to the furthest point of the subcritical
solution. This result reveals how the project responds across the entire range of
solution durations. Doing so produces a decompression point called D1 (total proj-
ect duration of 13.4 months) with criticality risk of 0.29 and activity risk of 0.39.
As explained in Chapter 10, 0.3 should be the lowest risk level for any project,
which implies this iteration overly decompressed the project.
The next iteration decompresses the project by 2 months from the normal duration,
roughly half the decompression amount of D1. This produces D2 (total project
duration of 12 months). The criticality risk is unchanged from 0.29, because these
2 months of decompression are still larger than the lower limit used in this project
for green activities. Activity risk is increased to 0.49.
Adjusting Outliers
Figure 11-28 features a conspicuous gap between the two risk models. This dif-
ference is due to a limitation of the activity risk model—namely, the activity risk
model does not compute the risk values correctly when the floats in the project are
not spread uniformly (see Chapter 10 for more details). In the case of the decom-
pressed solutions, the high float values of the test plan and the test harness skew
the activity risk values higher. These high float values are more than one standard
deviation removed from the average of all the floats, making them outliers.
When computing activity risk at the decompression points, you can adjust the
input by replacing the float of the outlier activities with the average of all floats
plus one standard deviation of all floats. Using a spreadsheet, you can easily auto-
mate the adjustment of the outliers. Such an adjustment typically makes the risk
models correlate more closely.
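A possible spreadsheet-style automation of that adjustment, sketched in Python, is shown below. The float values are hypothetical; the two large ones stand in for the test plan and the test harness.

# Replace outlier floats (more than one standard deviation above the mean)
# with the mean plus one standard deviation before computing activity risk.
from statistics import mean, pstdev

def adjust_outliers(floats):
    cap = mean(floats) + pstdev(floats)
    return [f if f <= cap else cap for f in floats]

floats = [0, 0, 5, 5, 10, 10, 15, 20, 85, 85]    # hypothetical float values
print(adjust_outliers(floats))                   # the two 85s are capped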
[Figure 11-28: The two risk models with the decompression points D1–D4]
Figure 11-29 shows the adjusted activity risk curve along with the criticality risk
curve. As you can see, the two risk models now concur.
[Figure 11-29: The adjusted activity risk curve and the criticality risk curve]
In this example project, the indirect cost model is a straight line, and you can
safely extrapolate the indirect cost from that of the other project design solution
(excluding the subcritical solution). For example, the extrapolation for D1 yields
an indirect cost of 51.1 man-months.
The direct cost extrapolation requires dealing with the effect of delays. The addi-
tional direct cost (beyond the normal solution that was used to create the decom-
pressed solutions) comes from both the longer critical path and the longer idle
time between noncritical activities. Because staffing is not fully dynamic or elastic,
when a delay occurs, it often means people on other chains are idle, waiting for the
critical activities to catch up.
In the example project’s normal solution, after the front end, the direct cost mostly
consists of developers. The other contributors to direct cost are the test engineer
activities and the final system testing. Since the test engineer has very large float,
you can assume that the test engineer will not be affected by the schedule slip. The
staffing distribution for the normal solution (shown in Figure 11-11) indicates that
staffing peaks at 3 developers (and even that peak is not maintained for long) and
goes as low as 1 developer. From Table 11-3, you can see that the normal solution
uses 2.3 developers, on average. You can therefore assume that the decompression
affects two developers. One of them consumes the extra decompression float, and
the other one ends up idle.
The planning assumptions in this project stipulate that the time developers spend idle between activities is accounted for as a direct cost. Thus, when the project slips, the slip adds
the direct cost of two developers times the difference in duration between the nor-
mal solution and the decompression point. In the case of the furthest decompres-
sion point D1 (at 13.4 months), the difference in duration from the normal solution
(at 9.9 months) is 3.5 months, so the additional direct cost is 7 man-months. Since
the normal solution has 21.8 man-months for its direct cost, the direct cost at D1
is 28.8 man-months. You can add the other decompression points by performing
similar calculations. Figure 11-30 shows the modified direct-cost curve along with
the risk curves.
Figure 11-30 Modified direct cost curve and the risk curves
These curves have an R² of 0.99, indicating an excellent fit to the data points. Figure 11-31 shows the new time–cost curve models as well as the points of minimum total cost and the normal solution.
With a better total cost formula now known, you can calculate the point of mini-
mum total cost for the project. The total cost model is a second order polynomial
of the form:
y = ax² + bx + c

Recall from calculus that the minimum point of such a polynomial is where its first derivative is zero:

y' = 2ax + b = 0

x_min = −b / (2a) = 17.78 / (2 * 0.99) = 9.0
As discussed in Chapter 9, the point of minimum total cost always shifts to the
left of the normal solution. While the exact solution of minimum total cost is
unknown, Chapter 9 suggested that for most projects finding that point is not
worth the effort. Instead, you can, for simplicity’s sake, equate the total cost of
the normal solution with the minimum total cost for the project. In this case, the
minimum total cost is 60.3 man-months and the total cost of the normal solution
according to the model is 61.2 man-months, a difference of 1.5%. Clearly, the
simplification assumption is justified in this case. If minimizing the total cost is
your goal, then both the normal solution and the first compression solution with
a single top developer are viable options.
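The same calculation in code is trivial. The a and b coefficients below follow the rebuilt total cost model quoted above; the constant term is not given in the text, so the value used here is a placeholder for illustration only.

# Minimum of the second-order total cost model y = a*t**2 + b*t + c.
a, b, c = 0.99, -17.78, 140.2                     # c is an illustrative placeholder

total_cost = lambda t: a * t**2 + b * t + c
t_min = -b / (2 * a)                              # y' = 2*a*t + b = 0
print(f"minimum total cost at {t_min:.1f} months: "
      f"{total_cost(t_min):.1f} vs. {total_cost(9.9):.1f} man-months at the normal solution")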
[Figure 11-31: Time–cost curve models with the points of minimum total cost and the normal solution]
In Figure 11-31, the point of minimum direct cost is at 10.8 months, while the normal solution is at 9.9 months. The discrepancy is partially due to the dif-
ferences between a discrete model of the project and the continuous model (see
Figure 11-30, where the normal solution is also minimum direct cost, versus Figure
11-31). A more meaningful reason is that the point has shifted due to rebuilding
the time–cost curve to accommodate the risk decompression points. In practice,
the normal solution is often offset a little from the point of minimum direct cost
due to accommodating constraints. The rest of this chapter uses the duration of
10.8 months as the exact point of minimum direct cost.
MODELING RISK
You can now create trend line models for the discrete risk models, as shown in
Figure 11-32. In this figure, the two trend lines are fairly similar. The rest of the
chapter uses the activity risk trend line because it is more conservative: It is higher
across almost all of the range of options.
[Figure 11-32: Trend line models for the discrete risk models (R² = 0.94)]
Fitting a polynomial correlation model, you now have a formula for risk in the
project:
Note The small coefficient of the first term in the polynomial combined
with its high degree (third) means that for this risk formula, more precision
is required. While not shown in the text, all the remaining calculations in
this chapter use eight decimal digits.
With the risk formula you can plot the risk model side by side with the direct cost
model, as in Figure 11-33.
[Figure 11-33: The risk model plotted side by side with the direct cost model]
Recall from Chapter 10 that ideally the minimum direct cost should be at 0.5
risk and that this point is the recommended decompression target. The example
project is off that mark by 4%. While this project does not have a project design
solution with a duration of exactly 10.8 months, the known D3 decompression
point comes close, with a duration of 10.9 months (see the dashed red lines in
Figure 11-33). In a practical sense, these points are identical.
This makes D3 the optimal point in terms of direct cost, duration, and risk. The total cost at
D3 is only 63.8 man-months, virtually the same as the minimum total cost. This
also makes D3 the optimal point in terms of total cost, duration, and risk.
Being the optimal point means that the project design option has the highest prob-
ability of delivering on the commitments of the plan (the very definition of suc-
cess). You should always strive to design around the point of minimum direct cost.
Figure 11-34 visualizes the float of the project network at D3. As you can see, the
network is a picture of health.
[Figure 11-34: Float analysis of the project network at D3]
Minimum Risk
Using the risk formula, you can also calculate the point of minimum risk. This point
comes at 12.98 months and a risk value of 0.248. Chapter 10 explained that the
minimum risk value for the criticality risk model is 0.25 (using the weights [1, 2, 3,
4]). While 0.248 is very close to 0.25, it was produced using the activity risk formula,
which, unlike the criticality risk model, is unaffected by the choice of weights.
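The sketch below shows the calculation. The book's full-precision risk coefficients are not reproduced in the text, so the coefficients here are illustrative placeholders chosen only to mimic the curve's described shape (a maximum of about 0.85 near 8.3 months and a minimum of about 0.25 near 13 months); reproducing the exact 12.98 months and 0.248 would require the real model.

# Extremes of a cubic risk model R(t) = a*t**3 + b*t**2 + c*t + d via R'(t) = 0.
# Coefficients are illustrative placeholders, not the book's full-precision model.
import math

a, b, c, d = 0.01155, -0.36908, 3.73982, -11.36699

R = lambda t: a*t**3 + b*t**2 + c*t + d

disc = (2*b)**2 - 4*(3*a)*c                       # discriminant of R'(t) = 0
for s in (-1, +1):
    t = (-2*b + s*math.sqrt(disc)) / (2*3*a)
    kind = "minimum" if 6*a*t + 2*b > 0 else "maximum"   # sign of R''(t)
    print(f"{kind} risk at {t:.2f} months: {R(t):.2f}")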
Using the risk model, Figure 11-35 shows how all the project design solutions map
to the risk curve of the project. You can see that the second compressed solution is
almost at maximum risk, and that the more compressed solutions have the expected
decreased level of risk (the da Vinci effect introduced in Chapter 10). Obviously,
designing anything to or past the point of maximum risk is ill advised. You should
avoid even approaching the point of maximum risk for the project—but where is
that cutoff point? The example project has a maximum risk value of 0.85 on the risk
curve, so project design solutions approaching that number are not good options.
Chapter 10 suggested 0.75 as the maximum level of risk for any solution. With risk
higher than 0.75, the project is typically fragile and likely to slip the schedule.
[Figure 11-35: The project design solutions mapped onto the risk curve]
Using the risk formula, you will find that the point of 0.75 risk is at a duration of 9.49 months. While no project design solution exactly matches this point, the first
compression point has a duration of 9.5 months and risk of 0.75. This suggests
that the first compression is the upper practical limit in this example project. As
discussed previously, 0.3 should be the lowest level of risk, which excludes the D1
decompression point at 0.27 risk. The D2 decompression point at 0.32 is possible,
but borderline.
SDP REVIEW
All the detailed project design work culminates in the SDP review, where you
present the project design options to the decision makers. You should not only
drive educated decisions, but also make the right choice obvious. The best project
design option so far was D3, a one-month decompression offering both minimum
cost and risk of 0.50.
When presenting your results to decision makers, list the first compression point,
the normal solution, and the optimal one-month decompression from the normal
solution. Table 11-10 summarizes these viable project design options.
Table 11-11 lists the project options you should present at the SDP review. Note the
rounded schedule and cost numbers. The rounding was performed with a bit of a
license to create a more prominent spread. While this will not change anything in
the decision-making process, it does lend more credibility to the numbers.
The risk numbers are not rounded because risk is the best way of evaluating the
options. It is nearly certain that the decision makers have never seen risk in a
quantified form as a tool for driving educated decisions. You must explain to them
that the risk values are nonlinear; that is, using the numbers from Table 11-11,
0.68 risk is a lot more risky than 0.5 risk, not a mere 36% increase. To illustrate
nonlinear behavior, you can use an analogy between risk and a more familiar non-
linear domain, the Richter scale for earthquake strength. If the risk numbers were
levels of earthquakes on the Richter scale, an earthquake of 6.8 is 500 times more
powerful than an earthquake of 5.0 magnitude, and a 7.5 quake is 5623 times more
powerful. This sort of plain analogy steers the decision toward the desired point
of 0.50 risk.
12
ADVANCED TECHNIQUES
Project design is an intricate discipline, and the previous chapters covered only the
basic concepts. This was deliberate, to allow for a moderate learning curve. There
is more to project design, and in this chapter you will find additional techniques
useful in almost any project, not just the very large or the very complex. What
these techniques have in common is that they give you a better handle on risk and
complexity. You will also see how to successfully handle even the most challenging
and complex projects.
GOD ACTIVITIES
As the name implies, god activities are activities too large for your project. “Too
large” can be relative: a god activity may be too large with respect to the other activities in the project. A simple criterion is an activity whose duration differs by at least one standard deviation from the average duration of all the activities in the project. But god activities can also be too large in absolute terms: durations in the 40–60 day range (or longer) are too large for a typical software project.
Your intuition and experience may already tell you to avoid such activities.
Typically, god activities are mere placeholders for some great uncertainty lurking
under the cover. The duration and effort estimation for a god activity is almost
always low-grade. Consequently, the actual activity duration may be longer, poten-
tially enough to derail the project. You should confront such dangers as soon as
possible to ensure that you have a chance of meeting your commitments.
God activities also tend to deform the project design techniques shown in this
book. They are almost always part of the critical path, rendering most critical
path management techniques ineffective because the critical path’s duration and
its position in the network gravitate toward the god activities. To make matters
worse, the risk models for projects with god activities result in misleadingly low
risk numbers. Most of the effort in such a project will be spent on the critical
god activities, making the project for all practical purposes a high-risk project.
However, the risk calculation will be skewed lower because the other activities
orbiting the critical god activities will have high float. If you removed these satel-
lite activities, the risk number would shoot up toward 1.0, correctly indicating the
high risk resulting from the god activities.
For instance, developing simulators for the god activities reduces other activities’
dependencies on the god activities themselves. This will enable working in parallel
to the god activities, making the god activities less (or maybe not at all) critical.
Simulators also reduce the uncertainty of the god activities by placing constraints
on them that reveal hidden assumptions, making the detailed design of the god
activities easier.
You should also consider ways of factoring the god activities into separate side
projects. Factoring into a side project is especially important if the internal phases
of the god activity are inherently sequential. This makes project management
and progress tracking much easier. You must design integration points along the
network to reduce the integration risk at the end. Extracting the god activities
this way tends to increase the risk in the rest of the project (the other activities
have much less float once the god activities are extracted). This is typically a good
thing because the project would otherwise have deceptively low risk numbers.
This situation is so common that low risk numbers are often a signal to look for
god activities.
RISK CROSSOVER POINT
In Figure 11-33, at the point of minimum direct cost and immediately to its left,
the direct cost curve is basically flat, but the risk curve is steep. This is an expected
behavior because the risk curve typically reaches its maximum value before the
direct cost reaches its maximum value with the most compressed solutions. The
only way to achieve maximum risk before maximum direct cost is if, initially, left
of minimum direct cost, the risk curve rises much faster than the direct cost curve.
At the point of maximum risk (and a bit to its right), the risk curve is flat or almost
flat, while the direct cost curve is fairly steep.
It follows that there must be a point left of minimum direct cost where the risk
curve stops rising faster than the direct cost curve. I call that point the risk cross-
over point. At the crossover point, the risk approaches its maximum. This indi-
cates you should probably avoid compressed solutions with risk values above the
crossover. In most projects, the risk crossover point will coincide with the value of
0.75 on the risk curve.
The risk crossover point is a conservative point both because it is not at maximum
risk and because it is based on the behavior of the risk and direct cost, rather than
an absolute value of risk. That said, given the track record of most software proj-
ects, a bit of caution is never a bad thing.
The rate of growth of a curve is expressed by its fi rst derivative, so you have to
compare the first derivative of the risk curve with the first derivative of the direct
cost curve. The risk model in the example project of Chapter 11 is in the form of
a polynomial of the third degree with the following form:
y = ax³ + bx² + cx + d
There are two issues you need to overcome before you can compare the two deriv-
ative equations. The first issue is that the ranges of values between maximum risk
and minimum direct cost in both curves are monotonically decreasing (meaning
the rates of growth of the two curves will be negative numbers), so you must com-
pare the absolute values of the rates of growth. The second issue is that the raw
rates of growth are incompatible in magnitude. The risk values range between 0
and 1, while the cost values are approximately 30 for the example project. To cor-
rectly compare the two derivatives, you must first scale the risk values to the cost
values at the point of maximum risk.
F = C(t_mr) / R(t_mr)

where t_mr is the duration at the point of maximum risk, R(t_mr) is the maximum risk value, and C(t_mr) is the direct cost at that duration.
The risk curve is maximized when the first derivative of the risk curve, R', is zero. Solving the project's risk equation for t when R' = 0 yields a t_mr of 8.3 months.
The corresponding risk value, R, is 0.85, and the corresponding direct cost value is
28 man-months. The ratio between these two values, F, is 32.93, the scaling factor
for the example project.
The acceptable risk level for the project occurs when all of the following condi-
tions are met:
You can put these conditions together in the form of this expression:
Using the equations for the risk and direct cost derivatives as well as the scaling
factor yields:
The result is not one, but two crossover points, at 9.03 and 12.31 months.
Figure 12-1 visualizes the behavior of the scaled risk and cost derivatives in abso-
lute value. You can clearly see that the risk derivative in absolute value crosses over
the cost derivative in absolute value in two places (hence crossover points).
Math aside, the reason why there are two risk crossover points has to do with
the semantics of the points from a project design perspective. At 9.03 months, the
risk is 0.81; at 12.31 months, the risk is 0.28. Superimposing these values on the
risk curve and the direct cost curve in Figure 12-2 reveals the true meaning of the
crossover points.
Project design solutions to the left of the 9.03-month risk crossover point are too
risky. Project design solutions to the right of the 12.31-month risk crossover point
are too safe. In between the two risk crossover points, the risk is “just right.”
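A numerical sketch of locating the crossover points is shown below. The direct cost model follows Figure 11-24; the risk coefficients are the same illustrative placeholders used in the earlier sketch, so the printed crossovers land near, but not exactly at, the 9.03 and 12.31 months quoted above.

# Locate the risk crossover points: where F*|R'(t)| crosses |C'(t)| between the
# point of maximum risk and the point of minimum risk.
a, b, c, d = 0.01155, -0.36908, 3.73982, -11.36699   # placeholder risk model
R  = lambda t: a*t**3 + b*t**2 + c*t + d
dR = lambda t: 3*a*t**2 + 2*b*t + c
C  = lambda t: 0.65*t**2 - 15.60*t + 112.64           # direct cost (Figure 11-24)
dC = lambda t: 1.30*t - 15.60

t_max_risk, t_min_risk = 8.3, 13.0
F = C(t_max_risk) / R(t_max_risk)                     # scaling factor (about 33)

g = lambda t: F * abs(dR(t)) - abs(dC(t))
t, step = t_max_risk, 0.001
while t + step < t_min_risk:
    if g(t) * g(t + step) < 0:                        # sign change -> crossover
        print(f"crossover near {t + step / 2:.2f} months")
    t += step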
[Figure 12-1: Absolute values of the scaled risk derivative and the direct cost derivative]
[Figure 12-2: The risk crossover points divide the solutions into too risky, acceptable risk, and too safe]
If you have plotted the risk curve, you can see where that tipping point is located,
and if you have one, select a decompression point at the tipping point, or more
conservatively, to its right. This technique was used in Chapter 11 to recommend
D3 in Figure 11-29 as the decompression target. However, merely eyeballing a
chart is not a good engineering practice. Instead, you should apply elementary
calculus to identify the decompression target in a consistent and objective manner.
Given that the risk curve emulates a standard logistic function (at least between
minimum and maximum risk), the steepest point in the curve also marks a twist
or inflection point in the curve. To the left of that point the risk curve is concave,
and to the right of it the risk curve is convex. Calculus tells us that at the inflec-
tion point, where concave becomes convex, the second derivative of the curve is
zero. The ideal risk curve and its first two derivatives are shown graphically in
Figure 12-3.
Using the example project from Chapter 11 to demonstrate this technique, you have the risk equation as a polynomial of the third degree. Its first and second derivatives are:

y = ax³ + bx² + cx + d
y' = 3ax² + 2bx + c
y'' = 6ax + 2b

Setting the second derivative to zero gives the inflection point:

x = −b / (3a)
Since the risk model has a = 0.01 and b = −0.36 (rounded for display; the calculation uses the full-precision coefficients):

t = −(−0.36) / (3 * 0.01) = 10.62
At 10.62 months, the risk value is 0.55, which differs only 10% from the ideal tar-
get of 0.5. When plotted on the discrete risk curves in Figure 12-4, you can see that
this value falls right between D4 and D3, substantiating the choice in Chapter 11
of D3 as the decompression target.
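In code, the decompression target is a one-liner. The coefficients are the same illustrative placeholders used in the earlier sketches; with them the inflection point lands at about 10.65 months, close to the 10.62 months obtained from the full-precision model.

# Decompression target: the inflection point of the cubic risk model, R''(t) = 0.
a, b = 0.01155, -0.36908                              # illustrative placeholders
print(f"decompression target near {-b / (3 * a):.2f} months")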
[Figure 12-4: The minimum decompression target on the discrete risk curves]
Unlike in Chapter 11, which used visualization of the risk chart and a judgment
call to identify the tipping point, the second derivative provides an objective and
repeatable criterion. This is especially important when there is no immediately
obvious visual risk tipping point or when the risk curve is skewed higher or lower,
making the 0.5 guideline unusable.
GEOMETRIC RISK
The risk models presented in Chapter 10 all use a form of arithmetic mean of the
floats to calculate the risk. Unfortunately, the arithmetic mean handles an uneven
distribution of values poorly. For example, consider the series [1, 2, 3, 1000]. The
arithmetic mean of that series is 252, which does not represent the values in the
series well at all. This behavior is not unique to risk calculations, and any attempt
at using an arithmetic mean in the face of very uneven distribution will yield an
unsatisfactory result. In such a case it is better to use a geometric rather than an
arithmetic mean.
The geometric mean of a series of n values is the nth root of the product of all the values in the series. Given a series of values a1 to an, the geometric mean of that series is:

Mean = (a1 * a2 * ... * an)^(1/n)

For example, while the arithmetic mean of the series [2, 4, 6] is 4, the geometric mean is 3.63:

Mean = (2 * 4 * 6)^(1/3) = 3.63

The geometric mean is always less than or equal to the arithmetic mean of the same series of values:

(a1 * a2 * ... * an)^(1/n) ≤ (a1 + a2 + ... + an) / n
The two means are equal only when all values in the series are identical.
While initially the geometric mean looks like an algebraic oddity, it shines when
it comes to an uneven distribution of values. In the geometric mean calculation,
extreme outliers have much less effect on the result. For the example series of [1, 2,
3, 1000], the geometric mean is 8.8 and is a better representation of the first three
numbers in the series.
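A few lines of Python make the difference concrete; both results below match the series discussed in the text.

# Geometric versus arithmetic mean, and how the geometric mean tames outliers.
from math import prod

def geometric_mean(values):
    return prod(values) ** (1 / len(values))

print(geometric_mean([2, 4, 6]))         # ~3.63 (the arithmetic mean is 4)
print(geometric_mean([1, 2, 3, 1000]))   # ~8.8  (the arithmetic mean is ~252)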
Applying the geometric mean to the criticality risk model yields the geometric criticality risk:

Risk = [(WC)^NC * (WR)^NR * (WY)^NY * (WG)^NG]^(1/N) / WC

where WC, WR, WY, and WG are the weights of the critical, red, yellow, and green activities; NC, NR, NY, and NG are the numbers of critical, red, yellow, and green activities; and N is the total number of activities in the network. Using the example network of Figure 10-4, the geometric criticality risk is:

Risk = (4^6 * 3^4 * 2^2 * 1^4)^(1/16) / 4 = 0.60
The corresponding arithmetic criticality risk for the same network is 0.69. As
expected, the geometric criticality risk is slightly lower than the arithmetic criti-
cality risk.
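The calculation is easy to script. The weights and color counts below are those of the example network of Figure 10-4, as used in the formula above.

# Geometric criticality risk for the example network of Figure 10-4.
from math import prod

weights = {"critical": 4, "red": 3, "yellow": 2, "green": 1}   # weights [1, 2, 3, 4]
counts  = {"critical": 6, "red": 4, "yellow": 2, "green": 4}   # activities per color

N = sum(counts.values())
product = prod(weights[color] ** counts[color] for color in weights)
risk = product ** (1 / N) / weights["critical"]
print(f"geometric criticality risk = {risk:.2f}")              # ~0.60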
With weights that follow a golden ratio progression (the Fibonacci-like weights from Chapter 10), each weight is ϕ times the preceding one:

WY = ϕ * WG
WR = ϕ² * WG
WC = ϕ³ * WG

Substituting these weights into the geometric criticality risk formula and simplifying:

Risk = [(ϕ³ * WG)^NC * (ϕ² * WG)^NR * (ϕ * WG)^NY * (WG)^NG]^(1/N) / (ϕ³ * WG)
     = [ϕ^(3NC + 2NR + NY) * WG^N]^(1/N) / (ϕ³ * WG)
     = ϕ^((3NC + 2NR + NY)/N) * WG / (ϕ³ * WG)
     = ϕ^((3NC + 2NR + NY)/N − 3)
Applying the geometric mean to the activity risk model runs into a problem: the critical activities have zero float, so the product of the floats, and with it the geometric mean, will always be zero. The common workaround is to add 1 to all values in the series and subtract 1 from the resulting geometric mean. The geometric activity risk is therefore:

Risk = 1 − [((F1 + 1) * (F2 + 1) * ... * (FN + 1))^(1/N) − 1] / M

where Fi is the float of activity i, N is the number of activities, and M is the maximum float in the network.
Using the example network of Figure 10-4, the geometric activity risk would be:
Risk = 1 − [(1*1*1*1*1*1 * 31*31*31*31 * 11*11 * 6*6*6*6)^(1/16) − 1] / 30 = 0.87
The corresponding arithmetic activity risk for the same network is 0.67.
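The following sketch reproduces both numbers from the floats of the example network (six critical activities with zero float, and a maximum float of 30).

# Geometric activity risk (with the "+1" workaround) versus arithmetic activity risk.
from math import prod

floats = [0]*6 + [30]*4 + [10]*2 + [5]*4      # floats of the example network
N, M = len(floats), max(floats)               # M is the maximum float

geometric  = 1 - (prod(f + 1 for f in floats) ** (1 / N) - 1) / M
arithmetic = 1 - sum(floats) / (N * M)

print(f"geometric activity risk  = {geometric:.2f}")    # ~0.87
print(f"arithmetic activity risk = {arithmetic:.2f}")   # ~0.67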
Figure 12-5 illustrates the difference in behavior between the geometric risk mod-
els by plotting all of the risk curves of the example project from Chapter 11.
[Figure 12-5: Geometric and arithmetic risk curves for the example project]
You can see that the geometric criticality and geometric Fibonacci risk have the
same general shape as the arithmetic models, only slightly lower, as expected.
You can clearly observe the same risk tipping point. The geometric activity risk is
greatly elevated, and its behavior is very different from the arithmetic activity risk.
There is no easily discernable risk tipping point.
Geometric activity risk is the last resort when trying to calculate the risk of a
project with god activities. Such a project in effect has very high risk since most
of the effort is spent on the critical god activities. As explained previously, due to
the size of the god activities, the other activities have considerable float, which in
turn skews the arithmetic risk lower, giving you a false sense of safety. In contrast,
the geometric activity risk model provides the expected high risk value for projects
with god activities. You can produce a correlation model for the geometric activity
risk and perform the same risk analysis as with the arithmetic model.
Figure 12-6 shows the geometric activity risk and its correlation model for the
example project presented in Chapter 11.
[Figure 12-6: Geometric activity risk and its correlation model for the example project]
The point of maximum risk, 8.3 months, is shared by both the arithmetic and
geometric models. The minimum decompression target for the geometric activity
model (where the second derivative is zero) comes at 10.94 months, similar to the
10.62 months of the arithmetic model and just to the right of D3. The geometric
risk crossover points are 9.44 months and 12.25 months—a slightly narrower
range than the 9.03 months and 12.31 months obtained when using the arithmetic
activity risk model. As you can see, the results are largely similar for the two mod-
els, even though the behavior of the risk curve is very different.
Of course, instead of finding a way to calculate the risk of a project with god activities, you should fix the god activities as discussed previously. Geometric
risk, however, allows you to deal with things the way they are, not the way they
should be.
EXECUTION COMPLEXITY
In the previous chapters, the discussion of project design focused on driving edu-
cated decisions before work starts. Only by quantifying the duration, cost, and
risk can you decide if the project is affordable and feasible. However, two project
design options could be similar in their duration, cost, and risk, but differ greatly
in their execution complexity. Execution complexity in this context refers to how
convoluted and challenging the project network is.
CYCLOMATIC COMPLEXITY
Cyclomatic complexity measures connectivity complexity. It is useful in measuring
the complexity of anything that you can express as a network, including code and
the project.
Complexity = E − N + 2 * P

where E is the number of edges (dependencies), N is the number of nodes (activities), and P is the number of independent, disconnected networks in the graph. For example, consider a small network of five activities connected by six dependencies:

ID   Activity   Depends On
1    A
2    B
3    C          1,2
4    D          1,2
5    E          3,4

The cyclomatic complexity of this network is:

Complexity = 6 − 5 + 2 * 1 = 3
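Counting the edges and nodes programmatically, a small Python sketch gives the same result.

# Cyclomatic complexity of the small example network: E edges, N nodes, P networks.
activities = {          # id: list of dependencies
    1: [],
    2: [],
    3: [1, 2],
    4: [1, 2],
    5: [3, 4],
}

N = len(activities)
E = sum(len(deps) for deps in activities.values())
P = 1                                          # a single connected network
print("complexity =", E - N + 2 * P)           # 6 - 5 + 2 = 3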
In general, the more parallel the project, the higher its execution complexity will
be. At the very least, it is challenging to have a larger staff available in time for
all the parallel activities. The parallel work (and the additional work required
to enable the parallel work) increases both the workload and the team size. A
larger team will be less efficient and more demanding to manage. Parallel work
also results in higher cyclomatic complexity because the parallel work increases E
faster than it increases N. At the extreme, a project with N activities starting at the
same time and fi nishing together, where each activity is independent of all other
activities and the activities are all done in parallel, has a cyclomatic complexity of
N + 2. Such a project has a huge execution risk.
In much the same way, the more sequential the project, the simpler and less com-
plex it will be to execute. At the extreme, the simplest project with N activities is a
serial string of activities. Such a project has the minimum possible cyclomatic com-
plexity of exactly 1. Subcritical projects with very few resources tend to resemble
such long strings of activities. While the design risk of such a subcritical project is
high (approaching 1.0), the execution risk is very low.
Executing a highly parallel, complex project requires an experienced team whose members are used to working together and are at peak productivity, and a
top-notch project manager who pays meticulous attention to details and proactively
handles conflicts. Lacking these ingredients, you should take active steps to reduce
the execution complexity using the design-by-layers and network-of-networks tech-
niques described later on in this chapter.
[Figure 12-7: Project execution complexity versus time: idealized (dashed) curve and actual logistic-like (solid) curve]
The problem with such a classic nonlinear behavior is that it does not account for
compressing the project by using more skilled resources without any change to the
project network. The dashed line also presumes that complexity can be further
reduced with ever-increasing allocation of time, but, as previously stated, com-
plexity has a hard minimum at 1. A better model of the project complexity is some
kind of a logistic function (the solid line in Figure 12-7).
The relatively flat area of the logistic function represents the case of working with
better resources. The sharp rise on the left of the curve corresponds to parallel
work and compressing the project. The sharp drop on the right of the curve rep-
resents the project’s subcritical solutions (which also take considerably more time).
Figure 12-8 demonstrates this behavior by plotting the complexity curve of the
example project from Chapter 11.
Recall from Chapter 11 that even the most compressed solution was not materially
more expensive than the normal solution. Complexity analysis reveals that the true
cost of maximum compression in this case was a 25% increase in cyclomatic com-
plexity—an indicator that the project execution is far more challenging and risky.
[Figure 12-8: Cyclomatic complexity of the example project's solutions (compressed, normal, and subcritical) versus duration]
VERY LARGE PROJECTS
Megaprojects with many hundreds or even thousands of activities have their own level of complexity. They typically involve multiple sites, dozens or hundreds of people, huge budgets, and aggressive schedules. In fact, you typically see the last three in tandem because the company first commits to an aggressive schedule and then throws people and money at the project, hoping the schedule will yield.
The larger the project becomes, the more challenging the design and the more
imperative it is to design the project. First, the larger the project, the more is at
stake if it fails. Second, and even more importantly, you have to plan to work in
parallel out of the gate because no one will wait 500 years—or, for that matter,
even 5 years—for delivery. Making things worse, with a megaproject the heat will
be on from the very first day, because such projects place the future of the company
at stake, and many careers are on the line. You will be under the spotlight with
managers swarming around like angry yellow jackets.
The single snowflake is so risky because complexity grows nonlinearly with size.
In large systems, the increase in complexity causes a commensurate increase in
the risk of failure. The risk function itself can be a highly nonlinear function of
complexity, akin to a power law function. Even if the base of the function is almost
1, and the system grows slowly in size (one additional line of code at a time or one
more snowflake on the mountain side), over time the growth in complexity and its
compounding effect on risk will cause a failure due to a runaway reaction.
Complexity Drivers
Complexity theory2 strives to explain why complex systems behave as they do.
According to complexity theory, all complex systems share four key elements:
connectivity, diversity, interactions, and feedback loops. Any nonlinear failure
behavior is the product of these complexity drivers.
Even if the system is large, if the parts are disconnected, complexity will not raise
its head. In a connected system with n parts, connectivity complexity grows in pro-
portion to n² (a relationship known as Metcalfe's law3). You could even make the case for connectivity complexity on the order of nⁿ due to ripple effects, where any single
change causes n changes and each of those causes n additional changes, and so on.
The system can still have connected parts and not be that complex to manage and
control if the parts are clones or simple variations of one another. On the other
hand, the more diverse the system is (such as having different teams with their own
tools, coding standards, or design), the more complex and error prone that system
will be. For example, consider an airline that uses 20 different types of airplanes,
each specific for its own market, with unique parts, oils, pilots, and maintenance
schedules. This very complex system is bound to fail simply because of diversity.
Compare that with an airline that uses just a single generic type of airplane that
is not designed for any market in particular and can serve all markets, passengers,
and ranges. This second airline is not just simpler to run: It is more robust and
can respond much more quickly to changes in the marketplace. These ideas should
resonate with the advantages of composable design discussed in Chapter 4.
You can even control and manage a connected diverse system as long as you do not
allow intense interactions between the parts. Such interactions can have destabi-
lizing unintended consequences across the system, often involving diverse aspects
such as schedule, cost, quality, execution, performance, reliability, cash flow,
2. https://en.wikipedia.org/wiki/Complex_system
3. https://en.wikipedia.org/wiki/Metcalfe's_law
customer satisfaction, retention, and morale. Unabated, these changes will trigger
more interactions in the form of feedback loops. Such feedback loops magnify the
problems to the point that input or state conditions that were not an issue in the
past become capable of taking the system down.
When the quality of the whole depends on the quality of all the components, the
overall quality is the product of the qualities of the individual elements.4 The result
is highly nonlinear decay behavior. For example, suppose the system performs a
complex task composed of 10 smaller tasks, each having a near-perfect quality of
99%. In that case the aggregate quality is only 90% (0.99¹⁰ = 0.904).
The more components the system has, the worse the effect becomes, and the more
vulnerable the system becomes to any quality issues. This explains why large proj-
ects often suffer from poor quality to the point of being unusable.
4. Michael Kremer, “The O-Ring Theory of Economic Development,” Quarterly Journal of Economics
108, no. 3 (1993): 551–575.
NETWORK OF NETWORKS
The key to success in large projects is to negate the drivers of complexity by reduc-
ing the size of the project. You must approach the project as a network of net-
works. Instead of one very large project, you create several smaller, less complex
projects that are far more likely to succeed. The cost will typically increase at least
by a little, but the likelihood of failure will decrease considerably.
With a network of networks, there is a proviso that the project is feasible, that
it is somehow possible to build the project in this way. If the project is feasible,
then there is a good probability that the networks are not tightly coupled and that
the segmentation into separate subnetworks is possible. Otherwise, the project is
destined to fail.
Once you have the network of networks, you design, manage, and execute each of
them just like any other project.
As with all design efforts, your approach to designing the network of networks
should be iterative. Start by designing the megaproject, and then chop it into indi-
vidual manageable projects along and beside the mega critical path. Look for
junctions where the networks interact. These junctions are a great place to start
with the segmentation. Look for junctions not just of dependencies but also of
time: If an entire set of activities would complete before another set could start,
then there is a time junction, even if the dependencies are all intertwined. A more
advanced technique is to look for a segmentation that minimizes the total cyc-
lomatic complexity of the network of networks. In this case, P greater than 1 is
acceptable for the total complexity, while each subnetwork has P of 1.
Figure 12-9 shows an example megaproject, and Figure 12-10 shows the resulting
three independent subnetworks.
Quite often, the initial megaproject is just too messy for such work. When this is
the case, investing the time to simplify or improve the design of the megaproject
will help you identify the network of networks. Look for ways to reduce complex-
ity by introducing planning assumptions and placing constraints on the megaproj-
ect. Force certain phases to complete before others start. Eliminate solutions
masquerading as requirements.
Decoupling Networks
The network of networks will likely include some dependencies that scuttle the
segmentation or somehow prevent parallel work across all networks, at least ini-
tially. You can address these by investing in the following network-decoupling
techniques:
Creative Solutions
Although there is no set formula for constructing the network of networks, the
best guideline is to be creative. You will often find yourself resorting to creative
solutions to nontechnical problems that stifle the segmentation. Perhaps political
struggles and pushback concentrate parts of the megaproject instead of distribut-
ing them. In such cases, you need to identify the power structure and defuse the
situation to allow for the segmentation. Perhaps cross-organizational concerns
involving rivalries prevent proper communication and cooperation across the net-
works, manifesting as rigid sequential flow of the project. Or maybe the develop-
ers are in separate locations, and management insists on providing some work for
each location, in a functional way. Such decomposition has nothing to do with
the correct network of networks or where the real skills reside. You may need to
propose a massive reorganization, including the possibility of relocating people,
to have the organization reflect the network of networks, rather than the other
way around (more on this topic in the next section on countering Conway’s law).
Perhaps some legacy group is mandated to be part of the project due to personal
favors. Instead of segmentation, this creates a choke point for the project because
everything else now revolves around the legacy group. One solution might be to con-
vert the legacy group into a cross-network group of domain expert test engineers.
Finally, try several renderings of the network of networks by different people, for
the simple reason that some may see simplicity where others do not. Given what
is at stake, you must pursue every angle. Take your time to carefully design the
network of networks. Avoid rushing. This will be especially challenging since
everyone else will be aching to start work. Due to the project’s size, however, cer-
tain failure lurks without this crucial planning and structuring phase.
If Conway’s law poses a threat to your success, a good practical way to counter it is
to restructure the organization. To do so, you first establish the correct and adequate
design, and then you reflect that design in the organizational structure, the reporting
structure, and the communication lines. Do not shy away from proposing such a
reorganization as part of your design recommendations at the SDP review.
Although Conway referred originally to system design, his law applies equally well
to project design and to the nature of the network. If your project design includes
a network of networks, you may have to accompany your design with a restruc-
turing of the organization that mimics those networks. The degree to which you
will have to counter Conway’s law even in a regular-size project is case-specific.
Be aware of the organizational dynamics and devise the correct structure if your
observation (or even your intuition) is telling you it is necessary.
SMALL PROJECTS
On the other side of the scale from very large projects are small (or even very
small) projects. Counterintuitively, it is important to carefully design such small
projects. Small projects are even more susceptible to project design mistakes than
5. Melvin E. Conway, “How Do Committees Invent?,” Datamation, 14, no. 5 (1968): 28–31.
are regular-size projects. Due to their size they respond much more to changes
in their conditions. For example, consider the effects of assigning a person incor-
rectly. With a team of 15 people, such a mistake affects about 7% of the available
resources. With a team of 5 people, it affects 20% of the project resources. A project
may be able to survive a 7% mistake, but a 20% mistake is serious trouble. A large
project may have the resource buffer to survive mistakes. With a small project,
every mistake is critical.
On the positive side, small projects may be so simple that they do not require
much project design. For example, if you have only a single resource, the project
network is a long string of activities whose duration is the sum of duration across
all activities. With very minimal project design, you can know the duration and
cost. There is also no need to build the time–cost curve or calculate the risk (it
will be 1.0). Since most projects have some form of a network that differs from a
simple string or two, and since you should avoid subcritical projects, in a practical
sense you almost always design even small projects.
DESIGN BY LAYERS
All the project design examples so far in this book have produced their network of
activities based on the logical dependencies between activities. I call this approach
design by dependencies. There is, however, another option—namely, building the
project according to its architecture layers. This is a straightforward process when
using The Method’s architectural structure. You could first build the Utilities,
then the Resources and the ResourceAccess, followed by Engines, Managers, and
Clients, as shown in Figure 12-11. I call this technique design by layers.
[Figure 12-11: Design by layers: sequential pulses of Utilities, Resources, ResourceAccess, Engines, Managers, and Clients, followed by testing]
As shown in Figure 12-11, the network diagram is basically a series of pulses, each
corresponding to a layer in the architecture. While the pulses are sequential and often
serialized, internally each pulse is constructed in parallel. The Method’s adherence
to the closed architecture principle enables this parallel work inside a pulse.
When designing by layers, the schedule is similar to the same project designed by
dependencies. Both cases result in a similar critical path composed of the compo-
nents of the architecture across the layers.
Designing by layers can increase the team size and, in turn, the direct cost of the
project. With design by dependencies, you find the lowest level of resources allow-
ing unimpeded progress along the critical path by trading float for resources. With
design by layers, you may need as many resources as are required to complete the
current layer. The team has to work in parallel on all activities within each pulse
and complete all of them before beginning the next pulse. You must assume all the
components in the current layer are required by the next layer.
With that in mind, designing by layers has the clear advantage of producing a
very simple project design to execute. It is the best antidote for a complex project
network and can reduce overall cyclomatic complexity by half or more. In the-
ory, since the pulses are sequential, at any moment in time the project manager
has to contend with only the execution complexity of each pulse and the support
activities. The cyclomatic complexity of each pulse roughly matches the number
of parallel activities. In a typical Method-based system, this cyclomatic complex-
ity is as low as 4 or 5, whereas the cyclomatic complexity of projects designed by
dependencies can be 50 or more.
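To make the complexity comparison concrete, here is a minimal sketch that computes the cyclomatic complexity of an activity network using the standard formula for a connected directed graph (edges minus nodes plus 2). The two small networks below are hypothetical and only illustrate the difference in shape; during execution of a layered design, only the current pulse's complexity is in play at any moment.

```python
# A minimal sketch: cyclomatic complexity of an activity network, using the
# standard formula for a connected directed graph:
#   complexity = edges - nodes + 2
# The two example networks are hypothetical, purely for illustration.

def cyclomatic_complexity(edges):
    """edges: list of (predecessor, successor) activity pairs."""
    nodes = {n for edge in edges for n in edge}
    return len(edges) - len(nodes) + 2

# A small dependency-driven network with cross-dependencies between activities.
by_dependencies = [
    ("start", "a1"), ("start", "a2"), ("a1", "a3"), ("a1", "a4"),
    ("a2", "a3"), ("a2", "a4"), ("a3", "a5"), ("a4", "a5"), ("a5", "end"),
]

# The same work arranged as two sequential pulses of parallel activities.
by_layers = [
    ("start", "a1"), ("start", "a2"), ("a1", "pulse1"), ("a2", "pulse1"),
    ("pulse1", "a3"), ("pulse1", "a4"), ("a3", "end"), ("a4", "end"),
]

print(cyclomatic_complexity(by_dependencies))  # 4
print(cyclomatic_complexity(by_layers))        # 3, and only one pulse is active at a time
```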
Many projects in the software industry tolerate both schedule slips and overca-
pacity; therefore their real challenge is complexity, not duration or cost. When
possible, with Method-based systems, I prefer designing by layers to address the
otherwise risky and complex execution. As with most things when it comes to
project design, designing by layers is predicated on having the right architecture
in the first place.
You can combine the techniques of both design by layers and design by depen-
dencies. For example, the example project in Chapter 11 moved all of the infra-
structure Utilities to the beginning of the project, despite the fact that their logical
dependencies would have allowed them to take place much later in the project.
The rest of the project was designed based on logical dependencies.
Note Only the design methodology for the initial network dependencies
differs for design by layers and design by dependencies. All other project
design techniques from the previous chapters apply in exactly the same way.
A final observation is that designing the project by layers basically breaks the proj-
ect into smaller subprojects. These smaller projects are done sequentially and are
separated by junctions of time. This is akin to breaking a megaproject into smaller
networks and carries very similar benefits.
13
PROJECT DESIGN EXAMPLE
While Chapter 11 illustrated an example project, its main purpose was to teach the
thought process when using project design techniques and how they interrelate.
Only secondarily did the example demonstrate end-to-end project design. The
focus in this chapter is how to drive project design decisions in a real-life project
and when to apply which project design techniques. The project designed here
builds the TradeMe system, the example system from Chapter 5. As with the sys-
tem design case study in Chapter 5, this chapter derives directly from the actual
project that IDesign designed for one of its customers. The design team consisted
of two IDesign architects (a veteran and an apprentice) and a project manager
from the customer. While this example scrubs or obfuscates the specific business
details, I present here the project design as it was. Both the system and the project
design effort were completed in less than a week.
All of the data and calculations used in this chapter are available as part of the
downloadable reference files for this book. However, when reading this chapter
for the first time, I advise you to resist the temptation to crosscheck constantly
between the text and the files. Instead, you should focus on the reasoning leading
to those calculations and the interpretations of the results. Once those are in hand,
you can use this chapter for reference as you explore the data in detail to confirm
your understanding and to practice the techniques.
Caution This chapter does not duplicate the previous chapters and avoids
explaining specific project design techniques. A thorough understanding of
the prior chapters will help you get the most out of this example.
ESTIMATIONS
The TradeMe project design effort performed two types of estimations: individ-
ual activity estimations and an overall project estimation. The individual activity
estimations were used in the project design solutions, and the overall estimation
served to validate the project design results.
When building the list of activities, the design team expanded each list to include
the individual activities and the duration estimation for each activity. The team
also indicated the designated role responsible for each activity according to the
customer’s process or their own experience.
Estimation Assumptions
The design team clearly documented any initial constraints or assumptions on the
estimations. The TradeMe project relied on the following estimation assumptions:
• Detailed design. The individual developers were capable of doing the detailed
design, so each coding activity contained its own detailed design phase.
• Development process. The team was set to build the system quickly and cleanly,
while relying on most of the best practices in this book.
Structural Activities
The structural activities of TradeMe derived directly from the system architecture
(see Figure 5-14). These activities included Utilities, Resources, ResourceAccess,
Managers, Engines, and Clients, and were mostly tasks for developers. The archi-
tect was responsible for the key activities of the Message Bus and the Workflow
Repository. Table 13-1 lists the duration estimation for some of the structural
coding activities of the project.
Table 13-1 Duration estimation for some of the structural coding activities
ID   Activity      Duration (days)   Role
14   Logging       10                Developer
16   Security      20                Developer
18   Payments DB   5                 DB Architect
…    …             …                 …
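As a rough sketch of how such an activity list can be captured for the project design calculations, the snippet below models the rows visible in Table 13-1. The data structure itself is illustrative and is not part of the book's reference files; only the IDs, names, durations, and roles shown in the table are taken from the text.

```python
# A minimal sketch of an activity list such as Table 13-1, with the duration
# estimation (in days) and the designated role for each activity. Only the
# rows visible in the table are shown; the rest of the list is elided.

from dataclasses import dataclass

@dataclass
class Activity:
    id: int
    name: str
    duration_days: int
    role: str

structural_activities = [
    Activity(14, "Logging", 10, "Developer"),
    Activity(16, "Security", 20, "Developer"),
    Activity(18, "Payments DB", 5, "DB Architect"),
    # ... remaining structural activities elided ...
]

total_effort_days = sum(a.duration_days for a in structural_activities)
print(f"{len(structural_activities)} activities, {total_effort_days} days of estimated effort")
```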
The abstract Manager was a base service for the rest of the Managers in the sys-
tem. It contained the bulk of the workflow management as well as the message
bus interaction. Derived Managers executed specific workflows. The other two
activities were both testing-related. The System Test Harness was owned by
a test engineer, but the Regression Test Harness was owned by a developer.
Noncoding Activities
TradeMe had many noncoding activities, which tended to concentrate at the begin-
ning or the end of the project. The noncoding activities were owned by various
members of the core team, the test engineer, testers, or external experts such as a UX
designer. These activities are shown in Table 13-3. This list was also driven by the
company’s development process, planning assumptions, and commitment to quality.
BEHAVIORAL DEPENDENCIES
When building the first set of dependencies, the design team examined the use
cases and the call chains supporting them. For each call chain, they listed all
the components in the chain (often in the architecture hierarchy order, such as
Resources first and Clients last) and then added the dependencies. For example,
when they examined the Add Tradesman use case (see Figure 5-18), the design
team observed that the Membership Manager calls the Regulation Engine,
so they added the Regulation Engine as a predecessor to the Membership
Manager.
Distilling dependencies from the use cases required multiple passes, because each
call chain potentially revealed different dependencies. The design team even dis-
covered some missing dependencies in the call chains. For example, based solely
on the call chains of Chapter 5, the Regulation Engine required only the
Regulation Access service. Upon further analysis, the design team decided
that Regulation Engine depended on Projects Access and Contractors
Access as well.
Operational Dependencies
Some code dependencies were implicit in the call chains due to the system’s oper-
ational concept. In TradeMe, all communication between Clients and Managers
(and between Managers and other Managers) flowed over the message bus, creat-
ing an operational (not structural) dependency between them. The dependencies
indicated that the Clients needed the Managers ready for test and deployment.
NONBEHAVIORAL DEPENDENCIES
TradeMe also contained dependencies that could not be traced directly to the
required behavior of the system or its operational concept. These involved coding
and noncoding activities alike. Such dependencies originated mostly with the com-
pany’s development process and TradeMe’s planning assumptions. For example,
the new system had to carry forward the legacy data from the old system. Data
migration necessitated that the new Resources (the databases) complete first, so
the data migration activity depended on the Resources. Similarly, the completion
of the Managers required the Regression Test Harness. In addition, at the
time of the project’s design, the plan still had to account for a few remaining front-
end activities. Finally, the company had its own release procedures and internal
dependencies, which were incorporated as dependencies between the concluding
activities.
Similar logic applied to security. While the call chain analysis indicated that only
the Clients and the Managers needed to take explicit security actions, security
was so important that the project had to assure Security completed before all
business logic activities. This ensured all activities had security support if they
needed it and avoided security becoming an afterthought or a late-stage add-on.
Reducing Complexity
The project design team also overrode dependencies so that they could reduce
the complexity of the emerging network. Specifically, they changed the following
dependencies:
THE NORMAL SOLUTION
SANITY CHECKS
With the initial network laid out, the design team performed the following sanity
checks:
1. Verified the TradeMe project had a single start activity and a single end
activity.
2. Verified that every activity in the project resided on a path ending somewhere
on the critical path(s).
3. Verified that the initial risk measurement yielded a relatively low risk number.
4. Calculated the duration of the project without any resource assignment. This
came to 7.8 months and later would serve as an important check of the normal
solution.
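As a rough sketch of sanity check 4, the snippet below computes the project duration with unlimited resources as the length of the longest path through the activity network. The activities, dependencies, and durations are hypothetical, not TradeMe's; only the technique (a longest-path calculation over the dependency graph) mirrors the text.

```python
# A minimal sketch of sanity check 4: with unlimited resources, the project
# duration is the length of the longest (critical) path through the activity
# network. Activity names and durations are hypothetical.

from functools import lru_cache

durations = {"start": 0, "a": 15, "b": 20, "c": 10, "d": 25, "end": 0}  # days
predecessors = {
    "start": [],
    "a": ["start"],
    "b": ["start"],
    "c": ["a", "b"],
    "d": ["b"],
    "end": ["c", "d"],
}

@lru_cache(maxsize=None)
def earliest_finish(activity):
    """Earliest finish = the latest finish of the predecessors plus the activity duration."""
    start = max((earliest_finish(p) for p in predecessors[activity]), default=0)
    return start + durations[activity]

print(earliest_finish("end"), "days on the critical path")  # 45 days
```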
PLANNING ASSUMPTIONS
• Core team. The core team was required throughout the project. The core team
consisted of a single architect, a project manager, and a product manager. The
core team was allowed to work directly on the project only infrequently. Such
work included key high-risk activities done by the architect and producing the
user manual, which was assigned to the product manager.
• Access to experts. The project had access to experts or specialists, such as test
engineers, DB architects, and UX/UI designers.
• Assignments. There was a 1:1 assignment of developers to services or other cod-
ing activities. On top of assigning based on floats, whenever possible, TradeMe
maintained task continuity (see Chapter 7).
• Quality control. A single quality control tester was required from the start of
construction to the end of the project. The tester was treated as a direct cost
only during the system test activity. One additional tester was required for the
system testing activity.
• Build and operations. A single build, configuration, deployment, and DevOps
specialist was required from start of construction until the end of the project.
• Developers. Developers between tasks were considered to be a direct cost rather
than an indirect cost. TradeMe’s high quality expectations eliminated the need
for developers during system testing.
Table 13-4 outlines which roles were required in each phase of the project.
Architect: X X X X
Project Manager: X X X X
Product Manager: X X X X
Testers: X X X
DevOps: X X X
Developers: X X
NETWORK DIAGRAM
Assigning resources to the various activities affected the project network. In sev-
eral places, the network included dependencies on the resources in addition to
the logical dependencies between the activities. After consolidating the inherited
dependencies, the network diagram looked like Figure 13-1.
PLANNED PROGRESS
Figure 13-2 captures the planned earned value of the first normal solution. The
duration of this solution stood at 7.8 months, indicating that the staffing assign-
ments had not extended the critical path. The chart in Figure 13-2 has the general
shape of a shallow S curve but is not ideal. The project starts reasonably well, but
the curve stays steep through the second half instead of tapering off. The steep
planned earned value curve was also reflected in the somewhat elevated risk values.
Both the activity risk and criticality risk were 0.7.
[Figure 13-2: Planned earned value of the first normal solution, plotted against dates from 1/5 to 8/28]
[Figure 13-3: Planned staffing distribution of the first normal solution by role (project manager, product manager, architect, DevOps, testers, developers, test engineer/UX)]
The calculated project efficiency was 32%. Since the upper practical limit is 25%,
such high efficiency was questionable. Taken together, the direct cost higher than
the indirect cost, the conspicuous peak in the staffing distribution chart, and
the high efficiency strongly indicated overly aggressive assumptions about staff-
ing elasticity. The solution expected that across all the parallel network paths,
resources would always be available at the right time to maintain progress. The
rather steep planned earned value chart visualized this expectation. In short, this
first attempt at the normal solution assumed a very efficient team, likely one too
efficient to be practical.
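For readers who want to reproduce such a chart from their own plan, here is a minimal sketch that treats the planned earned value at each date as the cumulative share of the total planned effort completed by that date. The monthly effort figures are hypothetical; a healthy plan produces a shallow S shape that tapers toward the end.

```python
# A minimal sketch of a planned earned value curve: the cumulative share of the
# total planned effort completed by the end of each period. All numbers are
# hypothetical.

planned_effort_per_month = [3, 6, 10, 12, 12, 10, 6, 3]  # man-months per month

total = sum(planned_effort_per_month)
running = 0
for month, effort in enumerate(planned_effort_per_month, start=1):
    running += effort
    percent = round(100 * running / total)
    print(f"End of month {month}: {percent}% planned earned value")
```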
RESULTS SUMMARY
Table 13-5 summarizes the project metrics of this first normal solution.
Peak staffing: 12
Efficiency: 32%
COMPRESSED SOLUTION
The next step was to consider options for accelerating the project. Due to the
presence of the two critical paths, the best course of action was to compress this
project by enabling parallel work.
From Figure 13-1, it was evident that the Manager services (activities 36, 37, 38, 39),
along with the Regression Test Harness (activity 40), capped the two critical
paths, as well as the two near-critical paths. The Clients (activities 42, 43, 44, 45),
in turn, depended on the completion of all the Managers, prolonging the project.
This made the Clients and the Managers natural candidates for compression.
1. A contract design activity that decoupled the Clients from the Manager. The
various contract design activities could perhaps have started after the SDP
review, but it was deemed better to postpone them until after the infrastruc-
ture was complete. Estimated work per contract: 5 days.
2. A Manager simulator that provided a good-enough implementation of the
Manager’s contract. The simulators had to enable full development of the
Clients, which now depended on the simulators, not the actual Managers.
[Figure 13-4: Compressing the project with contract design activities and Manager simulators, enabling parallel development of the Clients and the Managers, followed by Managers integration]
ASSIGNING RESOURCES
The rest of the steps for the compressed solution were virtually identical to those
for the normal solution. However, the design team discovered they could reduce
the staff by two developers throughout the project by using the architect for one
development activity and by pushing the schedule out one week. The company
judged trading the slight delay for the reduced staff as acceptable given its chal-
lenge in securing more developers. The duration of the compressed solution came
in at 7.1 months, a 3-week (9%) acceleration compared with the normal solution
(7.8 months). The new resources did consume more of the floats, and the new risk
number for the project was 0.74.
PLANNED PROGRESS
Figure 13-5 shows the planned earned value for the compressed solution. The
curve now tapers somewhat at the end of the project, better than that in the nor-
mal solution.
[Figure 13-5: Planned earned value of the compressed solution, plotted against dates from 1/5 to 8/3]
[Figure 13-6: Planned staffing distribution of the compressed solution by role]
RESULTS SUMMARY
Table 13-6 summarizes the metrics of the compressed solution. The compressed
solution made an already challenging project (see Figure 13-1) more challenging
and created an unrealistically high efficiency expectation. Its major downside,
however, was the integration—not the increase in execution complexity. The mul-
tiple, parallel integrations occurring near the end of the project offered no lee-
way. If any of them went awry, the team had no time for repairs. The increase in
both the execution complexity and the integration risk in exchange for less than a
month of compression was not a good trade.
Peak staffing: 12
Efficiency: 37%
Even so, this compression attempt was not a waste of time—it proved the com-
pressed solution would be an exercise in futility. The compressed solution also
helped the design team to better understand the project and provided another
point on the time–cost curve.
DESIGN BY LAYERS
The main problem with the first normal solution was not the unrealistic efficiency
but the complexity of the project network. That complexity is evident just by
examining the (already simplified) network diagram in Figure 13-1. The cyclo-
matic complexity of the network is 33 units. Coupled with the high efficiency
expected of the team, this implied a high execution risk.
Instead of confronting the high complexity, the design team chose to redesign the
project by architecture layers, as opposed to the logical dependencies between
the activities. This produced mostly a string of pulses of activities. The pulses
corresponded to the layers of the architecture or the phase of the project: front
end, infrastructure and foundational work, Resources, ResourceAccess, Engines,
Managers, Clients, and release activities (Figure 13-7).
[Figure 13-7: Design by layers for TradeMe: a string of pulses covering the front end, infrastructure, Resources, ResourceAccess, Engines and the abstract Manager, the Membership and Market Managers, Clients, and release, with support activities (UI design, UX design, manual, training and education, test plan, test harness) outside the pulses]
While the pulses were serialized and sequential to each other, internally the
pulses were done in parallel. In Figure 13-7, all the pulses are collapsed except for
the expanded Manager’s pulse. A few remaining support activities, such as UI
Design and the Test Harness, were not part of the string of pulses, but they
had very high float.
An instantly noticeable aspect of Figure 13-7 is how simple that network is com-
pared with that of Figure 13-1. Since the pulses were sequential in time, the project
manager would only have to contend with the complexity of each pulse and its
support activities. In TradeMe, the complexity of the individual pulses was 2, 4,
5, 4, 4, 4, 4, and 2. The complexity of the support activities was 1, and due to their
high float, they had essentially no effect on the execution complexity.
STAFFING DISTRIBUTION
Figure 13-8 shows the planned staffing distribution for the design-by-layers solu-
tion. The overall shape of the staffing distribution chart was satisfactory. The
project needed only 4 developers, and staffing peaked at 11 people.
[Figure 13-8: Planned staffing distribution of the design-by-layers solution by role, peaking at 11 people]
RESULTS SUMMARY
Table 13-7 shows the project metrics for designing TradeMe by layers.
Peak staffing: 11
Efficiency: 31%
SUBCRITICAL SOLUTION
The design-by-layers solution called for four developers. The company was con-
cerned about what would happen if it was unable to get those four developers. It
was therefore important to investigate the implications of going subcritical. The
planning assumptions still allowed for access to external experts.
For this project, any design-by-layers solution with fewer than four developers
became subcritical, so the design team chose to explore a two-developer solution.
These developers were assigned the database design as well. The subcritical net-
work diagram was similar to the one in Figure 13-7 except that internally each
pulse consisted of only two parallel strings of activities.
[Figure 13-9: Planned earned value of the subcritical solution, with the planned progress and a fitted model (R² = 0.98)]
The subcritical nature of the solution was also reflected by its risk index of 0.84.
If the company had to pursue this option, the design team recommended decom-
pressing the project by at least a month. Decompression pushed the schedule into
the 12-month range, 50% or more longer than the design-by-layers solution.
RESULTS SUMMARY
Table 13-8 shows the project metrics for the subcritical solution.
Peak staffing: 9
Average developers: 2
Efficiency: 25%
The subcritical time and cost metrics (11.1 months and 74.1 man-months) compared
favorably to those for the overall estimation (10.5 months and 74.6 man-months),
differing by about 5% in duration and less than 1% in cost. This correlation sug-
gested that the subcritical solution numbers were the likely option for the project.
The more realistic 25% efficiency also gave credence to the subcritical solution.
PLANNING AND RISK
In short, for TradeMe, design by layers was comparable to or better than the first
normal solution in every respect except risk. Even if the design-by-layers solution
had cost more and taken longer, its execution simplicity made it the obvious choice
for TradeMe. The design-by-layers solution was also far better than the subcriti-
cal solution derived from it. The subcritical solution cost more, took longer, and
was riskier. The design team adopted the design-by-layers solution as the normal
solution for the remainder of the analysis.
RISK DECOMPRESSION
The design-by-layers solution had elevated risk and critical pulses, which the design
team mitigated by using risk decompression. Since the appropriate amount of decom-
pression was unknown, the design team tried decompressing by 1 week, 2 weeks,
4 weeks, 6 weeks, and 8 weeks, and observed the risk behavior. Table 13-9 shows the
risk values of the three design options and the five decompression points.
Table 13-9 Risk values for the options and decompression points
Option   Duration (months)   Criticality Risk   Activity Risk
Figure 13-10 plots these options and decompression points against the timeline.
The criticality risk behaved as expected, and the risk dropped with decompression
along some logistic function. The activity risk also dropped with decompres-
sion, but a gap appeared between the two curves because the activity risk model
did not respond well to an uneven distribution of the floats. The calculations that
produced the values in Table 13-9 addressed this issue by adjusting the float outli-
ers as described in Chapter 11—that is, by replacing the outliers with the average
of the floats plus one standard deviation of the floats. In this case, the adjustment
was simply insufficient. A float adjustment at half a standard deviation aligned
the curves perfectly. However, the design team chose to just use the criticality risk
curve, which did not require any adjustments. The team observed that decompres-
sion beyond D4 was excessive because the risk curve was leveling out.
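A minimal sketch of that outlier adjustment appears below. The float values are hypothetical, and the activity risk formula used (one minus the average float divided by the maximum float) is the definition from the earlier risk discussion; treat both as assumptions of the sketch rather than TradeMe's actual numbers.

```python
# A minimal sketch of the float-outlier adjustment: floats above the mean plus
# one standard deviation are replaced by that threshold before computing the
# activity risk. Floats are hypothetical; the activity risk formula
# (1 - average float / maximum float) is assumed from the earlier risk chapter.

import statistics

floats = [0, 0, 5, 10, 10, 15, 20, 90]  # days of float per activity (hypothetical)

threshold = statistics.mean(floats) + statistics.stdev(floats)
adjusted = [min(f, threshold) for f in floats]

def activity_risk(float_values):
    max_float = max(float_values)
    return 1 - (sum(float_values) / (len(float_values) * max_float))

print(f"Activity risk, raw floats:      {activity_risk(floats):.2f}")
print(f"Activity risk, adjusted floats: {activity_risk(adjusted):.2f}")
```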
With the values in Table 13-9, the design team found a polynomial correlation
model for the risk curve with an R² of 0.96.
Using the risk model, maximum risk was at 7.4 months, with a risk value of
0.78. This point was between the design-by-dependencies solution's 7.8 months and
the compressed solution’s 7.1 months (see Figure 13-11). The design team removed the
compressed solution from consideration because it was past the point of maximum
risk. Even the design-by-dependencies solution was borderline risk-wise: At 7.8
months, the risk was already 0.75, the maximum recommended value. The design-
by-layers solution was at a comfortable 0.68 risk. The point of minimum risk was
at 9.7 months with a risk value of 0.25.
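As a rough sketch of how such a risk model can be fitted, the snippet below uses numpy.polyfit on a set of decompression points. The durations and criticality risk values are hypothetical placeholders (the actual data is in the downloadable reference files), so the resulting coefficients will not match the project's model; only the fitting and root-finding procedure mirrors the text.

```python
# A minimal sketch of fitting a polynomial risk model to the decompression
# points. The durations and criticality risk values below are hypothetical
# placeholders, not TradeMe's actual data.

import numpy as np

durations = np.array([7.1, 7.8, 8.1, 8.3, 8.8, 9.3, 9.8])                 # months (hypothetical)
criticality_risk = np.array([0.74, 0.75, 0.68, 0.62, 0.45, 0.32, 0.26])   # hypothetical

coefficients = np.polyfit(durations, criticality_risk, deg=3)
risk_model = np.poly1d(coefficients)

# Goodness of fit (R squared) of the model against the sample points.
predicted = risk_model(durations)
ss_res = np.sum((criticality_risk - predicted) ** 2)
ss_tot = np.sum((criticality_risk - np.mean(criticality_risk)) ** 2)
print(f"R^2 of the fit: {1 - ss_res / ss_tot:.2f}")

# The zeros of the derivative locate the stationary points of the model,
# such as the duration of maximum risk.
stationary_points = risk_model.deriv().roots
print("Stationary points of the risk model (months):", np.round(stationary_points.real, 2))
```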
[Figure 13-10: Activity risk and criticality risk versus duration, showing the compressed, by-dependencies, and by-layers solutions, the decompression points D1 through D5, and the minimum decompression target]
Table 13-10 captures the risk value of these points, and Figure 13-11 visualizes
them along the risk model curve.
Point   Duration (months)   Risk
D2      8.53                0.53
D3      9.0                 0.38
[Figure 13-11: The risk model curve, showing the point of maximum risk, the compressed, by-dependencies, and by-layers solutions, the decompression points D2 and D3, and the point of minimum direct cost]
The last technique brought to bear on finding the decompression target was calculating
the point of minimum direct cost. However, the direct cost at the decompression
points was unknown.
Examining Figure 13-8 and Table 13-7, the design team conservatively estimated
that the decompression required three out of the four developers to keep working
during the decompression. This allowed the team to calculate the direct cost for
extending the project to the D5 decompression point. The design team added that
extra direct cost to the known direct cost of the design-by-layers solution, which
provided a direct cost curve and a well-fitted correlation model.
Using the direct cost formula, the design team found the point of minimum direct
cost at 8.46 months, right before D2. Substituting the 8.46-month duration into
the risk formula provided a risk of 0.56. The duration difference between the min-
imum point of the direct cost model and the zero point of the second derivative of
the risk model was 1%, confirming D3 as the decompression target. Incidentally,
the minimum direct cost was 31.4 man-months, while the direct cost at D3 was
32.2 man-months, a difference of merely 3%.
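The same approach can be sketched in code: locate the minimum of the direct cost model from the zero of its first derivative and compare it with the zero of the risk model's second derivative. The polynomial coefficients below are hypothetical placeholders, not the models the design team actually fitted.

```python
# A minimal sketch of locating the point of minimum direct cost and comparing
# it with the zero of the risk model's second derivative. The coefficients are
# hypothetical placeholders, not the project's actual models.

import numpy as np

# Hypothetical fitted models over the decompression range (duration in months).
direct_cost = np.poly1d([0.9, -15.2, 95.0])          # man-months, parabola in duration
risk_model = np.poly1d([-0.05, 1.35, -12.0, 36.0])   # risk index, cubic in duration

# Minimum direct cost: zero of the first derivative of the cost model.
cost_minimum = direct_cost.deriv().roots.real
print("Duration of minimum direct cost (months):", np.round(cost_minimum, 2))

# Inflection point of the risk curve: zero of the second derivative of the risk model.
risk_inflection = risk_model.deriv(2).roots.real
print("Zero of the risk model's second derivative (months):", np.round(risk_inflection, 2))

# When these two durations nearly coincide (about 1% apart for TradeMe),
# that duration is a natural decompression target.
```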
RECALCULATING COST
Recommending D3 required the design team to provide the total cost at that point.
While the direct cost was known from the prior formula, the indirect cost was
unknown across the decompression range. The design team modeled the indirect
cost for the three known solutions, obtaining a simple straight line for the indirect
cost as a function of duration.
The design team added the direct and indirect cost equations together to come up
with the total cost model for the system.
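As a rough sketch, combining the two models is a one-line operation once each is represented as a polynomial in the project duration; the coefficients here are hypothetical placeholders.

```python
# A minimal sketch of combining the cost models: total cost is the sum of the
# direct cost and indirect cost models. The coefficients are hypothetical.

import numpy as np

direct_cost = np.poly1d([0.9, -15.2, 95.0])   # man-months vs. duration (months)
indirect_cost = np.poly1d([4.6, 1.5])         # a straight line: slope * duration + intercept

total_cost = direct_cost + indirect_cost
print(total_cost)        # the combined polynomial
print(total_cost(9.0))   # total cost at a 9-month duration, in man-months
```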
Note Although they did not do so in real time with TradeMe’s customer,
the design team used the direct cost and risk models to find the project’s risk
crossover points. These came at 7.64 months and 0.77 risk (for the too-
risky crossover point) and at 9.47 months and 0.27 risk (for the too-safe risk
crossover point). These points coincided nicely with the guidelines of 0.75
and 0.3, respectively, and confirmed the validity of the project design points
discussed previously.
In addition to this optimal point, the design team presented the design-by-
dependencies solution to the company’s decision makers. It demonstrated that any
attempt to decrease the schedule would drastically increase the design risk and
the execution risk due to high complexity and the unrealistic expected efficiency
of the team.
Because of the potential resource shortage, the design team found it necessary to
include the subcritical solution, but only with adequate decompression. Repeating
similar steps as for the design-by-layers solution, the decompressed subcritical
solution provided a risk of 0.47, a duration of 11.8 months, and a total cost of 79.5
man-months. The decompressed subcritical solution was presented both to show
the consequences of understaffing the project and to show that the project was still
feasible, if the need should arise.
Due to their higher risk, there was no point in considering the non-decompressed
options of the design-by-layers and subcritical solutions. Table 13-11 summarizes
the project design options that the design team presented at the SDP review.
For the presentation, the design team renamed the design options to avoid project
design jargon such as “normal,” “decompression,” “subcritical,” and “by layers.”
In Table 13-11, the label “Activity Driven” stands for design by dependencies,
“Architecture Driven” stands for design by layers, and “Understaffed” stands for
subcritical.
The table used plain-language terms such as “High” and “Low” for complexity
and rounded all numbers other than the risk values. The table gently prodded the
decision makers toward the decompressed design-by-layers solution.
14
CONCLUDING THOUGHTS
The previous chapters focused on the technical aspects of designing a project.
Certainly, you can view project design as a technical design task. After practicing
project design for decades, I find that it is actually a mindset, not just an expertise.
You should not simply calculate the risk or the cost and try to meet your commit-
ments. You must strive for a complete superiority over every aspect of the project.
You should prepare mitigations for everything the project can throw at you—
which requires going beyond the mechanics and the numbers. You should adopt
a holistic approach that involves your personality and attitude, how you interact
with management and the developers, and the recognition of the effect that design
has on the development process and the product life cycle. The ideas I have laid out
for system and project design in both parts of this book open a portal to a parallel
level of excellence in software engineering. It is up to you to keep that portal open,
to keep improving, to refine these ideas, to develop your own style, and to adapt.
This concluding chapter advises how you should approach these aspects, but more
importantly shows you how to continue the journey.
optimal point could be both huge in absolute terms and likely to surpass the cost
of designing the project.
If you choose the self-funding route, would you invest in project design? Would
this investment be a little investment in time and effort or a large one? Would you
say that you do not have time for project design? Would you say that it is better to
just start building something and figure things out later, or will you do whatever it
takes to find out if the project is affordable before becoming broke and destitute?
Would you skip any of the techniques or analysis of project design? Even if you
can afford the project, would you not still design the project to identify the risk
exclusion zones? Would you repeat all the calculations a second time for good
measure? Would you first design the project to see if you should sell your house
and quit your job? After all, if the project requires $3 million and you were able to
muster only $2 million, you should keep the house, not the new startup. The same
goes for the duration of the project. If you have only a one-year marketing window
and the project is really a two-year project, then you should do nothing. When
self-funded, would you also not prefer that your developers work against detailed
assembly instructions of the project, as opposed to wasting your scant resources
trying to figure it out on their own?
Next, imagine a project where the manager is held personally liable for any fail-
ure to meet the commitments. Instead of the manager earning a nice bonus when
meeting the commitments, in the case of failure the manager has to pay out of
pocket for the project cost overruns, if not the lost sales, as well as any contrac-
tual obligations. In such a situation, would the manager oppose project design or
insist on it? Would the manager resist project design because “that is not how we
do things here”? Would the manager invest a little or a lot in system and project
design to ensure the commitments are aligned with what the team can produce?
Would the manager avoid finding out where the death zone is? Would the manager
give up on sound architecture that will ensure the project design itself will not
change much? Would the manager say that since no one is working this way, that
is a good enough reason not to design the project?
The dissonance is stark. Most people have a callous, cavalier, and complacent
attitude when the company is paying. Most people avoid thinking for themselves
because it is so much easier to dogmatically follow the common practices of a fail-
ing industry and use that as an excuse when squandering other people’s money.
Most just make excuses such as that they do not have the time, or that project
design is the wrong process, or that project design is over-engineered. Yet when
their head is on the chopping block, the same people become project design zeal-
ots. Such a difference in behavior is a direct result of lack of integrity, both per-
sonal and professional. The real answer to the question of when to design a project
is when you have integrity.
Nothing else really matters. Most managers cannot tell the difference between a
great design and a horrible design, so they will never promote or reward you based
on architecture alone. However, if you treat the company’s money as your own,
if you thoroughly design the project to find the most affordable and safest way
of building the system, and if you flat out refuse any other course of action, the
higher-ups will notice. By showing the utmost respect for the company’s money,
you will earn their respect, because respect is always reciprocal. Conversely, people
do not respect those who are disrespectful toward them. When you are account-
able for your actions and decisions, your worth in the eyes of top management will
drastically increase. If you repeatedly meet your commitments, you will earn the
trust of the top brass. When the next opportunity comes, they will give it to the
one person whom they trust to be respectful of their time and money: you.
This advice is drawn from my own career. Before I was 30 years old, I led the soft-
ware architecture group of a Fortune 100 company in Silicon Valley, the most com-
petitive place in the world for the software industry. My rise to the top had little to
do with my architecture prowess (as discussed, that hardly ever amounts to much).
I did, however, always bundle my system design with project design, and that made
all the difference. In my mind, the company’s money was my money.
FINANCIAL ANALYSIS
With most sizable projects, someone somehow must decide how to pay for the
project. Project managers may even have to present the expected burn rate or
cash flow of the project. This is especially important with large projects. For
these projects, the customer is typically unable to pay a lump sum either at
the beginning or the end of the project, requiring the developing organization
to fund the effort via a payment schedule. In most cases, lacking any kind of
knowledge about the project flow or its network design, financial planning
is reduced to some amalgam of guesswork, wishful thinking, and functional
decomposition of payments (e.g., a certain amount per feature). This often is
a recipe for disaster. As it turns out, there is no need for guesswork about the
financial side of the project. With very little extra work, you can extend your
project design into a financial analysis of the project.
From your staffing distribution, you can calculate the cost of each time slice of
the project. Next, present those costs as a running sum, either as absolute values
or in relative terms (percentages). You can even present direct versus total cost
over time, either numerically or graphically (for financial planning, you should
use monetary units rather than effort units, so you need to know the cost of a
man-month at your organization).
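As a rough sketch, the snippet below turns a staffing distribution into a cash-flow projection: the cost of each time slice and a running total. The headcounts and the cost of a man-month are hypothetical placeholders; substitute your own figures.

```python
# A minimal sketch of turning a staffing distribution into a cash-flow
# projection: cost per time slice and a running (cumulative) sum.
# Headcounts and the cost of a man-month are hypothetical placeholders.

COST_PER_MAN_MONTH = 20_000  # monetary units; substitute your organization's figure

staffing_per_month = [3, 5, 8, 10, 10, 8, 5, 3]  # total headcount in each month

cumulative = 0
for month, headcount in enumerate(staffing_per_month, start=1):
    slice_cost = headcount * COST_PER_MAN_MONTH
    cumulative += slice_cost
    print(f"Month {month}: {slice_cost:>8,} this month, {cumulative:>9,} to date")
```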
The reason for mentioning the financial planning aspect of the project in a
book about software design has little to do with finance, as valuable as that is.
In most software projects, the people who are trying to design the system and
the project, to invest in best practices, and to meet their commitments face a
grueling uphill struggle, as if everybody else is bent on doing everything the
worst possible way.
GENERAL GUIDELINES
Do not design a clock.
Even the best project design solution just gives you a fighting chance during
execution—nothing more. Note that “best” in this context means a design that
is the most calibrated to what your team can produce (in terms of time, cost, and
risk), not necessarily the optimal design.
Without the correct system architecture, at some point the system design will
change. Those changes mean that you will be building a different system which
will void the project design. Once that happens, it does not matter if you had the
best project design at the beginning of the project. As prescribed in the first part
of the book, you need to invest the time to deal with the volatilities, whether or
not you use the structure of The Method to do so.
DESIGN STANCE
You should not apply the ideas in this book dogmatically.
You should adapt the project design tools to your particular circumstances with-
out compromising on the end result. This book aims to show you what is possible,
to trigger your natural curiosity, to encourage you to be creative, and to lead.
When possible, do not design a project in secret. Design artifacts and a visible
design process build trust with the decision makers. If stakeholders ask, educate
them about what you are doing and why you are doing things this way.
OPTIONALITY
Communicate with management in Optionality.
When you engage with management, speak the language I call Optionality: suc-
cinctly describing the options from which management can choose, and enabling
objective evaluation of these options. This is very much aligned with a core con-
cept in project design: There is no “the” project. There are always multiple options
for building and delivering any system. Each option is some viable combination of
time, cost, and risk. You should therefore design several such options from which
management may choose.
The essence of good management is choosing the right option. Moreover, giving
people options empowers them. After all, if there is truly no other option, then
there is also no need for the manager. Managers who lack options from which to
choose will be forced to justify their existence by introducing arbitrary options.
Without a backing project design, such contrived options always have poor results.
To avoid this danger, you must present management with a set of viable project
design options, preselected by you. For example, Chapter 11 investigated a total of
15 project design options, but the corresponding SDP review had only 4 options.
That said, do not overdo Optionality. Giving too many options upsets people, a
predicament known as the paradox of choice.1 This paradox is rooted in the fear
of missing out on some better option you did not choose, even if the option you
did choose was good enough.
COMPRESSION
Do not exceed 30% compression.
Whichever way you choose to compress the project, a 30% reduction in schedule is
the maximum compression you will likely see when starting from a sound normal
solution. Such highly compressed projects will probably suffer from high execution
and schedule risk. When you first begin using the project design tools and building
competency within your team, avoid solutions with more than 25% compression.
Compression reveals the true nature and behavior of the project, and there is
always something to gain by better understanding your own project. Compression
allows you to model the project’s time–cost curve, and obtaining formulas for cost
and risk is helpful when you are required to assess the effect of schedule changes.
It is immensely valuable to be able to quickly and decisively determine the likely
consequence of a change request. The alternative is gut feel and conflict.
1. Barry Schwartz, The Paradox of Choice: Why More Is Less (Ecco, 2004).
Even if you suspect that an incoming request is unreasonable, saying “no”— especially
to a person of authority and power—is not conducive to your career. The only way
to say “no” is to get “them” to say “no.” By showing the quantified effects on sched-
ule, cost, and risk, you immediately bring to the surface what before you could only
intuit, enabling an emotion-free, objective discussion. In the absence of numbers and
measurements, anything goes. Ignorance of reality is not a sin, but malpractice is. If
decision makers are aware of numbers that contradict their commitments to custom-
ers and still persist with those commitments, they are perpetrating fraud. Because
such liability is unacceptable, in the presence of hard numbers, they will find ways of
rescinding their commitments or changing previously “unchangeable” dates.
When relying on top resources, proper project design is essential to know where
to apply them. As appealing as it may be, compressing with top resources may
backfi re. To begin with, top talent is typically scarce, so the top resources you
require to meet your commitments may not be available. Waiting for them cre-
ates delays and defeats the purpose of the compression. Even when available, top
resources may make things worse because leveraging them to compress the critical
path could make a new critical path emerge. Since you assign your resources based
on float and capabilities, you now run the risk that the worst developers will be
working on that new critical path.
Even when assigned to formerly critical activities, the top resources often are idle,
waiting for other activities and developers in the project to catch up. This reduces
the project’s efficiency. To avoid this situation, you may need a larger team that
can compress other paths by working in parallel. Such an increase in team size will
reduce efficiency and increase the cost. Finally, compressing using top resources
often requires two or more such heroes to compress multiple critical or near-criti-
cal paths to see any benefit from the compression.
When assigning top resources, you should avoid doing so blindly (such as assign-
ing the top resource to all current critical activities). Evaluate which network
path would benefit the most from the resources, determine the effect on other
paths, and even try combinations across chains. You may have to reassign the top
resource several times based on the changes to the critical path. You should also
look at activity size as well as the criticality. For example, you may have a large,
noncritical activity with a high level of uncertainty that could easily derail the
project. Assigning the top resources there will reduce that risk and ultimately help
you meet your commitments.
While no project can be accelerated beyond its critical path, no such rule applies
to the front end. Look for ways of working in parallel at the front end on prepara-
tory or evaluation tasks. This would compress the front end (and thus the project)
without any change to the rest of the project. For example, Figure 14-1 shows a
project (the upper chart) with a long front end. The front end contains a few cru-
cial technology and design choices that the architect had to settle before the rest
of the project could proceed. By hiring a second architect as a contractor for two
of these decisions, the front end duration was reduced by a third (the lower chart
in Figure 14-1).
[Figure 14-1: Staffing distribution of a project with a long front end, before (top) and after (bottom) adding a second architect to shorten the front end]
The risk index indicates whether the project will break down when it hits the
first obstacle or whether the project can leverage that obstacle to introduce refine-
ments, adapting to make the design a better approximation of reality. Having
sufficient float (indicated by the low risk) gives you a chance to thrive in the face
of the unforeseen.
I also find that the project's need for float is as much psychological as it is physi-
cal. The physical need is clear: You can consume float to handle changes and shift
resources. The psychological need is the peace of mind of all involved. In projects
with enough float, people are relaxed; they can focus and deliver.
While some of these activities could take place in parallel to other activities, the
activities in system design and project design do have interdependencies. The next
logical step is to design your project design using a simple network diagram and
even calculate the total duration of the effort. Figure 14-2 shows such a network
diagram of the design of project design. You can identify the likely critical path
using typical durations for the activities. If a single architect is designing the proj-
ect, then the diagram will actually be a long string. If the architect has someone
helping, or if the architect is waiting for some piece of information, the diagram
suggests activities to do in parallel.
Activities 6, 7, 8, 9, 10, 11, and 12 in the list (shown in blue in Figure 14-2) are
specific project design solutions. You can further break down each of those into
this list of tasks:
[Figure 14-2: Network diagram of the design of project design, activities 1 through 21]
IN PERSPECTIVE
In any system it is important to distinguish between effort and scope. The archi-
tecture in a software system must be all-encompassing both in scope and in time.
It must include all required components, and it must be correct at the present time
and in the far future (as long as the nature of the business does not change). You
must avoid the very expensive and destabilizing changes that are the result of a
flawed design. When it comes to the effort, the architecture should be very limited.
Part 1 of this book explained how you can come up with a solid, volatility-based
decomposition in a few days to a week, even for a large system. Doing so requires
knowing how to do things correctly, but it is certainly possible with practice and
experience.
Finally, coding is the most time-consuming and the most limited in scope.
Developers should never code more than one service at a time, and they will spend
considerable time testing and integrating each service as well.
Figure 14-3 illustrates in a qualitative manner the scope versus effort for a software
project. You can see that scope and effort are literally inverses of each other. When
something is wider in scope, it is narrow in effort, and vice versa.
[Figure 14-3: Scope versus effort for architecture, detailed design, and coding/testing/integrating]
[Figure 14-4: Architecture, subsystem detailed design, and coding/testing/integrating along the project timeline]
Note that the subsystems are always designed and constructed in the context of the
existing architecture. The effort allocation in Figure 14-4 is still that of Figure 14-3.
You may be able to compress the project and start working in parallel. Figure 14-5
shows two views of concurrent subsystem development aligned against the
timeline.
[Figure 14-5: Two views of concurrent subsystem development along the timeline]
Which parallel life cycle you choose depends on the level of dependencies between
the subsystems of the architecture. In Figure 14-5, the life cycle on the right stag-
gers the subsystems to overlap on the timeline. In this case, you can start building
a subsystem once the implementation of the interfaces of the subsystem on which
it depends are complete. You can then work on the rest of the subsystem in parallel
to the previous one. You can even create fully parallel pipelines like the layout on
the left of Figure 14-5. In this case, you build each subsystem independently of and
concurrently with the other subsystems with minimum integration.
THE HAND-OFF
The composition and makeup of the team has a significant effect on project design.
Here, team composition refers specifically to the ratio of senior to junior develop-
ers. Most organizations (and even individuals) define seniority based on years
of experience. The definition I use is that senior developers are those capable of
designing the details of the services, whereas junior developers cannot. Detailed
design takes place after the major architectural decomposition of the system into
services. For each service, the detailed design contains the design of the service
public interfaces or contracts, its messages and data contracts, and internal details
such as class hierarchies or security.
Note that the definition of senior developers is not developers who already know how
to do detailed design. Instead, senior developers are those capable of doing detailed
design once you show them how to do so correctly.
JUNIOR HAND-OFF
When all you have are junior developers, the architect must provide the detailed
design of the services. This defines the junior hand-off between the architect and
the developers. The junior hand-off disproportionately increases the architect's
workload. For example, in a 12-month project, some 3 to 4 months of the overall
duration could be spent simply on detailed design.
The architect’s detailed design work can take place in the front end or while devel-
opers are constructing some of the services. Both of these options are bad.
Coming up with the correct details of all the services up front is very demanding,
and seeing in advance how all the details across all services mesh together sets a
very high bar. It is possible to design a few services up front, but not all of them.
The real problem is that detailed design in the front end simply takes too long.
Management is unlikely to understand the importance of detailed design and will
cringe at the prospect of extending the front end to accommodate it. Consequently,
management will force handing off the architecture to junior developers and doom
the project.
Designing the services on the fly, in parallel to the developers who are constructing
services that the architect has already designed, could work. However, overload-
ing the architect with detailed design makes the architect a bottleneck and may
considerably slow down the project.
SENIOR HAND-OFF
Senior developers are essential to address the detailed design challenge. If not
already capable of doing so, with modest training and mentoring senior developers
can perform the detailed design work, allowing for a senior hand-off between the
architect and the developers.
With a senior hand-off, the architect can hand off the design soon after the SDP
review, providing only a general outline of the services using gross terms for inter-
faces or just suggesting a design pattern. The detailed design now takes place as
part of each individual service, and the architect just needs to review it and amend
as needed. In fact, the only reason to pay for additional senior developers is to
enable the senior hand-off. The senior hand-off is the safest way of accelerating
any project because it compresses the schedule while avoiding changes to the crit-
ical path, increasing the execution risk, or introducing bottlenecks. Since shorter
projects will cost less, it follows that senior developers actually cost less than
junior developers.
[Figure: The senior hand-off, with the architect responsible for the architecture, senior developers for detailed design, and developers for service construction]
Once the detailed design of the services is complete, the junior developers can
step in and construct the actual services. However, any design refinement, as triv-
ial as it may be, requires the junior developers to consult with the senior devel-
oper who designed that service. Once finished with each service construction, the
junior developers proceed to code review with the senior developers (not their
junior peers), followed by integration and testing with other junior developers.
Meanwhile, the senior developers remain busy with the detailed design of the next
batch of services. Each design is reviewed with the architect before hand-off to the
junior developers.
Working this way is the best and only way of mitigating the risks of the junior
hand-off. Clearly, it also requires meticulous project design. You must know
exactly how many services you can design in advance and how to synchronize the
hand-offs with the construction. You must also add explicit service detailed design
activities and even additional integration points to address the risk of extracting
the detailed design out of the services.
IN PRACTICE
As with system design, when it comes to project design, you must practice. The
basic expectation of professionals—from lawyers to doctors to pilots—is that they
know their trade by heart and that they keep at it. Under fire, everybody sinks to
their level of training. Unfortunately, unlike system design, hardly any software
architect is even aware of project design or is trained in it, even though project
design is both critical to success and, as discussed in Chapter 7, the software archi-
tect’s responsibility.
Compounding the need for project design practice are two additional issues.
First, project design is a vast topic. This book covers the core body of knowledge
required of modern software architects, both system design and project design. In
terms of its page count, project design outweighs system design by 2:1. You should
now have a feeling that you are peering into a deep rabbit hole. You cannot inter-
nalize and correctly use the concepts of this book without training and practice.
Figuring out project design by designing real projects on the job not only is asking
for trouble, but also defies common sense. Would you like to be the first patient of
a doctor fresh out of medical school? Would you like to fly with a new pilot? Are
you proud of your first program?
Second, project design, in many cases, produces non-intuitive results. You will have
to practice not just to master a massive body of knowledge, but also to develop a
new intuition. The good news is that project design skills can be acquired, as is
evident by the swift and marked improvement in the quality of the project designs
and the success rate of those who do practice.
normal solutions for your practice systems. Then, build from there to find the best
solution as far as schedule, cost, and risk.
Examine your own past projects. With the advantage of hindsight, try to recon-
struct the project design that took place and contrast it with what should have
been done. Identify the planning assumptions, the classic mistakes, and the right
decisions. Prepare for an SDP review by listing all the solutions you would have
proposed if you could. Look at your current project. Can you list the activities,
come up with the correct estimations based on what the team is presently doing,
and calculate the true schedule and cost? What is the current risk level? What is
required to decompress the project? What level of compression is feasible?
When you think you have got it right, raise the bar again and find ways of improv-
ing these designs. Never rest on your laurels. Develop new techniques, refine your
own style, and become a passionate expert and advocate of project design.
The debrief topics depend on what you deem important and what needs improve-
ment. They may include the following considerations:
• Estimations and accuracy. For each activity, ask yourself how accurate the ini-
tial estimation was when compared with the actual duration, and how many
times you had to adjust the estimations and in which direction. Is there a notice-
able pattern that you could incorporate in future projects to improve the estima-
tions? Review the initial list of activities to see what you missed and what was
superfluous. Calculate the extent to which the errors in the estimations canceled
each other out.
• Design efficacy and accuracy. Compare the accuracy of the initial broad project
estimation with the detailed project design and the actual duration and cost.
How accurate was your assessment of the throughput of the team? Was risk
decompression necessary, and if so, was it too much or too little? Finally, was
the compressed project doable, and how did the project manager and the team
handle complexity?
• Individual and team work. How well did the team members work as a team or
individually? Were there any bad apples? Can you make the team more produc-
tive in the future by using better tools or technology? Did the team communi-
cate issues in a timely manner? How well did the team members understand the
plan and their role in it?
• What to avoid or improve next time. Compile a prioritized list of all the mis-
takes or troubles encountered across people, process, design, and technology.
For each item, identify how you could have detected it sooner or avoided it in
the first place. List both actions that caused problems and actions that should
have taken place. You should also include near-misses that did not end up caus-
ing harm.
• Recurring issues from previous debriefs. One of the best ways to improve is to
avoid past mistakes and prevent known problems from happening. It is detrimen-
tal to everyone when the same mistakes appear in project after project. There is
likely a very good reason why the same problem is recurrent. Nonetheless, you
must eliminate recurring mistakes in spite of the challenges.
• Commitment to quality. What level of commitment to quality was missing or
present? How intimately related was it to success?
It is important to debrief even successful projects that have met their commitments.
You must know if you have succeeded just because you were lucky or because you
had a viable system and project design. Even when the project is a success, could
you have done a better job? What should you do to sustain the things you did right?
ABOUT QUALITY
In the abstract, everything in this book is about quality. The very purpose of hav-
ing a sound architecture is to end up with the least complex system possible. This
provides for a higher-quality system that will be easier to test and maintain. There
is no denying it: Quality leads to productivity, and it is impossible to meet your
schedule and budget commitments when the product is rife with defects. When the
team spends less time hunting problems, the team spends more time adding value.
Well-designed systems and projects are the only way to meet a deadline.
With any software system, quality hinges on having a project design that includes the
crucial quality-control activities as an integral part of the project. Your project design
must account for the quality control activities both in time and in resources. Do not
cut corners if your project design goal is to build the system quickly and cleanly.
A side effect of project design is that well-designed projects are low-stress projects.
When the project has the time and the resources it requires, people are confident
in their own ability and in their project’s leadership. They know the schedule is
doable and that every activity is accounted for. When people are less stressed, they
pay attention to details, and things do not fall between the cracks, resulting in
better quality. In addition, well-designed projects maximize the team’s efficiency.
This contributes to quality by allowing the team to more readily identify, isolate,
and fix defects in the least costly way.
Your system and project design effort should motivate the team to produce the
highest-quality code possible. You will see that success is addictive: Once people
are exposed to working correctly, they take pride in what they do and will never
go back. No one likes high-stress environments afflicted by low quality, tension,
and accusations.
QUALITY-CONTROL ACTIVITIES
Your project design should always account for quality-control elements or activi-
ties. These include the following:
• Service-level testing. When estimating the duration and effort of each service,
make certain the estimation includes the time needed to write the test plan for
the service, to run the unit test against the plan, and to perform integration
testing. If relevant, add the time to roll the integration testing into your regres-
sion testing.
• System test plan. The project must have an explicit activity in which qualified
test engineers write the test plan. This includes a list of all the ways to break the
system and prove it does not work.
• System test harness. The project must have an explicit activity in which quali-
fied test engineers develop a comprehensive test harness.
• System testing. The project must have an explicit activity in which the software
quality-control testers execute the test plan while using the test harness.
• Daily smoke tests. As part of the indirect cost of the project, on a daily basis,
you must do a clean build of the evolving system, power it up, and (figuratively)
flush water down the pipes. This kind of smoke test will uncover issues in the
plumbing of the system, such as defects in hosting, instantiation, serialization,
connectivity, timeouts, security, and synchronization. By comparing the result
with the previous day’s smoke test, you can quickly isolate plumbing issues.
• Indirect cost. Quality is not free, but it does tend to pay for itself because defects
are horrendously expensive. Make sure to account correctly for the required
investments in quality, especially when it is in the form of indirect cost.
• Test automation scripting. Automating the tests should be an explicit activity
in the project.
• Regression testing design and implementation. The project must have compre-
hensive regression testing that detects destabilizing changes the moment they
happen across the system, subsystems, services, and all possible interactions.
This will prevent a ripple effect of new defects introduced by fixing existing
defects or simply making changes. While executing regression testing on an
ongoing basis is often treated as an indirect cost, the project must contain activ-
ities for writing the regression testing and its automation.
• System-level reviews. Chapter 9 discussed the need to engage in extensive peer
reviews at the service level. Since defects can occur anywhere, you should extend
reviews to the system level. The core team and the developers must review the
system requirements spec, the architecture, the system test plan, the system test
harness code, and any additional system-level code artifacts. With both service
and system reviews, the most effective and efficient reviews are structured in
nature2 and have designated roles (moderator, owner, scribe, reviewers), as well
as follow-ups to ensure the recommendations are applied across the system. At a
minimum, the team should hold informal reviews that involve walking through
these artifacts with one or more peers. Regardless of the method used, these
reviews require a high degree of mutual involvement along with a team spirit of
commitment to quality. The reality is that delivering high-quality software is
a team sport.
This list is only partial. The objective here is not to provide you with all the
required quality-control activities, but rather to get you to think about all the
things you must do in your project to control quality.
QUALITY-ASSURANCE ACTIVITIES
Your project design should always account for quality-assurance activities. Previous
chapters (especially Chapter 9) have already discussed quality assurance, but you
should add the following quality assurance activities to your process and your
project design:
2. https://en.wikipedia.org/wiki/Software_inspection
• Training. By sending the developers to training (or bringing the training in-house), you
instantly eliminate many defects due to learning curves or lack of experience.
• Authoring key SOPs. Software development is so complex and challenging that
nothing should be left to chance. If you do not have standard operating proce-
dures (SOPs) for all key activities, devote the time to researching and writing
them.
• Adopting standards. Similar to SOPs, you must have a design standard (see
Appendix C) and a coding standard. By following best practices, you will pre-
vent problems and defects.
• Engaging QA. Actively engage a true quality-assurance person. Have that
person review the development process, tune it to assure quality, and create a
process that is both effective and easy to follow. This process should support
investigation and elimination of the root cause of defects or, even better, should
proactively prevent problems from happening in the first place.
• Collecting and analyzing key metrics. Metrics allow you to detect problems
before they happen. They include development-related metrics such as esti-
mation accuracy, efficiency, defects found in reviews, quality and complexity
trends, as well as run-time metrics such as uptime and reliability. If required,
devise the activities to build the tools that collect the metrics, and account for
the indirect cost of collecting and analyzing them on a regular basis. Back it up
with an SOP that mandates acting on abnormal metrics.
• Debriefing. As described in the previous section, debrief your work as you prog-
ress, and debrief the project as a whole once it is completed.
The best way of turning this dynamic around is by infecting the team with a
relentless obsession for quality. When totally committed to quality, the team will
drive every activity from a perspective of quality, fixing the broken culture and
creating an atmosphere of engineering excellence. To reach this state, you must
provide the right context and environment. In practice, this means doing every-
thing in this book—and more.
APPENDIX A
PROJECT TRACKING
One of the most misunderstood quotes in history is attributed to Field Marshal
Helmuth von Moltke, the Elder: “No battle plan survives contact with the enemy.”
Ever since, this statement has been taken out of context as a justification for no
planning at all—the complete opposite of its original intent. Von Moltke, known
as the architect of the Franco-Prussian War of 1870, was a military planning genius
credited with a series of stunning military victories. Von Moltke realized that the
key to success in the face of rapidly changing circumstances is to not rely on a single
static plan. Instead, you must have the flexibility to pivot quickly between several
meticulously laid-out options. The purpose of the initial plan is merely to provide a
fighting chance by aligning the available resources with the objective as best as pos-
sible. From that point onward, one must constantly track against the plan and revise
it as needed, often by coming up with variations of the current plan, switching to an
alternative preplanned option, or devising new options altogether.
In the context of system and project design, von Moltke’s insight is as relevant
today as it was 150 years ago. The project design techniques in this book support
two objectives. The first objective is to drive an educated decision during the SDP
review, to ensure the decision makers choose a viable option. Such an option serves
as a good-enough starting point coming into execution, allowing for a fighting
chance. The second objective for project design is to adapt the plan during execu-
tion. The project manager must constantly correlate what is actually going on with
the plan, and the architect needs to use the project design tools to redesign the
project to respond to reality. This often takes the form of modest project redesign
iterations. You want to avoid any gross corrections, and instead drive the project
smoothly using numerous small corrections. Otherwise, the degree of correction
required may be wrenching and cause the project to fail.
A good project plan is not something you sign off and file in a drawer, never again
to see the light of day. A good project plan is a live document that you constantly
revise to meet your commitments. This requires knowing where you are with
respect to the plan, where you are heading, and what corrective actions to take
in response to changing circumstances. This is what project tracking is all about.
Project tracking is part of project management and execution and is not the
responsibility of the software architect. I therefore include project tracking in this
book, but as an appendix to the main discussion of system and project design.
[Figure A-1: The service life cycle — SRS, SRS review, some construction, service test plan (STP), STP review, detailed design, design review, construction, test client, code review, integration, and testing]
Each service starts with a service requirement spec (SRS). This can be brief, as
little as a few paragraphs or pages outlining what the service is required to do.
The architect needs to review the SRS. With the SRS in place, the developer can
proceed to write a service test plan (STP) listing all the ways the developer will
later demonstrate the service does not work. Even with a senior hand-off, when the
developer is capable of performing the detailed design of the service, the developer
cannot always start the detailed design without gaining additional insight into the
nature of the service. The best way of obtaining that insight is via some construc-
tion to get a first-hand understanding of what the technology can provide or what
the available detailed design options are. Armed with that insight, the developer
can proceed to design the details of the service, which the architect then reviews
(perhaps with others). Once the detailed design is approved, the developer can
construct the code for the service. In tandem with the construction of the service,
the developer builds a white-box test client. This test client enables the devel-
oper to test every parameter, condition, and error-handling path by invoking the
debugger on the evolving code. With the code complete, the developer reviews the
code with the architect and the other developers, integrates the service with other
services, and finally performs black-box unit testing against the test plan.
Note that each review task in the diagram must complete successfully. A fail-
ing review causes the developer to repeat the preceding internal task. For clarity,
Figure A-1 does not show these retries.
For example, the Detailed Design phase may include some construction, the
detailed design itself, and the design review. The Construction phase may
include the actual construction, the test client, and the code review.
[Figure A-2: The life cycle of Figure A-1 grouped into phases — Requirements, Detailed Design, Test Plan, Construction, and Integration]
PHASE WEIGHT
While each activity may have multiple phases, these phases may not contribute
equally to the completion of the activity. You need to assess the contribution of
a phase in the form of a weight—in this case, a percentage. For example, con-
sider the activity with the phases listed in Table A-1. In this sample activity, the
Requirements phase counts for 15% of the completion of the activity, while the
Detailed Design phase counts for 20% of the completion.
You can allocate the weight of the phases in several ways. For example, you can
estimate the importance of the phase, or you can estimate the duration in days for
each phase and divide by the sum of all phases. Alternatively, you can just divide
by the number of phases (e.g., with 5 phases, each phase counts as 20%), or you
can even consider the type of the activity. For example, you may decide that the
Requirements phase will be weighted 40% for the UI activity and only 10% for
the Logging activity.
Table A-1  Phases and weights of a sample activity

Phase             Weight (%)
Requirements          15
Detailed Design       20
Test Plan             10
Construction          40
Integration           15
Total                100
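As a brief illustration of the duration-based allocation (the day estimates here are hypothetical, chosen to reproduce Table A-1): if the five phases are estimated at 3, 4, 2, 8, and 3 days, the total is 20 days, and dividing each estimate by that total yields weights of 15%, 20%, 10%, 40%, and 15%.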
For accurate tracking, it does not matter much which technique you use to allocate
the weight of the phases as long as you apply the technique consistently across all
activities. In most decent-size projects, you will end up with hundreds of phases
across all activities. On average, any discrepancies in assigning weights will cancel
each other out.
ACTIVITY STATUS
Given the binary exit criterion and the weight of each phase, you can calculate the
progress of each activity at any point in time. With tracking, progress is the com-
pletion status of an activity (or of the entire project) as a percentage.
The formula for the progress of an activity is:

A(t) = Σ Wj (summing j from 1 to m)

where:

• A(t) is the progress of the activity at time t
• Wj is the weight of phase j
• m is the number of phases completed by time t

The progress of the activity at the time t is the sum of the weights of all the phases
that are completed by the time t. For example, using Table A-1, if the first three
phases are complete, the progress of the activity is 15% + 20% + 10% = 45%.
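As a minimal sketch of automating this calculation (the class, method, and phase names are illustrative; the weights mirror Table A-1):

using System;
using System.Collections.Generic;
using System.Linq;

public static class ActivityTracking
{
    // Phase weights as fractions of the activity, per Table A-1.
    static readonly Dictionary<string, double> PhaseWeights = new()
    {
        ["Requirements"]    = 0.15,
        ["Detailed Design"] = 0.20,
        ["Test Plan"]       = 0.10,
        ["Construction"]    = 0.40,
        ["Integration"]     = 0.15
    };

    // A(t): sum of the weights of the phases completed by time t.
    // The exit criterion is binary: a phase is either complete or it is not.
    public static double Progress(IEnumerable<string> completedPhases) =>
        completedPhases.Sum(phase => PhaseWeights[phase]);

    public static void Main()
    {
        var completed = new[] { "Requirements", "Detailed Design", "Test Plan" };
        Console.WriteLine($"A(t) = {Progress(completed):P0}");   // 45%
    }
}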
Similarly to calculating the progress of an activity, you can and should keep track
of the effort spent on each activity. With tracking, effort is the amount of direct
cost spent on the activity (or on the entire project) as a percentage of the estimated
direct cost for the activity (or for the entire project). The formula for the effort of
an activity is:
C(t) = S(t) / R

where:

• C(t) is the effort of the activity at time t
• S(t) is the direct cost spent on the activity by time t
• R is the estimated direct cost of the activity
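For example (with hypothetical numbers), if an activity has an estimated direct cost of R = 10 days and the team has spent S(t) = 6 days on it by time t, its effort is C(t) = 6 / 10 = 60%.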
Note Both progress and effort are unitless: They are percentages. This
enables you to avoid specific values and compare them in the same analysis.
PROJECT STATUS
The formula for the progress of the project is:
P(t) = Σ (Ei × Ai(t)) / Σ Ei (both sums running over i from 1 to N)

where:

• P(t) is the progress of the project at time t
• N is the number of activities in the project
• Ei is the estimated duration of activity i
• Ai(t) is the progress of activity i at time t
The overall project progress at the time t is a ratio between two sums of estima-
tions. The fi rst is the sum of all the estimated duration of each individual activity
multiplied by the activity’s progress. The second is the sum of all activity estima-
tions. Note that this simple formula provides the progress of the project across all
activities, developers, life cycles, and phases.
To illustrate this point, consider the simple project in Table A-2. Suppose at the
time t the UI activity is only 45% complete. Since 45% of 20% is 9%, the work
done so far in the UI activity has earned 9% toward the completion of the project.
In much the same way, you can calculate the actual earned value of all activities
in the project at time t.
Table A-2  Actual earned value of the project's activities

Activity          Estimated Duration (days)   Weight (%)   Progress (%)   Earned Value (%)
UI                           40                   20            45               9
Manager Service              20                   10             0               0
Utility Service              40                   20             0               0
System Testing               30                   15             0               0
Summing up the actual earned value of all activities in Table A-2 reveals that the
project is 40.25% complete at time t. This is the same value produced by the
progress formula.
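A matching sketch of the project progress calculation (the record and values are illustrative; the activity list here is deliberately truncated, so it does not reproduce the 40.25% figure above):

using System;
using System.Collections.Generic;
using System.Linq;

public record Activity(string Name, double EstimatedDuration, double Progress);

public static class ProjectTracking
{
    // P(t): sum of (Ei * Ai(t)) across all activities, divided by the sum of Ei.
    public static double ProjectProgress(IReadOnlyCollection<Activity> activities)
    {
        double earned    = activities.Sum(a => a.EstimatedDuration * a.Progress);
        double estimated = activities.Sum(a => a.EstimatedDuration);
        return earned / estimated;
    }

    public static void Main()
    {
        var activities = new List<Activity>
        {
            new("UI", 40, 0.45),               // the UI row of Table A-2
            new("Manager Service", 20, 0.00),
            new("Utility Service", 40, 0.00),
            new("System Testing", 30, 0.00)
            // ...the remaining activities of the project
        };
        Console.WriteLine($"P(t) = {ProjectProgress(activities):P1}");
    }
}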
ACCUMULATED EFFORT
The formula for the effort of the project is:
C(t) = Σ Si(t) / Σ Ri (both sums running over i from 1 to N)

where:

• C(t) is the effort of the project at time t
• N is the number of activities in the project
• Si(t) is the direct cost spent on activity i by time t
• Ri is the estimated direct cost of activity i
The overall project effort is simply the sum of direct cost spent across all activities
divided by the sum of all direct cost estimations of all activities. This provides
effort as the overall direct cost expenditure as a percentage of the planned direct
cost of the project.
Again note the similarity of the project effort to the planned earned value formula.
If each activity is assigned to one resource, and the activities end up costing exactly
as planned and complete on the planned dates, then the effort curve will match
the planned earned value curve. If more (or less) than one resource is planned per
activity, then you will have to track the effort against its own planned direct cost
curve. However, in most projects the two curves should match closely. For sim-
plicity’s sake, the rest of this appendix assumes that each activity is planned for
one resource.
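A corresponding sketch for the project effort, under the same one-resource-per-activity assumption (the names are illustrative):

using System.Collections.Generic;
using System.Linq;

public static class EffortTracking
{
    // C(t): total direct cost spent across all activities so far, divided by
    // the total estimated direct cost, expressed as a fraction of the plan.
    public static double ProjectEffort(
        IReadOnlyCollection<(double Spent, double Estimated)> activities) =>
        activities.Sum(a => a.Spent) / activities.Sum(a => a.Estimated);
}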
Since indirect cost is independent of both the progress and effort of the project,
tracking indirect cost is not terribly useful. All you are likely to see is a straight line
going up, which does not help to suggest any corrective actions.
Tracking indirect cost is helpful, however, in one case: when reporting the total
cost of the project to date, in which case you should add the indirect cost to the
direct cost. The rest of this appendix looks at only the accumulated direct cost (the
effort) when tracking the project and comparing it with the plan.
[Figure A-3: Plan, progress, and effort for a sample project, tracked from 11/09 to 05/30, with percent complete on the vertical axis]
The blue line in Figure A-3 shows the planned earned value of the project. The
planned earned value should have been a shallow S curve; you will see shortly
why it deviated from that form in this example. To the point in time shown on the
graph, the green line shows the actual progress of the project (the actual earned
value) and the red line illustrates the effort spent.
PROJECTIONS
Project tracking allows you to see exactly where the project is and where it has
been. The real question, however, is not what the current status of the project is,
but rather where the project is heading. To answer that question, you can project
the progress and effort curves. Consider the generic project view of Figure A-4.
[Figure A-4: Generic progress and effort projections — the planned completion date (point 1), the projected progress at that date (point 2), the point where the projected progress reaches 100% (point 3), the projected completion date and schedule overrun (point 4), and the projected completion effort and cost overrun (point 5)]
For simplicity, Figure A-4 replaces the shallow S curves with their linear regression
trend lines, shown as solid lines in the figure. The blue line represents the planned
earned value of the project. Ideally the green progress line and the red effort line
should match the blue line. The project is expected to complete when the planned
earned value reaches 100%, point 1 in Figure A-4. However, you can see that the
green line (actual progress) is below the plan.
If you extrapolate the green progress line, you get the dashed green line in Figure
A-4. You can see that by the time of point 1, the projected progress line reaches
only about 65% of completion (point 2 in Figure A-4). The project will actually
complete when the projected progress line reaches 100%, or point 3 in Figure A-4.
The time of point 3 is point 4 in Figure A-4, and the difference between points 4
and 1 is the projected schedule overrun.
In much the same way, you can project the measured effort line and find point 5 in
Figure A-4. The difference in effort between points 5 and 3 in Figure A-4 is the
projected direct cost overrun (in percentage) of the project.
Note Since the indirect cost is usually linear with time, the projected
schedule overrun in percentage also indicates the projected indirect cost
overrun.
Suppose this is a year-long project, and you measure the project on a weekly inter-
val. A month into the project you already have four reference points, enough to
run a regression trend line that is well fitted to the measured progress and effort.
Recall from Chapter 7 that the pitch or slope of the earned value curve represents
the throughput of the team. Therefore, a month into a year-long project, you already
have a good idea where the project is heading via a projection that is highly calibrated
to the actual throughput of the team. The initial planned earned value was just
that—initial. The projected progress and effort lines are what will likely happen.
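One possible sketch of such a projection — a least-squares trend line through the weekly progress measurements, solved for the day at which it reaches 100% (the sample points and names are hypothetical):

using System;
using System.Linq;

public static class ProgressProjection
{
    public static void Main()
    {
        // Weekly measurements: (days since tracking started, actual earned value).
        (double Day, double Progress)[] samples =
        {
            (7, 0.02), (14, 0.05), (21, 0.07), (28, 0.10)
        };

        // Least-squares fit: progress ≈ slope * day + intercept.
        double meanX = samples.Average(s => s.Day);
        double meanY = samples.Average(s => s.Progress);
        double slope = samples.Sum(s => (s.Day - meanX) * (s.Progress - meanY)) /
                       samples.Sum(s => (s.Day - meanX) * (s.Day - meanX));
        double intercept = meanY - slope * meanX;

        // The slope is the team's measured throughput; the projected completion
        // is the day at which the trend line reaches 100%.
        double completionDay = (1.0 - intercept) / slope;
        Console.WriteLine($"Throughput {slope:P2}/day, projected completion around day {completionDay:F0}");
    }
}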
Figure A-5 shows the actual projections for Figure A-3. Given the trend of the
projections, the project is expected to slip its schedule by about a month (or 13%)
and to overrun its effort by some 8%.
In Figure A-3 and Figure A-5, the planned earned value is a truncated shallow S
curve because tracking started for this project after the SDP review. By deliber-
ately dropping the very shallow start of the plan, the linear trend line projections
become a better fit to the curves.
[Figure A-5: Progress and effort projections for the project of Figure A-3, showing a projected completion around 06/25 and a projected effort of about 108%]
ALL IS WELL
Consider the progress and effort projections of Figure A-6. In the figure, the pro-
jected progress and effort lines coincide with the plan, and the project is poised to
deliver on its commitments. You need do nothing about this state of affairs; there
is no need to help or try to improve matters. Knowing when not to do something
is as important as knowing when to do something.
[Figure A-6: Progress and effort projections coinciding with the plan]
The only way to meet the deadline at the end is to be on time throughout the
project.
Staying on the original plan (or on a revised plan) will never happen on its own
and requires constant tracking by the project manager and numerous corrective
actions throughout the project execution. You must respond to the information
revealed by the trajectory of the projections and avoid opening a gap between the
progress, the effort, and the plan.
UNDERESTIMATING
Consider the earned value and effort projections of Figure A-7. This project is
clearly not doing well. Progress is below the plan, while effort is above the plan.
The likely explanation is that you have underestimated the project and its activities.
[Figure A-7: Progress below plan and effort above plan — underestimation]
Corrective Actions
There are two obvious corrective actions when dealing with underestimation. The
first is to revise the estimations upward based on the (now known) throughput of
the team. In fact, you can see when the projected progress line reaches 100%, and
that point in time becomes the new completion date of the project. Effectively,
you will be pushing down the blue plan line until it meets the green progress line.
This is a typical remedy in a feature-driven project, where you must achieve parity
with a competing product or a legacy system, and there is no point in releasing the
system while missing key aspects.
However, pushing the deadline out will not do in a date-driven project where you
must release on a set date. In this case, you should take the second type of correc-
tive action: reduce the scope of the project. By reducing the scope, the earned value
of what the team has produced so far counts more, and the green progress line will
come up to meet the blue plan line.
You can certainly apply a combination of pushing the deadline and reducing the
scope, and the progress projection will tell you exactly how much or how little of
each remedy is required. Whichever response you choose will require redesigning
the project.
Sadly, the knee-jerk reaction of many who do not wish to compromise on the
deadline or the scope is to throw more people on the project. As Dr. Frederick
Brooks observed, this is like trying to put out a fire by dousing it with gasoline.1
There are several reasons why adding people to a late project almost always makes
matters much worse. First, even if adding people brings the green progress line
closer to the blue plan, it will make the red effort line shoot up. It does not make
sense to supposedly fix one aspect of the project by breaking another (especially if
the project is already using more people than planned, as in Figure A-7). Second,
you will have to onboard and train the new people. This requires interrupting the
other team members, who often are the most qualified individuals and, impor-
tantly, are likely on the critical path; halting or slowing down their work will
mean incurring a further delay to the project. You will end up paying both for the
ramp-up time for the new people and for the time lost from the existing team who
are assisting with the onboarding. Finally, even without the onboarding cost, the
new team will be larger—and hence less efficient.
There is one exception to this rule, which is near the project’s origin. At the begin-
ning you can invest in wholesale onboarding of the team members. More impor-
tantly, you can get away with adding people at the origin because you can pivot to
a more aggressive, compressed project design solution. Such a solution typically
does require additional resources due to the parallel work. Note that compressing
the project will introduce a higher level of risk and complexity, so you need to
carefully weigh the full effect of the new solution.
RESOURCE LEAK
Consider the progress and effort projections of Figure A-8. In this project, both the
progress and the effort are under the plan, and progress is even under the effort.
This is often the result of resource leaks: People are assigned to your project but
they are working on someone else’s project. As a result, they cannot spend the
required effort, and progress lags further. Resource leaks are endemic in the soft-
ware industry, and I have observed leaks as high as 50% of the effort.
[Figure A-8: Progress and effort both below plan — a resource leak]
Corrective Actions
When identifying a resource leak, the natural instinct is to simply plug the leak.
Plugging the leak, however, will tend to backfire: It could detonate the other proj-
ect while making you the culprit. The best resolution is to call a meeting between
the project manager of the project into which your team is leaking, yourself, and
the lowest-level manager responsible for both. After showing the projections chart
(such as Figure A-8), you present the overseeing manager with two options. If
the other project is more important than your own project, then the green line
in Figure A-8 represents what your team can produce under these new circum-
stances, and the deadline must move to accommodate that. But if your project is
more important, then the project manager of the other team must immediately
revoke all source control access to your team members and perhaps even assign
a few of the other project’s top resources to your project to compensate for the
damage already done. By presenting the resolution options this way, whatever the
manager decides, you win and regain the chance to meet your commitments.
OVERESTIMATING
Consider the progress and effort projections of Figure A-9. While it may look like
this project is doing very well because progress is above plan, in reality the proj-
ect is in danger due to overestimating. As discussed in Chapter 7, overestimating
is just as deadly as underestimating. An additional problem with the project in
Figure A-9 is that the project is spending more effort than what the plan called for.
This may be because too many people were assigned to the project or because the
project is working in an unplanned parallel manner.
[Figure A-9: Progress and effort both above plan — overestimation]
Corrective Actions
One simple corrective action for overestimating is to revise the estimations down-
ward and bring the deadline in. The blue plan line in Figure A-9 will then come
up to meet the green progress line, and you can calculate by how much to do just
that. Unfortunately, bringing the deadline in likely has only downsides. Often
delivering the system ahead of schedule has no benefits. For example, the customer
may not pay until the agreed deadline, or the servers may not be ready, or the
team might have nothing to do next. At the same time, reducing the duration will
increase the pressure on the team. The way people respond to pressure is nonlin-
ear. Some modest pressure may have positive results, whereas excessive pressure
demotivates. If the team members respond to the pressure by giving up, the project
implodes. It is usually hard to know where that thin line is.
Another corrective action is to keep the deadline where it is but revise the estima-
tions downward and over-deliver by increasing the scope of the project. Adding
things to do (perhaps starting work on the next subsystem) will reduce the actual
earned value and the green progress line will come down to meet the blue plan
line in Figure A-9. Adding value is always a good thing, but it does carry the
over-pressure risk.
The best way of fixing overestimation is to release some of your resources. When
you do so, the red effort line will go down since a smaller team costs less. The green
progress line will come down because the smaller team has a reduced throughput.
The smaller team should also be more efficient. If you detect the overestimation
early enough, you can even choose a less compressed project solution.
MORE ON PROJECTIONS
The projections allow you to analyze where the project is heading long before an
underlying problem becomes severe. Examine Figure A-4 again. Waiting until
the project reaches point 2 in the figure and then correcting it up to the blue line
requires a painful, if not devastating, maneuver. Using projections, you can detect
the trend much earlier, and perform a smaller correction before any significant
gap appears between the lines. The earlier the action, the more time it has to take
effect, the less disruptive it is to the rest of the project, the easier it is to run it past
management, and the more likely it is to succeed. It is always better to be proactive
than reactive, and an ounce of prevention is often worth a pound of cure.
Very much like driving a car, in your project execution you make frequent small
corrections as opposed to a few drastic ones. Good projects are always smooth,
whether in the planned earned value, the staffing distribution chart or, as in this
case, the progress and effort lines.
Note that the technique shown here is analyzing the trend of the project, not the
actual project. This is the correct way of driving the project. To use the car anal-
ogy again, you do not drive your car forward by looking down at the pavement or
looking strictly in the rear-view mirror. Where the car is now or where it has been
is largely irrelevant for driving it forward. You drive your car looking at where the
car is going to be and taking corrective actions against that projection.
Combining projections with project design is the ultimate way of handling unan-
ticipated changes to the scope of the project. When anyone tries to increase (or
decrease) the scope of the project and asks for your approval or consent, you
should politely ask to get back to them with your answer. You now need to rede-
sign the project to assess the consequences of the change. This redesign could be
minor if the change does not affect the critical path or the cost and is within the
capability of the team. Use the projections to judge your ability to deliver on the
new plan from the perspective of the actual throughput and cost. Of course, the
change could extend the duration of the project and increase the cost and demand
for resources. You may have to choose another project design option, or even
devise new project design options altogether.
When you get back to management, present the new duration and total cost that
the change requires, including the new projections, and ask if they want to do it.
If they cannot afford the new schedule and cost implications, then nothing really
changed. If they accept them, then you have new schedule and cost commitments
for the project. Either way, you will always meet your commitments. These com-
mitments may not be the original ones the project started with, but then again, you
are not the one who changed the plan.
BUILDING TRUST
Most software teams fail to meet their commitments. They have given man-
agement no reason to trust them and every reason to distrust them. As a result,
management dictates impossible deadlines, while fully expecting them to slip. As
discussed in Chapter 7, aggressive deadlines drastically reduce the probability of
success, manifesting failure as a self-fulfilling prophecy.
Project tracking is a good way of breaking that vicious cycle. You should share the
projections with every possible decision maker and manager. Constantly show the
project’s current benign status and the future trends. Demonstrate the ability to
detect problems months before they raise their head. Insist on (or just take) correc-
tive actions. All of these actions will establish you as a responsible, accountable,
trustworthy professional. This will lead to respect and eventually trust. When you
have gained the trust of those above you, they will tend to leave you alone to do
your work, allowing you to succeed.
APPENDIX B
SERVICE CONTRACT DESIGN
The first part of this book addressed the system architecture: How to decompose
the system into its components and services and how to compose the required
behavior out of the services. This is not the end of the design, and you must con-
tinue the process by designing the details of each service.
Detailed design is a vast topic, worthy of its own book. This appendix limits its
discussion of detailed design to the most important aspect of the design of a service:
the public contract that the service presents to its clients. Only after you have settled
on the service contract can you fill in internal design details such as class hierarchies
and related design patterns. These internal design details as well as data contracts
and operation parameters are domain-specific and, therefore, outside the scope of
this appendix. However, in the abstract, the same design principles outlined here for
the service contract as a whole apply even at the data contract and parameter levels.
This appendix shows that even with a task as specific to your system as the design
of contracts for your services, certain design guidelines and metrics transcend
service technology, industry domains, or teams. While the ideas in this appendix
are simple, they have profound implications for the way you go about developing
your services and structuring the construction work.
Next, consider the design in Figure B-2. Is this a good design for your system? The
system design in Figure B-2 uses a huge number of small components or services
to implement the system (to reduce the visual clutter, the figure does not show
cross-service interaction lines). In theory, you could build any system this way by
placing every requirement in a separate service. That, too, is not just a bad design,
but another canonical example of what not to do. As with the previous case, you
also cannot validate such a design.
Finally, examine the system design in Figure B-3. Is this a good design for your
system? While you cannot state that Figure B-3 is a good design for your system,
you could say that it is certainly a better design than a single large component or
an explosion of small components.
[Figure B-4: Size and quantity effect on cost — cost versus the number of services, with a minimum-cost area between the extremes. Image adopted and modified from Juval Lowy, Programming .NET Components, 2nd ed. (O'Reilly Media, 2003); Juval Lowy, Programming WCF Services, 1st ed. (O'Reilly Media, 2007); and Edward Yourdon and Larry Constantine, Structured Design (Prentice-Hall, 1979).]
When you build a system out of smaller building blocks such as services, you
have to pay for two elements of cost: the cost of building the services and the cost
of putting it all together. You can build a system at any point on the spectrum
between one large service and countless little services, and Figure B-4 captures the
effect of that decomposition decision on the cost of building the system.
Note Part 2 of this book discussed system cost as a function of time and
the design of the project. Figure B-4 shows another dimension—how the
system cost is a function of the architecture of the system and the granular-
ity of the services. Different architectures will have different time–cost–risk
curves.
INTEGRATION COST
The integration cost of services increases in a nonlinear way with the number of
services. This, too, is the result of complexity—in this case, the complexity of the
possible interactions. More services imply more possible interactions, adding more
complexity. As mentioned in Chapter 12, due to connectivity and ripple effects,
as the number of services (n) increases, complexity grows in proportion to n2 but
can even be on the order of nn. This interaction complexity directly affects the
integration cost, which is why the integration cost (the red line in Figure B-4) is
also a nonlinear curve. Consequently, at the far right side of Figure B-4, the inte-
gration cost shoots up ever higher as the number of services increases. In contrast,
at the far left side of the curve, where there is perhaps only a single large service, the
integration cost approaches zero since there is nothing to integrate.
What you must avoid are the edges of the chart, because these edges are nonlin-
early worse and become many multiples (even dozens of times) more expensive.
The challenge with building a nonlinearly more expensive system is that the tools
all organizations have at their disposal are fundamentally linear tools. The orga-
nization can give you another developer and then another developer, or another
month and then another month. But if the nature of the underlying problem is
nonlinear, you will never catch up. Systems designed outside the area of minimum
cost have already failed before anyone has written the first line of code.
While every service contract is an interface, not all interfaces are service contracts. Service contracts
are a formal interface that the service commits to support, unchanged.
To use an analogy from the human world, life is full of both formal and informal
contracts. An employment contract defines (often using legal jargon) the obliga-
tions of both the employer and the employee to each other. A commercial contract
between two companies defines their interactions as a service provider and a ser-
vice consumer. These are formal forms of interfacing, and the parties to the con-
tract often face severe implications if they violate the contract or change its terms.
In contrast, when you hail a taxi, there is an implied informal contract: The driver
will take you safely to your destination, and you will pay for this service. Neither
of you signed a formal contract describing the nature of that interaction.
CONTRACTS AS FACETS
A contract goes beyond being just a formal interface: It represents a facet of the
supporting entity to the outside world. For example, a person can sign an employ-
ment contract representing that person as an employee. That person could have
other facets, but the employer only sees and cares about that particular facet. A
person can sign additional contracts such as a land lease contract, a marriage con-
tract, a mortgage contract, and so on. Each one of these contracts is a facet of the
person: as an employee, as a landlord, as a spouse, or as a homeowner. Similarly,
a service can support more than one contract.
In reality, a single service can support multiple contracts, and multiple services
can support a specific contract. In these cases, the curves in Figure B-4 shift left to
right or up and down, but their behavior remains the same.
The real question is, “What is a good contract?” Good contracts are logically consistent, cohesive, and inde-
pendent facets of the service. These attributes are best explained using analogies
from daily life.
Would you sign an employment contract that states you can only work at the
company so long as you live at a specific address? You would reject such a contract
because it is logically inconsistent to condition your employment status on your
address. After all, if you do the agreed-upon work to the expected standard, where
you live is irrelevant. Good contracts are always logically consistent.
Would you sign an employment contract that does not specify how much you are
paid? Again, you would reject it. Good contracts are always cohesive and contain
all the aspects required to describe the interaction—no more, no less.
Would you make your marriage contract dependent on your employment contract?
You would reject this contract because the independence of the contract is just as
important. Each contract or facet should stand alone and operate independently
of the other contracts.
These attributes also guide the process of obtaining the contract. Would you pay a
real estate lawyer to craft a contract just for you to rent your apartment? Or would
you search the web for an apartment rental contract, print the first search result,
fill in the blanks with the address and the rent, and be done with it? If an online
contract is good enough for millions of other rentals without being specific to any
apartment (which would truly be a nontrivial achievement), would it not be good
enough for you? The contract must have evolved to include all the cohesive details
such as rent and to avoid the inconsistent things like where the renters work. It
must also be independent of other contracts—that is, a true stand-alone facet.
Note that you are not searching for a better contract than anyone else is using. You
simply want to reuse the very same contract that everyone else is using. It is precisely
because it is so reusable that it is a good contract. The final observation is that logi-
cally consistent, cohesive, and independent contracts are reusable contracts.
Note that reusability is not a binary trait of a contract. Every contract lies some-
where on the spectrum of reusability. The more reusable the contract, the more it
is logically consistent, cohesive, and independent. Imagine the contract in front of
the service in Figure B-1. That contract is massive, and it is extremely specific for
that particular service. It is certainly logically inconsistent because it is a bloated
dumping ground for everything that the system does. The likelihood that anyone
else in the world will ever reuse that service contract is basically zero.
Next, imagine the contract on one of the tiny services in Figure B-2. That contract
is miniscule and extremely specialized for its context. Something so small cannot
possibly be cohesive. Again, the likelihood that anyone else will ever reuse that
contract is zero.
The services in Figure B-3 offer at least some hope. Perhaps the contracts on
the services in Figure B-3 have evolved to include everything pertaining to their
interactions—no more, no less. The small number of interactions also indicates
independent facets. The contracts could very well be reusable.
Figure B-5 Reusing interfaces [Figure inspired by Matt Ridley, The Rational Optimist:
How Prosperity Evolves (HarperCollins, 2010). Images: Mountainpix/Shutterstock; New
Africa/Shutterstock.]
Our species has been reusing the “tool–hand” interface since prehistoric times.
While no grain of stone from the stone axe is reusable in the mouse, and no piece
of electronics from the mouse is useful in the stone axe, both reuse the same inter-
face. Good interfaces are reusable, while the underlying services never are.
FACTORING CONTRACTS
When designing the contracts for your services, you must always think in terms
of elements of reuse. That is the only way to assure that even after architecture
and decomposition, your services will remain in the area of minimum cost. Note
that the obligation to design reusable contracts has nothing to do with whether
someone will actually end up reusing the contracts. The degree of actual reuse
or demand for the contract by other parties is completely immaterial. You must
design the contracts as if they will be reused countless times in perpetuity, across
multiple systems including your current one and those of your competitors. A sim-
ple example will go a long way to demonstrate this point.
DESIGN EXAMPLE
Suppose you need to implement a software system for running a point-of-sale
register. The requirements for the system likely have use cases for looking up an
item’s price, integrating with inventory, accepting payment, and tracking loyalty
programs, among others. All of this can easily be done using The Method and the
appropriate Managers, Engines, and so on. For illustration purposes, suppose the
system needs to connect to a barcode scanner and read an item’s identifier with it.
The barcode scanner device is nothing more than a Resource to the system, so you
need to design the service contract for the corresponding ResourceAccess service.
The requirements for the barcode scanner access service are that it should be able
to scan an item’s code, adjust the width of the scanning beam, and manage the
communication port to the scanner by opening and closing the port. You could
define the IScannerAccess service contract like so:
interface IScannerAccess
{
long ScanCode();
void AdjustBeam();
void OpenPort();
void ClosePort();
}
You may feel content because you have reused the IScannerAccess service con-
tract across multiple services.
FACTORING DOWN
Sometime later, the retailer contacts you with the following issue: In some cases
it is better to use other devices, such as a numerical keypad, to enter the item code.
However, the IScannerAccess contract assumes the underlying device uses
some kind of an optical scanner. As such, it is unable to manage non-optical
devices such as numerical keypads or radio frequency identification (RFID) read-
ers. From a reuse perspective, it is better to abstract the actual reading mecha-
nism and rename the scanning operation to a reading operation. After all, which
mechanism the hardware device uses to read the item code should be irrelevant to
the system. You should also rename the contract to IReaderAccess and ensure
there is nothing in the contract’s design that precludes all types of code readers
from reusing the contract. For example, the AdjustBeam() operation is mean-
ingless for a keypad. It is better to break up the original IScannerAccess
into two contracts, and factor down the offending operation:
interface IReaderAccess
{
long ReadCode();
void OpenPort();
void ClosePort();
}
interface IScannerAccess : IReaderAccess
{
void AdjustBeam();
}
FACTORING SIDEWAYS
With that change done, more time passes, and the retailer decides to have the soft-
ware also control the conveyer belt attached to the point-of-sale workstation. This
requires the software to start and stop the belt, as well as manage its communica-
tion port. While the conveyer belt uses the same kind of communication port as
the reading devices, the belt cannot reuse IReaderAccess because the contract
does not support a conveyer belt, and the belt cannot read codes. Furthermore,
there is a long list of such peripheral devices, each with its own functionality, and
the introduction of every one of them will duplicate parts of the other contracts.
Observe that every change in the business domain leads to a reflected change in
the system’s domain. This is the hallmark of a bad design. A good system design
should be resilient to changes in the business domain. The solution is to factor the
port management sideways, into its own contract that any connected device can reuse:
interface ICommunicationDevice
{
void OpenPort();
void ClosePort();
}
interface IReaderAccess
{
long ReadCode();
}
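The scanner's service can then support both facets. A minimal sketch, using the class name referenced in the next paragraph and the {...} shorthand the other listings use for an omitted body:

class BarcodeScanner : IReaderAccess, ICommunicationDevice
{...}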
Note that the sum of work inside BarcodeScanner is exactly the same as with
the original IScannerAccess. However, because the communication facet
is independent of the reading facet, other entities (such as belts) can reuse the
ICommunicationDevice service contract and support it:
interface IBeltAccess
{
void Start();
void Stop();
}
class ConveyerBelt : IBeltAccess, ICommunicationDevice
{...}
The real issue with the point-of-sale system was not the specifics of the reading
devices, but rather the volatility of the type of devices connected to the system.
Your architecture should rely on volatility-based decomposition. As this simple
example shows, the principle extends to the contract design of individual services
as well.
FACTORING UP
Factoring operations into separate contracts (like ICommunicationDevice out
of IReaderAccess) is usually called for whenever there is a weak logical relation
between the operations in the contract.
Sometimes identical operations are found in several unrelated contracts, and these
operations are logically related to their respective contracts. Not including them
would make the contract less cohesive. For example, suppose that for safety rea-
sons, the system must immediately abort all devices. In addition, all devices must
support some kind of diagnostics that assures they operate within safe limits.
Logically, aborting is just as much a scanner operation as reading, and just as
much a belt operation as starting or stopping.
In such cases, you can factor the service contracts up, into a hierarchy of contracts
instead of separate contracts:
interface IDeviceControl
{
void Abort();
long RunDiagnostics();
}
interface IReaderAccess : IDeviceControl
{...}
interface IBeltAccess : IDeviceControl
{...}
MEASURING CONTRACTS
It is possible to measure contracts and rank them from worst to best. For exam-
ple, you could measure the cyclomatic complexity of the code. You are unlikely to
have a simple implementation of a large complex contract, and the complexity of
overly granular contracts would be horrendous. You could measure the defects
associated with the underlying services: Low-quality services are likely the result
of the complexity of poor contracts. You could measure how many times each
contract is reused in the system, and how many times a contract was checked out
and changed: Clearly a contract that is reused everywhere and has never changed
is a good contract. You could assign weights to these measurements and rank the
results. I have conducted such measurements for years across different technology
stacks, systems, industries, and teams. Regardless of this diversity, some uniform
metrics have emerged that are valuable in gauging the quality of contracts.
SIZE METRICS
Service contracts with just one operation are possible, but you should avoid them.
A service contract is a facet of an entity, and that facet must be pretty dull if
you can express it with just one operation. Examine that single operation and
ask yourself some questions about it. Does it use too many parameters? Is it too
coarse, so that you should factor the single operation into several operations?
Should you factor this operation into an existing service contract? Is it something
that should best reside in the next subsystem to be built? I cannot tell you which
corrective action to take, but I can tell you that a contract with just one operation
is a red flag, and you must investigate it further.
Interestingly, in the human world you always use contract size metrics to assess the
quality of a contract. For example, would you sign an employment contract that
has just one sentence? You would decline this contract because there is no way
that a single sentence (or even a single paragraph) could capture all the aspects of
you as an employee. Such a contract is certain to leave out crucial details such
as liability or termination and may incorporate other contracts with which you
are unfamiliar. On the other extreme, would you sign an employment contract
containing 2000 pages? You would not even bother to read it, regardless of what
it promises. Even a 20-page contract is cause for concern: If the nature of the
employment requires so many pages, the contract is likely taxing and complex. But
if the contract has 3–5 pages, you may not sign it, but you will read it carefully.
From a reuse perspective, note that the employer will likely furnish you with the
same contract as all other employees have. Anything other than total reuse would
be alarming.
AVOID PROPERTIES
Many service development stacks deliberately do not have property semantics in
contract definitions, but you can easily circumvent those by creating property-like
operations, such as the following:
string GetName();
void SetName(string name);
Interestingly, you can derive the number of contracts per service using the estima-
tion techniques of Chapter 7. Using only orders of magnitude, should the number
of contracts per service be 1, 10, 100, or 1000? Clearly, 100 or 1000 contracts is a
poor design, and even 10 contracts seems very large. So, in order of magnitude,
the number of contracts per service is 1. Using the “factor of 2” technique, you
can narrow the range further: Is the number of contracts more like 1, 2, or 4? It
cannot be 8 because that is almost 10, which is already ruled out. So the number
of contracts per service is between 1 and 4. This is still a wide range. To reduce
the uncertainty, you can use the PERT technique, with 1 as the lowest estimation,
4 as the highest, and 2 as the likely number. The PERT calculation yields 2.2 as
the number of contracts per service:
2.2 ≈ (1 + 4 × 2 + 4) / 6
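The same calculation as a trivial helper (the class and method names are assumed):

public static class Estimation
{
    // PERT estimate: (lowest + 4 * most likely + highest) / 6
    public static double Pert(double lowest, double likely, double highest) =>
        (lowest + 4 * likely + highest) / 6;
}

// Estimation.Pert(1, 2, 4) returns 2.1666..., rounding to the 2.2 contracts per service above.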
In practice, in well-designed systems, the majority of services I have examined had
only one or two contracts, with a single contract as the more common case. Of the
services with two or more contracts, the additional contracts were almost always
non-business-related contracts that captured aspects such as security, safety, per-
sistence, or instrumentation, and those contracts were reused across other services.
USING METRICS
The service contract design metrics are evaluation tools, not validation tools.
Complying with the metrics does not mean you have a good design—but violat-
ing the metric implies you have a bad design. As an example, consider the fi rst
version of IScannerAccess. That service contract has 4 operations, right in the
middle of the range of the 3 to 5 operations metric, yet the contract was logically
inconsistent.
Avoid trying to design to meet the metrics. Like any design task, service contract
design is iterative in nature. Spend the time necessary to identify the reusable
contract your service should expose, and do not worry about the metrics. If you
violate the metrics, keep working until you have decent contracts. Keep examining
the evolving contracts to see if they are reusable across systems and projects. Ask
yourself if the contracts are logically consistent and cohesive, and whether they represent independent facets.
Once you have devised such contracts, you will see that they match the metrics.
Even senior developers may require mentorship to be able to design contracts cor-
rectly, and you, as the architect, can guide and train them. This will enable you to
make the contract design part of each service life cycle. With a junior team, you
cannot trust the developers to come up with correct reusable contracts; most likely,
they will come up with service contracts resembling Figure B-1 or Figure B-2. You
must use the approach of Chapter 14 to either carve out the time to design the
contracts before work begins or, preferably, use a few senior skilled developers to
design the contracts of the next set of services in parallel to the construction activ-
ities for the current set of services (see Figure 14-6). You should use the concepts
of this appendix and Figure B-4 to educate your manager on what it really takes
to ship well-designed services.
APPENDIX C
DESIGN STANDARD
The ideas in this book are simple and consistent both internally and with every
other engineering discipline. However, it can be overwhelming at first to come
to terms with this new way of thinking about system and project design. Over
time and with practice, applying these ideas becomes second nature. To facilitate
absorbing them all, this appendix offers a concise design standard. The design
standard captures all the design rules from this book in one place as a checklist.
The list on its own will not mean much, because you still have to know the context
for each item. Nevertheless, referring to the standard can ensure that you do not
omit an important attribute or consideration. This makes the standard essential
for successful system and project design by helping you enforce the best practices
and avoid the pitfalls.
The standard contains two types of items: directives and guidelines. A directive is
a rule that you should never violate, since doing so is certain to cause the project
to fail. A guideline is a piece of advice that you should follow unless you have a
strong and unusual justification for going against it. Violating a guideline alone is
not certain to cause the project to fail, but too many violations will tip the project
into failure. It is also unlikely that, if you abide by the directives, you will have
any reason to go against the guidelines.
DIRECTIVES
1. Avoid functional decomposition.
2. Decompose based on volatility.
3. Provide a composable design.
4. Offer features as aspects of integration, not implementation.
5. Design iteratively, build incrementally.
6. Design the project to build the system.
7. Drive educated decisions with viable options that differ by schedule, cost, and risk.
8. Build the project along its critical path.
9. Be on time throughout the project.
INDEX
  TradeMe project design case study time-cost models, 358–359
Modules. See Services

N
Naming conventions, services, 65–66
Network compression. See Compression, of schedule
Network diagrams. See also Critical path analysis
  arrow diagram, 196–197
  arrow versus node diagrams, 197–198
  introduction, 167
  node diagram, 196
  overview of, 195–196
  project design in action, 261–262, 271, 273, 278, 283, 287
  TradeMe project case study, 342, 347, 351
Network of networks
  benefits of, 328
  countering Conway's law, 331
  creative solutions, 330–331
  designing, 328–331
Network, project. See Project network
Node diagram
  versus the arrow diagram, 197–198
  introduction to, 196
Nonbehavioral dependencies, TradeMe project, 340
Noncoding activities, TradeMe project, 338
Nonstructural coding activities, TradeMe project, 336–337
Normal solution
  decompression, 250
  finding, 220–221
  introduction to, 215
  and minimum direct cost, 251–252
  and minimum total cost, 225–228
  project design in action, 265–276
  and the risk curve, 238
  risk metric guideline, 253
  time-cost curve, 220–221
  TradeMe project design case study, 341–346

O
On the Criteria to Be Used in Decomposing Systems into Modules (Parnas), 34
Open architecture. See also Closed architecture
  introduction to, 75–76
  issues of calling up and sideways, 107
  semi-open architecture, 76–77
Operational concepts, TradeMe project, 119–122, 340
Operational dependencies. See also Activity dependencies, 339–340
Optimal project design point
  with risk, 252
  in project design in action, 301
  in TradeMe project design case study, 359–360
Optionality, communication with management, 366–367
Outliers floats, adjusting
  with geometric activity risk, 318
  project design in action, 295
  in TradeMe case study, 356
Overestimating, project, 158, 402–403

P
Parallel life cycles, 374
Parallel, working in
  compression with simulators, 284–285
  high compression and, 248–249
  infrastructure and client designs first, 280–283
  multiple developers per service, 154
  parallel work candidates, 212–213
  project complexity and, 322
  project design in action, 278–280
  senior developers as junior architects, 376
  splitting activities, 211, 280
  TradeMe project design case study, 346–347
  work and cost, 213
Parkinson's law
  increased risk, and, 239
  overestimation and, 158
  probability of success, and, 159
  time crunch, 6
Parnas, David. See On the Criteria to Be Used in Decomposing Systems into Modules (Parnas)
Peer reviews, importance of, 148, 210, 381
PERT (Program Evaluation and Review Technique), 162, 422
Phases, project
  project design in action, 264
  TradeMe project design case study, 342
Phi. See also Golden ratio, 244–245, 317
Planning assumptions
  calendar dates and, 176
  in debriefing, 378
  in design of project design, 271–272
  in design standard, 428
  in project design in action, 263–265
  in TradeMe project design case study, 341
Plans/planning. See also Earned value planning; SDP (Software Development Plan) review
  benefits of having multiple, 152
  importance of staying on, 399
  STP (service test plan), 389
Service protocols, internal and external, 74–75
Service requirement specification (SRS), 389
Service test plan (STP), 389
Services
  assigning developers to, 153–157
  bloating and coupling, functional services, 17–19
  business logic service, 117–119
  clients: bloating and coupling, 15–17
  and contracts, 411–415
  development life cycle, 388–389
  granular in actor model, 121
  service-level testing, 380
  service names, 65–66
  size and quantity, 407–411
  smallest set of components, 89–90
  and subsystems, 70–75
  too many or too big, 14–15
  using services to cross layers, 59–60
Shallow S curve. See also Earned value planning
  earned value, 190–194
  in project design in action, 266, 271, 274
  throughput analysis, 287
  tracking project progress, 395
  in TradeMe project design case study, 344, 349, 353
Simulators
  compared with the normal solution, 286
  in compression, 212, 280
  compression using the, 284–286
  for god activities, 308
  removing dependencies, 212
  simulators solution in project design in action, 284–286
Siren song of bad habits, 47–48
Skills
  identifying areas of volatility, 34–36, 52–53
  improving development skills, 208–209
  practicing project design, 377–378
Small projects, 331–332
Smoke tests, daily, 380
Software architect. See Architects
Software Development Plan review. See SDP (Software Development Plan) review
Splitting activities, 211, 280
SRS (service requirement specification), 389
Staffing. See also Resources
  calculating project cost, 184–185
  as cost element, 228–230
  distribution charts, 177–183
  distribution mistakes, 179–183
  elasticity of, 186
  initial staffing in project design, 147–151
  planning requirements, 263–265
  project design in action, 266, 285–286
  smooth staffing distribution, 183
  TradeMe project design case study, 341–342, 345–346, 348–350, 351–352
Standard operating procedures (SOPs), quality assurance, 382
Standards
  for design. See Design standard
  importance of, 209, 382
Static architecture
  The Method structure, 60–61
  project design in action, 256–257
  TradeMe system design case study, 116
Status, of projects. See Project progress
Status reports. See Reporting progress
Storage volatility, trading system example, 43, 46
STP (service test plan), 389
Structural activities, TradeMe project, 336–337
Structural dependencies, abstract. See also Activity dependencies, 339
Structure
  classification guidelines, 65–70
  client and business logic layers, 60–63
  design don'ts, 80–81
  introduction to, 55
  layered approach, 58–60
  open and closed architectures, 75–79
  ResourceAccess layer, 63–64
  subsystems and services, 70–75
  typical layers in The Method, 60–61
  use cases and requirements, 56–58
Subcritical staffing
  compared with load leveling, 183
  introduction, 172
  project complexity and, 322
  project design in action, 272–274
  in staffing distribution chart, 180
  TradeMe project design case study, 353–354
Subsystems
  and services, 70–75
  and the timeline, 373–374
Success
  defining, 145–147
  probability as a function of estimation, 158–159
Swim lanes, activity diagrams, 105, 112, 124–130
System design. See also Layered design; Requirements analysis
  anti-design effort, 20–21
  architecture versus estimations, 365–366
  benefits of extensibility, 72
  composition. See Composition
  decomposition. See Decomposition
  design decision tree, 7–8
  design standard guidelines, 426–427
  introduction to The Method, 4–5
  modular system design, 409–411
  structure. See Structure
  validation, 5–6