24 Patterns For Clean Code - Techniques For Faster, Safer Code With Minimal Debugging (2016 Robert Beisert) (C Programming)
Documenting as you go
First, we have to remember a vital documentation rule: Everything which is
related to the code is documentation. All your comments, graphs, doodles,
and idle musings are valuable insights into the mind of the programmer
who constructed this code (as opposed to the programmer who is currently
reading the code).
I highly recommend that, as you write your code, you place useful comments
and tags in there. Something as simple as “THIS BLOCK IS FULL OF VARIABLES
WE NEED FOR THIS OPERATION” can save you minutes of inquiry, while longer
comments detailing the flowchart for a function (or subfunction) can save
you hours. Also, if you use meaningful naming conventions and meaningful
cross-referencing, you can skip around your code in seconds, connecting the
spaghetti into a single whole.
Of greatest importance to the end-user, always document your function’s
purpose, return types, and arguments before you even start fleshing out the
details. This will spare you confusion (as you never let the complexity of the
function spiral beyond the defined purpose), and it will allow you to create
your final documentation almost effortlessly.
NATURALLY, IF YOU MAKE FUNDAMENTAL CHANGES TO A FUNCTION, YOU HAVE TO
UPDATE YOUR DOCUMENTATION. IT MAY STILL BE WORTH LEAVING SOME OF THE
OLD CODE COMMENTED OUT, JUST SO YOU HAVE EVIDENCE OF A USEFUL PATTERN
OR IDEA.
Doxygen
I am a tremendous fan of the doxygen program. This program interprets
declarations and specially-formatted comments to generate HTML
documentation, complete with hyperlinks, graphs, and flow-charts. With this
tool, you can rapidly convert useful source-code commentary into powerful
documentation.
The format is extremely minimal. All comments surrounded by /** … */
blocks are interpreted by Doxygen, along with all in-line comments starting
with ///. This means that a single-character addition turns a regular comment
into a Doxygen comment.
The tag system in doxygen is also simple. Every line beginning with a
particular tag will produce a section of meaningful output in the final
documentation. These tags include (but are not limited to):
@param – Describes a function input parameter
@param[in] – Declares the parameter to be input only
@param[out] – Declares the parameter to be output only
@param[inout] – Declares that the parameter is read in, transformed, and passed back out
@return – Describes the expected return value(s)
@author – Lists a name as the author of a file, function, etc.
@brief – Declares that the line which follows is a brief description of the function
@details – Declares that everything up to the next tag is part of a detailed function description
We also can use ///< to describe enumerations, objects within a structure, etc.
It is a tremendously powerful tool which encourages ample in-source
commentary and provides rapid documentation on demand. For more
information, check out the Doxygen user’s manual.
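To make the format concrete, here is a minimal sketch of a Doxygen-commented function and enumeration. The function and its names are hypothetical, invented for illustration; only the tags themselves come from Doxygen.

```c
#include <stddef.h>
#include <string.h>

/** @brief Status codes for bounded_copy. */
enum copy_status {
    COPY_OK = 0,    ///< full string copied
    COPY_TRUNC = 1  ///< source was truncated to fit
};

/**
 * @brief Copy a string into a fixed-size buffer, always terminating it.
 *
 * @details Copies up to size - 1 characters of src and writes a
 * terminating NUL, truncating if src is too long for dst.
 *
 * @param[out] dst  destination buffer
 * @param[in]  src  source string
 * @param[in]  size total size of dst, in bytes
 * @return number of characters copied, not counting the terminator
 */
size_t bounded_copy(char *dst, const char *src, size_t size)
{
    size_t n = strlen(src);
    if (n >= size)
        n = size - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

Running doxygen over a file like this produces a documentation page with the brief line, the detailed description, and a table of parameters, with no extra effort beyond the comments themselves.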
Yank to Mark
We know that the vi equivalent to "copy-and-paste" is "yank-and-put". What
many people do not realize is that you can open any number of files in a
single vi instance, allowing you to yank from one file and put into another.
The final piece of this puzzle is the "mark" feature. It is possible for us to
create a mark on a given line with the following command:
m<letter>
EXAMPLE: ma
We can then return to that line with the command:
'<letter>
EXAMPLE: 'a
This allows us to yank large sections of code from one place to another using
the command:
y'<letter>
EXAMPLE: y'a
This is the "yank to mark" feature.
Application 1: Code Completion
We can use the yank-to-mark feature to copy entire function prototypes back
into our new code. To do this, we first navigate to the spot where our function
call will go. We can then go to the header file containing the required
function prototype and yank it out. Finally, we "put" the prototype into our
new code, then swap the prototype values out for our new values.
For example, suppose we have a prototype that looks something like this in a
file file.h:
int function(
int Value_one,
char *Value_two,
struct astructure Value_three
);
If we want to call this function in main.c, we can open vi like so:
vi main.c file.h
We then yank everything from "int function(" to ")" using our "yank-to-
mark" feature and put it into main.c.
At that point, it is a simple matter to replace the return value and all
parameters with the values that we should pass into the function, all while
ensuring that the types match.
Problem 2: Aliasing
Most beginning programmers are unaware that the UNIX/Linux systems
allow you to assign an alias (new name) to a function or set of commands.
These are stored in the system instead of in a code file.
The problem here should be obvious: the other guy's computer lacks your
alias.
This usually trips up programmers when they are helping out a friend, only to
find that their aliases don't work.
Generally speaking, don't use aliases unless you can comfortably operate
without them.
Problem 3: Other
There are any number of customization options on machines these days, and
most of them are harmless. At the end of the day, it usually doesn't matter to
me whether the machine has a GNOME, X, or KDE environment. I can even
work with VI when someone's fiddled with all the highlighting options.
However, when you start fussing with anything non-standard (libraries,
programs, etc.), you make it harder for everyone else to replicate what you're
doing. In a corporate or open-source sense, that's disastrous.
Principles of Classes
Before we consider a structure for naming classes, we have to consider the
meaning of "class".
The highest-level observation about classes is that they act as a unit.
Specifically, a class consists of a structure and a set of functions that operate
on that structure. Of course, the structure is the key root of the class, because
all the functions address that structure in one way or another.
Second, we can observe that classes in Java are relegated to their OWN
separate code files. This is due to a shortcut by the Java developers (who
decided it was easier to prevent collisions by relegating classes to specifically
named files), but it is applicable to all coding conventions. In compliance
with the unitary nature of classes, it makes sense to keep all the functionality
required for a class in a single file.
Finally, we observe that each class has its own functions (or "methods") that
work explicitly on that class object. These methods may not have globally
unique names in the source code (which is a cause for complaint by those of
us who employ Linux utilities to accelerate our programming), but the
symbol table produced always contains globally unique names by prepending
a host of unique identifying values, which may include class name,
namespace, and (potentially) file name from which the function is derived.
These observations allow us to construct classes in C, only with much
simpler dynamics.
Naming Convention
Let's approach the naming convention using the pattern of our observations.
First, we note that classes are built around a structure. This structure should
have a unique name (because all names in C must be unique). Furthermore,
all functions which we would call "class methods" in an OOP language
should be visually tied to that structure.
First: All functions related to a structure ("class") must contain the
name of that structure.
Second, we note that all classes in Java are relegated to a single file. This is a
reasonable practice.
Second: All class structures and functions must be contained in the same
file.
Third, we observe that each symbol produced by the OOP compilers and
interpreters is globally unique, regardless of the original name's uniqueness.
We can apply that to our code by prepending an aspect of the file name
directly to everything related to the class.
Third: All class structures and functions must begin with the name of the
file in which they are contained (or a logical subset thereof, which is
known to all users).
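The three rules can be sketched in a single hypothetical file. The names here (a queue "class" in queue.c) are invented for illustration; the point is that the structure and every function that operates on it share one file and one name prefix.

```c
/* queue.c -- the "class" structure and every function that operates
 * on it live in this one file, and all of them begin with the
 * file's name. */
#include <stdlib.h>

struct queue {
    int    *data;
    size_t  count;
    size_t  cap;
};

struct queue *queue_create(size_t cap)
{
    struct queue *q = malloc(sizeof *q);
    if (q == NULL)
        return NULL;
    q->data = malloc(cap * sizeof *q->data);
    if (q->data == NULL) {
        free(q);
        return NULL;
    }
    q->count = 0;
    q->cap = cap;
    return q;
}

int queue_push(struct queue *q, int value)
{
    if (q == NULL || q->count >= q->cap)
        return -1;
    q->data[q->count++] = value;
    return 0;
}
```

Anyone who sees queue_push in a distant source file knows instantly which structure it touches and which file defines it.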
Constructors
Constructor functions take a number of arguments and return a pointer to an
allocated space. Even so, there is a simple rule for meaningful returns:
Constructors return either a newly allocated pointer or NULL
Of course, if you work with a slightly more complex constructor, you can
return the pointer in a parameter. In these cases, you should still make the
constructor return something meaningful.
Personally, in these cases, I use a tri-state return type. If the function is
successful, I’ll return the size of the allocated space (in bytes). If it has an
error, I’ll return a negative value correlating to the type or location of the
failure. However, if the function works EXCEPT that the malloc does not
successfully allocate any space, I’ll return 0.
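Here is a sketch of that tri-state convention, using a hypothetical buffer constructor. The pointer comes back through a parameter, and the return value carries the three states described above.

```c
#include <stdlib.h>

struct buffer {
    char  *bytes;
    size_t size;
};

/* Tri-state constructor: returns the allocated size on success,
 * 0 when malloc itself fails, and a negative value for bad input. */
long buffer_create(struct buffer **out, size_t size)
{
    if (out == NULL || size == 0)
        return -1;                      /* nonsense arguments */

    *out = malloc(sizeof **out);
    if (*out == NULL)
        return 0;                       /* allocation failure */

    (*out)->bytes = malloc(size);
    if ((*out)->bytes == NULL) {
        free(*out);
        *out = NULL;
        return 0;                       /* allocation failure */
    }

    (*out)->size = size;
    return (long)(sizeof **out + size); /* total bytes allocated */
}
```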
Meaningful names
We’ve discussed defining function and object names which employ a
CLASS_OBJECT_function structure. The problem is, there’s little difference
between the following Java and C:
Class.function
Class_object_function
In both of these cases, if the “function” name is just “solve”, there’s not a lot
for us to go on.
Now, if we replaced the function name with something like “RSA_encrypt”
or “DES_rotate”, we have something meaningful. We now know that the
function is supposed to (in the examples) encrypt some input using RSA or
perform the rotate operation defined in the DES standard.
Now we know exactly what the function is supposed to do, so we don’t have
to go very far to determine whether it’s working properly or not.
This goes further than simple object and function naming conventions. When
you instantiate a variable, you run into the same problems.
Suppose you have a set of variables with names like these:
int thing;
char *stuff;
uint8_t num;
While these may be fine in small functions (because there’s no room for
confusion – you see it instantiated and employed in the same screen), they’re
problematic in larger and higher-level functions.
If your main function has variables like that, you don’t know what they’re
supposed to represent. However, if they have names like these:
uint64_t RSA_256_key_pub[4];
int object_expected_size;
char *error_message;
Suddenly, we have an idea of what these variables do.
Admittedly, this increases the amount of time you spend typing (and my heart
bleeds for you), but it reduces confusion while you’re writing the code and
makes maintenance hundreds of times easier.
Shipped code
So, you shipped your code. If you're a giant AAA game company like EA,
you might be able to get away with ignoring the vast majority of error reports
(because you're busy raking in money from your annual franchises), but the
rest of us take some pride in our release code. More importantly, corporate
programs need to work or the company takes on work, risk, and financial
liability.
Have you ever tried to run a debugger over code without debug tags? There
are no variable names, the registers hold meaningless values, and you will
spend more time trying to read the log outputs than you do fixing the
problem.
Worse still, if you try to recompile with debug tags, you might eliminate or
shift the problem. This means that you will not successfully debug the
problem, and you may well introduce new bugs into the code.
This is why it's valuable to release code with the debug tags intact.
When an error report comes back with all of your (highly detailed) logs, you
can quickly recreate the issue in your debugger. Even better, you can set
effective breakpoints and manipulate the variables such that you are now able
to determine the source of the error and correct it, distributing a quick and
(relatively) painless patch to eliminate the issue.
THERE ARE EXCEPTIONS. IF WE'RE DEALING WITH SECURE CODE, WE DO NOT WANT IT
TO BE EASY FOR JUST ANYONE TO ACCESS THE INTERNALS AND MODIFY THEM. IN
THESE CASES, WE CAN'T EMPLOY THIS PATTERN (WHICH MEANS WE HAVE TO
DOUBLE-DOWN ON OUR LOG FILES TO MAKE UP FOR IT).
Error orientation
We notice that, in happy path programming, there is exactly one path from
start to finish. Because there are so few branches by default, most
programmers write code in a way that makes error handling more
complicated. The patterns they tend to employ include:
Wedging all the functionality into one (main) function
Assuming all functions terminate with perfect results
Failing to create log outputs to track flow through the
program
The problem with these patterns is that they’re unrealistic. Wedging all the
functionality into the main function eliminates modularity, which reduces the
efficacy of error checking. Assuming all functions are successful is a fool’s
dream. And if you don’t create logs when things go right, you’ll never figure
out how they went wrong.
The solution to all of this is to design with failures in mind.
Enforce modularity, because it limits the range of things that could go wrong
at one time.
Check every input to every function, to make sure they aren’t nonsense (like
writing 500 bytes to a single character pointer).
Use the if-else programming pattern to make functions internally modular as
well as externally modular.
Create lots of logs. You want to keep records of everything that’s supposed to
happen and everything that’s not supposed to happen. For every possible
failure, attach a specific failure message that references the location of the
failure in your code. Between these logs, you’ll know exactly what’s
happening inside your program at all times.
If you create habits that revolve around all the things that can go wrong
instead of what you expect to go right, you’ll vastly reduce your maintenance
and debugging workload.
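A minimal sketch of the logging habit, assuming a hypothetical parse_port function: every input is checked, and every failure path records exactly where in the code it happened before returning.

```c
#include <stdio.h>
#include <stdlib.h>

/* A tiny failure log that references the location of the failure
 * in the code -- a sketch, not a full logging framework. */
#define LOG_FAIL(msg) \
    fprintf(stderr, "FAIL %s:%d (%s): %s\n", \
            __FILE__, __LINE__, __func__, (msg))

int parse_port(const char *text, int *port)
{
    if (text == NULL || port == NULL) {
        LOG_FAIL("NULL input");
        return -1;
    }
    int value = atoi(text);
    if (value < 1 || value > 65535) {
        LOG_FAIL("port out of range");
        return -2;
    }
    *port = value;
    return 0;
}
```

When a failure message arrives, the file, line, and function name point straight at the branch that produced it.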
Common Practices
There are any number of ways to deal with errors outside of a rigidly defined
paradigm.
Some (like those who operate predominantly in C++) may define supervisor
functions that simulate the try-catch expression. This is less than common,
because in defining the supervisor function you usually begin to appreciate
the range of all possibilities. When you start trying to handle every possible
error with one megafunction, you start to appreciate the simplicity of catching
errors manually.
The most common practice is to test the output of functions and related
operations. Whenever you call an allocation function like malloc() or calloc(),
you test the output to ensure that space was properly allocated. When you
pass-by-reference into a function, you test the inputs to ensure that they make
sense in the context of your program. Methods like these allow us to
manually handle both the flow and the error-handling of our code.
However, in most cases we have a “multiple-breakout” pattern of tests. These
patterns look something like this:
char *blueberry = (char *) malloc(50 * sizeof(char));
if (blueberry == NULL)
return -1;
int pancake = 0;
do_thing(blueberry, "there is stuff", &pancake);
if (pancake < 0 || pancake > 534)
return -2;
do_other_thing(&pancake, time(NULL));
if (pancake < 65536)
return -3;
...
return 0;
This pattern runs the risk of terminating before memory is properly freed and
parameters are properly reset. The only ways to avoid this terrible condition
are to manually plug the cleanup into every error response (terrible) or to use
goto (not terrible, but not strictly kosher).
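For completeness, here is a sketch of the goto variant, with hypothetical names: every failure jumps to a single cleanup block, so nothing terminates before memory is freed.

```c
#include <stdlib.h>
#include <string.h>

int make_greeting(char **out)
{
    int code = 0;
    char *buf = NULL;

    if (out == NULL) {
        code = -1;
        goto cleanup;
    }

    buf = malloc(50 * sizeof(char));
    if (buf == NULL) {
        code = -2;
        goto cleanup;
    }

    strcpy(buf, "hello");
    *out = buf;
    return 0;           /* success: the caller now owns buf */

cleanup:
    free(buf);          /* free(NULL) is defined to do nothing */
    return code;
}
```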
If-Else Chain
There is one method for handling errors that is consistent with another pattern
we cover (error-orientation): we build a great if-else chain of tests.
This pattern is confusing to many for two reasons:
It fundamentally reorients the code away from the
“happy-case” paradigm (in which all error-handling is a
branch off the main path) to a “failure-case” paradigm
(in which the happy case is the result of every test in the
chain failing)
All our happy-path code finds itself inside of an if()
statement – nothing can be permitted to break the chain
It’s a bit hard to describe this pattern without an example, so bear with me:
int copy(const char * const input, int size, char * output)
{
    int code = 0;
    if ( input == NULL )
    {
        code = -1;
    }
    else if ( output = (char *) malloc(size * sizeof(char)), output == NULL )
    {
        code = -2;
    }
    else if ( strncpy(output, input, size), 0 )
    {
        //impossible due to comma-spliced 0
    }
    else if ( strncmp(output, input, size) )
    {
        code = -3;
    }
    else if ( printf("%s\n", output) < 0 )
    {
        //printf returns the number of printed characters
        //Will only be less than 0 if a write error occurs
        code = -4;
    }
    else
    {
        //could do something on the successful case, but can't think of what that would be
    }
    //Normally we would pass output back, but let's just free him here for fun
    //This is where we do all our cleanup
    if (output != NULL)
        free(output);
    return code;
}
As we can see, each step in the error-handling function is treated as a possible
error case, each with its own possible outcome. The only way to complete the
function successfully is to have every error test fail.
Oh, and because this code is fundamentally modular, it is very easy to add
and remove code by adding another else-if statement or removing one.
Lesson: Switching to an if-else chain can
improve error awareness and accelerate your
programs in operation, without requiring much
additional time to design and code.
Check the boundaries (including nonsense)
A wise quote: “When it goes without saying, someone should probably say
it.”
This is one of the better known patterns out there, but it still bears repeating.
Extrapolating to Strings
The boundary rule says that we should check just inside and outside of the
boundaries that show themselves, but how does that translate to complex
values like strings?
With strings, we are usually either handling them character-by-character or as
a unit. In the first case, you can use the character boundary rules, but in the
second case the biggest thing we worry about is size. Here, we rely on the
integer rule.
For example, if you have allocated a buffer of 25 characters, you should test
that buffer with the following values:
24
25
26
Ideally, you have no problems with 24 or 25, but there is a chance that you
forgot to clear the buffer, leaving an unexpected value inside your buffer on
the next use. In this case, testing for 24 will ensure that you see and eliminate
the issue.
For 26, we know that we’re pushing beyond the boundary of the string. In
this case, we want to ensure that we get some form of error handling
response, assuring us that we’ve eliminated buffer overflows in this section
of the code.
And it goes on
We can use the same rule for address ranges (which are dependent on your
platform, but follow similar rules to string buffers), or for structure values, or
for just about anything we ever do on a machine. It’s a basic principle.
We should also note that, when we’re dealing with robust tests, we want to
test every possible boundary state. For example, for a simple line like this…
if ( i < 0 && j > 36)
…we should test every combination of the boundaries around i=0 and j=36.
This ensures that we’ve covered all our bases, and nothing strange goes
wrong.
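The combinations can be enumerated directly. In this sketch, the condition above is wrapped in a hypothetical function and every pairing of values just inside and just outside the i = 0 and j = 36 boundaries is exercised.

```c
#include <stdio.h>

int in_danger_zone(int i, int j)
{
    return (i < 0 && j > 36);
}

/* Check both sides of each boundary, in every combination. */
void test_all_boundaries(void)
{
    int is[] = { -1, 0, 1 };   /* around i = 0 */
    int js[] = { 35, 36, 37 }; /* around j = 36 */
    for (int a = 0; a < 3; a++)
        for (int b = 0; b < 3; b++)
            printf("i=%2d j=%2d -> %d\n",
                   is[a], js[b], in_danger_zone(is[a], js[b]));
}
```

Nine cases cover every way the two comparisons can interact, which is cheap insurance against an off-by-one in either bound.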
Yin: Beginnings
In the binary unifying force of the cosmos, there are the forces of yin and
yang. These elements are equal complements to one another – where one
exists, the other necessarily exists.
The element of Yin is the feminine element, and it is associated with
beginnings and creation. For our purposes, the Yin is all things which begin a
course of action.
These are the opening brackets and braces.
These are the allocation functions.
These are the function prototypes and “happy path” logic.
All of these things, which define and begin courses of action, are Yin.
Yang: Endings
Yang, on the other hand, is a masculine element associated with action,
completion, and death. For our purposes, the Yang is everything that closes
off a course or path of action.
These are the closing brackets and braces.
These are the free or destruction functions.
These are function definitions and error-handling paths.
All of these things are Yang.
Yin and Yang are One
In the Eastern philosophy of the Tao (literally, the “way”), yin and yang are
balanced forces that continue in a circle for eternity. The elements pursue one
another – destruction follows creation, and a new creation is born out of the
old destruction. The masculine and feminine forces of the cosmos require one
another for total balance and perfect harmony in the universe.
So, too, it is in our code.
When we create an opening brace, we must immediately define the closing
brace, so that the balance is preserved.
When we allocate a resource, we must immediately define its time and place
of destruction, so that balance is preserved.
When we prototype a function, we must immediately define its features, so
that balance is preserved.
By keeping the two halves of our operations united, we ensure that we never
have to chase down the imbalances between them.
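In code, the pairing discipline looks something like this sketch (names invented for illustration): the free is written in the same moment as the malloc, and the body is filled in between them afterwards.

```c
#include <stdlib.h>
#include <string.h>

int balanced_work(void)
{
    char *scratch = malloc(64);    /* yin: creation */
    if (scratch == NULL)
        return -1;

    /* ...the body is filled in later, between the two halves... */
    strcpy(scratch, "balanced");
    int result = (int)strlen(scratch);

    free(scratch);                 /* yang: destruction, written
                                      together with the malloc */
    return result;
}
```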
Iterator Abuse
Just about every program out there uses an iterator for one reason or another.
Without them, we can't build for loops, and more than a few while loops
require iterators as well.
They let us perform simple operations on buffers and arrays.
They let us control flow of information.
Fundamentally, an iterator is a control line. It is essential that we maintain
the integrity of our controls, because without that control we exponentially
increase the complexity of the problem.
Iterator Abuse is the act of violating the integrity of an iterator, which
destroys the line of control and makes the program act in complicated or
unexpected ways. This abuse is performed in a number of ways:
Reusing the iterator before its control function has been
completed
Modifying the iterator inside of the loop (usually by a
non-standardized unit)
Passing the iterator into a function without protection
Using your only copy of a pointer as an iterator
etc.
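The first abuse on the list can be demonstrated in a few lines. In this sketch (with invented names), reusing the outer iterator before its loop has finished silently destroys the outer control line.

```c
#define ROWS 3
#define COLS 4

/* ABUSE: the inner loop reuses i, so the outer loop exits after a
 * single pass and only one row is ever counted. */
int count_cells_broken(void)
{
    int i, count = 0;
    for (i = 0; i < ROWS; i++)
        for (i = 0; i < COLS; i++)   /* outer iterator reused */
            count++;
    return count;   /* 4, not the expected 12 */
}

/* Fixed: each loop owns its iterator, and the control lines stay
 * independent. */
int count_cells_fixed(void)
{
    int count = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            count++;
    return count;   /* 12 */
}
```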
Header Files
A header file is essentially an interface between a piece of source code and
another piece of source code. We can think of it as a library index; it tells us
where to find functions and what we need to do to reference them.
However, there is a big problem in the header world (especially for those
who come over from C++): some programmers put code in the header file.
This is a problem for a couple of key reasons:
It's logically inconsistent - headers and code are separate
for a reason
It eliminates library safety - we don't compile header
files to object code, so we can't include that code in a
library
It extends the number of files we have to check - a
problem can now occur in all headers (in addition to all
code)
It's just plain ugly
Generally speaking, we only put structure definitions, function prototypes,
and #define macros in a header file. These things are 100% reference - they
can't actually produce any code by themselves.
When we do this, we enforce an essentially modular design. The compiler
uses headers to create "plug-ins" where library functions should go, and the
linker plugs those functions in when the time is right. Everything does its job,
and all is right with the universe.
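A header that follows this rule looks something like the following sketch (the queue names are hypothetical). Everything in it is 100% reference material; nothing here produces object code by itself.

```c
/* queue.h -- structure definitions, prototypes, and #define
 * macros only. */
#ifndef QUEUE_H
#define QUEUE_H

#include <stddef.h>

#define QUEUE_DEFAULT_CAP 64

struct queue {
    int    *data;
    size_t  count;
    size_t  cap;
};

int queue_push(struct queue *q, int value);
int queue_pop(struct queue *q, int *value);

#endif /* QUEUE_H */
```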
Analogue: #include "*.c"
Very few programmers have ever compiled a source with a #include'd code
file.
I have.
It's a dumb thing to do, because it complicates compilation and doesn't make
linking any easier.
It's also just bad juju.
Rough Outline
We usually start with a piece of pseudocode that generally describes what we
want a function to do. This can be written in basically any format (text,
outline, pseudocode, diagram, etc.), but it has to detail the basic elements we
expect to find in our function.
Specifically, our outline serves these two purposes:
Expose functionality for better breakdown
Reveal the interface elements that we might want or
need
If we skip this step entirely, we tend to spend a lot of time reworking our
basic code as we discover features we wanted to include.
7 +/- 3
Those who study psychology (whether professionally or as a hobby) notice
that people cannot hold too many thoughts in active memory at once. If
you've ever crammed for a test, you know that by the time you get halfway
through the terms, you've "put away" the meanings of the first few terms.
Meanwhile, you can instantly recall the definition for the last term you
studied.
Hypnotists (who actively manipulate the conscious and unconscious mind)
have a rule: people can only hold between 4 and 10 thoughts in their minds at
once. There are several powerful hypnotic techniques that rely on this truth
(specifically the "hypnotic blitz", which barrages the recipient with an idea
until it sticks).
It's important for every engineer, designer, and programmer to remember this
truth: You can only generally hold 7 +/- 3 thoughts in your mind at once.
Unity of Function
If we adhere to the principle of top-down modular design, all our functions
should be relatively small. At the bottom of the tree, every function serves
one purpose and accomplishes one basic thing. As we climb up, each layer of
functions simply combines the basic unit functions to perform a less-basic
unit function.
As an example, consider the design of a simple server. Some programmers
would design the server as a single block, where every operation is written
directly into the main() loop. However, under the unity principle, the code
should look more like this:
main
    Lookup
    Initial connection
        Open socket
        handshake
            exchange certificates
                send certificate
                receive certificate
                check remote certificate
    receive request
        check buffer
        deserialize
        test
            reply on failure
                Serialize message
                Attach header
                Attach trailer
                transmit
    Retrieve data
    Serialize message
        Attach header
        Attach trailer
    disconnect
        protocol based – close socket
Unity of Form
Unity of form describes the modularity of objects. Each object should be
designed to serve one logical purpose.
For example, the OpenSSL protocol defines an SSL socket distinct from a
basic socket. The SSL socket is composed of two units:
A socket file descriptor
A context structure containing information relevant to
the SSL protocol
Even though the context structure can contain a substantial amount of varied
data, at the top level it looks like a distinct unit. It has one form – a container
for SSL data.
It’s a bit more difficult to describe the unity of form, but the basic rule is that
you should break out what logically makes sense into its own structure.
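The two-unit composition described above can be sketched as follows. The names are illustrative, not the real OpenSSL types: the point is that the secure socket reads as one unit built from two logically distinct units.

```c
/* One container for everything the protocol needs. */
struct tls_context {
    int cipher_id;   /* ...certificates, session state, etc. */
};

/* The secure socket: a plain descriptor plus its protocol data. */
struct tls_socket {
    int fd;
    struct tls_context ctx;
};
```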
Lesson: Each structure serves one logical
purpose, and each function does one logical
task. In this way, we keep things simple for
programmers to understand and manipulate.
Design: Everything You Need, and Only
What You Need
How many times have I seen this pattern in object oriented code?
Public Class
All members of the class are private
All members can be accessed with accessor functions
All members can be changed with modifier functions
I hate to break it to y'all, but that's just a struct. The only thing you've done is
add a hundred lines to a simple process.
Encapsulation
We've already discussed one aspect of encapsulation (that is, the unitary
nature of objects), but the principle also includes "data hiding". In "proper"
encapsulation, the structures are perfectly opaque to the end user, so that the
only way to access the data is through carefully-constructed methods.
The techniques for constructing methods are already well understood, but
how does one create hidden data in C?
Fun fact: If it's not in the header file, the end-user can't touch it.
Fun fact #2: We can use typedef to create pointers to objects in code files
which do not themselves appear in the header.
Fun fact #3: While the compiler and linker can connect the typedef'd pointer
with the original structure definition, the user's code cannot see the
structure's members.
Thus, if we define struct panda in a code file, but the header only contains
methods and the following:
typedef struct panda *PANDA;
the end user can only access the structure by passing PANDA to the access
methods.
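Put together, the pattern looks like this sketch. In a real project the typedef and prototypes would sit in the header while struct panda stays in the code file; here both halves are shown in one place, with hypothetical method names.

```c
#include <stdlib.h>

/* Header half: all the user ever sees. */
typedef struct panda *PANDA;

/* Code-file half: the structure is invisible through the header,
 * so callers can only reach it through the methods. */
struct panda {
    int age;
};

PANDA panda_create(int age)
{
    PANDA p = malloc(sizeof *p);
    if (p != NULL)
        p->age = age;
    return p;
}

int panda_age(PANDA p)
{
    return (p == NULL) ? -1 : p->age;
}

void panda_destroy(PANDA p)
{
    free(p);
}
```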
Abstraction and Polymorphism
We perform abstraction in only two ways:
Function pointers - used to look up objects
Hiding data with "typedef"
Function pointers are too complicated to go into now, but basically we can
write functions that accept pointers to functions, the results of which produce
data that the top-level function can work with. This allows us to create
essentially polymorphic functions.
Because abstraction is relatively complicated in C, it encourages
programmers to rely on patterns and techniques instead of function-
overloading and "let the computer figure it out" mental patterns. That means
we don't employ template classes, interface classes, or abstract classes
(which I would argue ultimately make programming MUCH harder), but we
can still create functional polymorphism and abstraction if we so choose.
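A small sketch of that functional polymorphism, with invented names: fold() works with whatever operation it is handed through the function pointer, so one top-level function covers many behaviors.

```c
#include <stddef.h>

typedef int (*binop_fn)(int, int);

static int op_add(int a, int b) { return a + b; }
static int op_mul(int a, int b) { return a * b; }

/* Polymorphic over the operation: the caller decides what
 * "combine" means. */
int fold(const int *values, size_t count, int init, binop_fn op)
{
    int acc = init;
    for (size_t i = 0; i < count; i++)
        acc = op(acc, values[i]);
    return acc;
}
```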
NOTE: WE DO CREATE "TEMPLATES" IN C. THESE ARE NOT OOP TEMPLATES, BUT
RATHER CODE WHICH IS GENERALLY APPLICABLE TO A NUMBER OF TASKS. WE CAN
SIMPLY USE THESE FRAMEWORK TEMPLATES TO PRODUCE NEW CLASSES QUICKLY.
Inheritance
What exactly does inheritance mean?
Functionally, inheritance ensures that one class contains as its first (and thus,
most easily addressed) member another class. All the aspects of the parent
class are carried into the child class, so that the child is merely an extended
version of the parent. As such, all the parent functions should be able to work
with the child class.
We can do the exact same thing in C using type casting.
We know that we can tell a C program to treat data elements as some other
kind of data element. When we do this, C "maps out" the space that the
casted element should occupy in the original element and treats that space as
the cast element.
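A minimal sketch of the layout, with illustrative names: because the "parent" is the first member of the "child", a pointer to the child is also a valid pointer to the parent.

```c
struct animal {                  /* the parent class */
    int legs;
};

struct dog {                     /* the child: parent comes FIRST */
    struct animal base;
    int fetches;
};

int animal_legs(const struct animal *a)   /* a "parent" method */
{
    return a->legs;
}
```

Calling animal_legs((const struct animal *)&some_dog) works because the cast maps the parent's space onto the start of the child.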
Sure enough, so long as we ensure that the "parent" structure is the first
element in the "child" structure, we can use type casting to perform
inheritance. It's really that simple.