
Dev Notes

This is a random, unsorted collection of notes on several topics: CS, programming, people management, company dynamics, and software planning.

Important: these notes may contain information that is inaccurate or untrue.

Note to self

When linking a video: always include the title of the video and the author (if not already mentioned).

1 Software Planning

.001 Casey Muratori on planning software

How most of us view software architecture:

o---------------------o     o-------------------o
| software architect  | --> | programmer builds |
| creates a blueprint | --> | the program       |
o---------------------o     o-------------------o

But software does not work that way. If you architect the software in depth, you have basically already built it, just in a different language - usually UML diagrams.

A better way to view it: urban planning. You create a basic layout of where things should be, but leave the programmer the freedom to decide exactly how to build them.

o-------------------------o     o------------------o     o----------------------o
| urban planner           | --> | architect        | --> | builder / contractor |
| (aka software designer) | --> | (aka programmer) | --> | (aka compiler)       |
o-------------------------o     o------------------o     o----------------------o

I'm now replacing the term "Software Architect" with "Software Planner" or "Software Designer".

.002 Eskil Steenberg on designing for large projects

Note: E. Steenberg talks about very big projects, so not everything may apply to every code base.

Key points that are important (to me)

  • wrap every module you don't own/control inside a custom platform layer (sketched below)
  • split the code into modules, each with exactly one responsible developer
  • a developer can own multiple modules, but a module cannot have multiple developers
  • keep your APIs consistent
  • if an API needs modifications, consider extending it with a new module instead

E. Steenberg talks about "plugins" where I referred to "modules". In his case, he meant actual *.dll files.
In my case, I would keep the modularity inside the code ... but I have to think more about it.
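
A minimal sketch of such a platform layer in Python (assumptions: the module name platform_http and the requests dependency are my illustration, not from the talk). The rest of the code base imports only the wrapper, so replacing the third-party library touches exactly one file:

# platform_http.py - the only file allowed to import the external library;
# everything else in the code base calls http_get() instead
import requests

def http_get(url, timeout_s=5.0):
    # if we ever swap `requests` for another library (or our own socket
    # code), only this module changes - the API stays stable
    response = requests.get(url, timeout=timeout_s)
    response.raise_for_status()
    return response.content

The same pattern scales up to whole subsystems (file I/O, networking, graphics): the wrapper is the API you control.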

2 Security in Software

.001 Felix "Fefe" von Leitner's Talks

.01 Trusted Computing Base (TCB)

tbd

.02 OS Privileges

tbd

.03 Security in general

tbd

.04 Writing Secure Software

tbd

.002 Security Engineering

tbd

.003 Structured and immutable Logging

In some systems it is important that logs are structured so we are able to correlate events.
Some logs, like audit logs, must also be protected against tampering, e.g. using hash chains (Merkle trees), append-only permissions, and timestamps from an NTP server.

The following Mastodon thread is from Kris.

Post 1 Post 2 Post 3 Post 4

In case those posts disappear at some point, I'll summarize them here:

  • structured logging is important - i.e. date, host, pid, program, error, ...
  • NTP-synced timestamps
  • some logs are more important than others, like accounting or audit logs
  • important logs need structured fields like: subject="Bob", role="admin", method="sudo"
  • if violations are detected inside those logs, (disciplinary) action must be taken
  • important logs must be append-only and protected against tampering
  • a common approach for important logs is a Merkle tree / hash chain (see the sketch after this list)
  • journald provides such utilities (e.g. Forward Secure Sealing, verifiable with journalctl --verify)
  • the program shouldn't have direct control over the logging; it should use an API for that
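
A minimal sketch of a hash-chained, append-only log in Python (assumptions: the file name, field layout, and SHA-256 are my illustration, not from Kris's posts). Each entry commits to the hash of its predecessor, so rewriting any old entry invalidates every later hash:

import hashlib
import json
import time

def append_entry(path, prev_hash, **fields):
    entry = {"ts": time.time(), "prev": prev_hash, **fields}
    payload = json.dumps(entry, sort_keys=True).encode()
    entry_hash = hashlib.sha256(payload).hexdigest()
    # append-only: enforce on top of this with file permissions / chattr +a
    with open(path, "a") as f:
        f.write(json.dumps({"hash": entry_hash, **entry}) + "\n")
    return entry_hash

# usage: thread the previous hash into each new entry
h = append_entry("audit.log", "0" * 64, subject="Bob", role="admin", method="sudo")
h = append_entry("audit.log", h, subject="Bob", role="admin", method="su")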

3 Network

.001 Measuring Network Latency

Key points:

  • most users will experience the 99th percentile
  • we want 99.9... percentiles
  • most measuring tools hide the important data
  • when load testing, measure latency at every load step, not only at max load
  • find out when the tipping point occurs
  • define your goals under specific loads
  • widespread problem: coordinated omission

Coordinated Omission
describes scenarios where the important data is masked.
For example, you're measuring latency within your code, but some runtime-induced lag hits (GC, buffer flush, thread block, context switch). Then your measurement is incorrect, since it might miss or skew your data.

Or you send a request per $time_interval, but a request hangs for a bit, so your interval is skewed.
Which is ironic, because maybe your server caused the hanging, but you never notice it, since your tool just backs off and proceeds with the next request.

Formula for the chance (in %) that a user making n requests experiences the 99th percentile at least once:

def f(n):
    # chance (in %) that at least one of n requests falls in the slowest 1 %
    return (1 - (0.99 ** n)) * 100
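
For example, f(1) = 1 %, but f(100) ≈ 63.4 %: a user making 100 requests (a single page load can easily contain that many) has about a two-in-three chance of hitting the 99th-percentile latency at least once.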

Service Time vs Response Time
Service time -> the server is doing its stuff (calculating things, transforming data, talking to the DB)
Response time -> the client is waiting for something the server processes, including any time the request spends queued

In some scenarios only the service time gets measured, which appears constant,
while the response time degrades and nobody sees it. So also account for the response time in your measurements.
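
A small sketch of the distinction in Python (assumption: the queued-worker setup and all names are mine, just to make the two numbers concrete). A measurement hook that only wraps do_work() would report a constant ~10 ms while clients queue up and wait far longer:

import time
from collections import namedtuple

Request = namedtuple("Request", "enqueued_at")

def do_work(request):
    time.sleep(0.01)  # stand-in for the real processing

def handle(request):
    dequeued = time.monotonic()
    do_work(request)
    done = time.monotonic()
    service_time = done - dequeued              # server-side processing only
    response_time = done - request.enqueued_at  # includes the time spent queued
    return service_time, response_time

# a request that already sat in a queue for 50 ms before being handled:
req = Request(enqueued_at=time.monotonic() - 0.05)
print(handle(req))  # service_time ~0.01 s, response_time ~0.06 s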

CTRL+Z Test
Pressing CTRL+Z suspends a running foreground task. You can resume it via fg <job nr>, and jobs shows which jobs are suspended and their <job nr>.

Stopping your server under load can reveal the presence of "coordinated omission".

Because if we measure the latency every 1 ms for 100 s, we have 100_000 data points. If we then stop the server for another 100 s, often the monitoring system will only store one data point for that period, since the server does not respond and the measuring tool just waits.

This skews the percentiles and hides the really bad stuff. The correct way would be to continue the measurement even if the server is not responding - then the data reveals the problem.
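
A minimal sketch of a coordinated-omission-aware measurement loop in Python (assumptions: a single-threaded closed-loop tester; send_request is a hypothetical placeholder for the real call). The trick is measuring each latency against the intended send time, so a hung server shows up as many large samples instead of one missing data point:

import time

INTERVAL = 0.001  # intended send rate: one request per millisecond

def measure(send_request, duration_s):
    latencies = []
    start = time.monotonic()
    next_send = start
    while next_send < start + duration_s:
        send_request()
        done = time.monotonic()
        # measure against the *intended* send time, not the actual one:
        # after a hang, the backlog of delayed sends records large latencies
        latencies.append(done - next_send)
        next_send += INTERVAL
        time.sleep(max(0.0, next_send - time.monotonic()))
    return latencies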

.002 About Latency #2

Quoting important sections of the blog here.

Latency is defined as the time it took one operation to happen. This means every operation has its own latency—with one million operations there are one million latencies.

As a result, latency cannot be measured as work units / time. What we’re interested in is how latency behaves. To do this meaningfully, we must describe the complete distribution of latencies.

Latency almost never follows a normal, Gaussian, or Poisson distribution, so looking at averages, medians, and even standard deviations is useless.

Remember that latency is not service time. If you plot your data with coordinated omission, there’s often a quick, high rise in the curve.

Run a “CTRL+Z” test to see if you have this problem. A non-omitted test has a much smoother curve. Very few tools actually correct for coordinated omission.

4 Programming

5 Management / Company

.001 Managing People: What works, what doesn't

tbd

  • Book: Peopleware: Productive Projects and Teams
  • Authors: Tom DeMarco, Tim Lister
  • ISBN-13: 978-0321934116

.002 Why people don't want to work at your company

tbd