By Sergey Teplyakov November 21, 2016 6:00 PM
Checklist for Developers of Robust Enterprise Services

After reading Release It! I wanted to capture several thoughts about developing distributed enterprise software. This is a list of things you need to address in order to build better systems:

  1. Everything that can fail will fail

    A database will go down; a remote web service will become slow; credentials will expire; Azure will throttle your requests… Any communication with an external system should account for this. Unfortunately, this is easier said than done, because modern libraries for distributed communication are too damn complex: just by looking at the API they expose you cannot reason about what can go wrong, which exceptions can be thrown and how you need to handle them.
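
    As a minimal sketch of what “accounting for this” means in practice (the timeout, the retry count and the backoff below are arbitrary, and GetWithRetryAsync is just an illustrative name):

        using System;
        using System.Net.Http;
        using System.Threading.Tasks;

        static class RemoteCalls
        {
            // A single shared client with an explicit timeout: a hanging call is also a failure.
            private static readonly HttpClient Client =
                new HttpClient { Timeout = TimeSpan.FromSeconds(5) };

            // Retries the call a few times when it throws or returns a non-success status,
            // instead of letting a single transient hiccup fail the whole operation.
            public static async Task<string> GetWithRetryAsync(string url, int maxAttempts = 3)
            {
                for (int attempt = 1; attempt <= maxAttempts; attempt++)
                {
                    try
                    {
                        var response = await Client.GetAsync(url);
                        response.EnsureSuccessStatusCode(); // throws HttpRequestException on 4xx/5xx
                        return await response.Content.ReadAsStringAsync();
                    }
                    catch (HttpRequestException) when (attempt < maxAttempts) { }
                    catch (TaskCanceledException) when (attempt < maxAttempts) { } // timeout

                    // Exponential backoff before the next attempt.
                    await Task.Delay(TimeSpan.FromMilliseconds(200 * Math.Pow(2, attempt)));
                }

                // Unreachable: the last attempt either returns or lets the exception propagate.
                throw new InvalidOperationException("unreachable");
            }
        }

    A real service would be more selective about which failures are worth retrying (a 404 is permanent, a 429 or a 503 usually is not), but even a sketch like this forces you to decide up front what a failure means for the caller.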

  2. Fail fast

    There is nothing worse in a system than slowness. If something is broken, no one will be able to use it. But if it works, just really slowly, it’s not clear what to do with it or where the problem lies. It can be temporary; it can be due to a bad connection and may have nothing to do with a failure of the system itself.

    That’s why, if a system is overloaded, it’s better to reply explicitly with “Too Many Requests” than to let every user struggle while their queries are processed very slowly.

    [Zootopia – Meet the Sloth]
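
    One way to fail fast under load is to bound the number of requests being processed concurrently and reject everything above the limit immediately instead of queueing it. A minimal sketch (the limit of 100 is arbitrary, and the process delegate stands in for real request handling):

        using System;
        using System.Net;
        using System.Threading;
        using System.Threading.Tasks;

        sealed class LoadSheddingHandler
        {
            // At most 100 requests in flight; everything above that is rejected immediately.
            private readonly SemaphoreSlim _slots = new SemaphoreSlim(100);

            public async Task<HttpStatusCode> HandleAsync(Func<Task> process)
            {
                // Wait(0) does not queue: it either takes a slot right now or returns false.
                if (!_slots.Wait(0))
                    return (HttpStatusCode)429; // Too Many Requests: fail fast instead of getting slow

                try
                {
                    await process();
                    return HttpStatusCode.OK;
                }
                finally
                {
                    _slots.Release();
                }
            }
        }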

  3. Use the pattern Circuit Breaker

    Any processing queue can get overloaded, whether it is a local queue for writing to a log file or the stream of incoming requests. Every queue should have a way to stop accepting new messages.

    For example, a bounded BlockingCollection in .NET will simply block attempts to add new elements once it is full, and for HTTP request processing there is status 429 – Too Many Requests.
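
    For instance, a bounded queue can also reject new work explicitly instead of blocking the producer, along these lines (the capacity of 1000 is arbitrary):

        using System.Collections.Concurrent;

        static class BoundedLogQueue
        {
            // A log-message queue that holds at most 1000 items.
            private static readonly BlockingCollection<string> Queue =
                new BlockingCollection<string>(boundedCapacity: 1000);

            public static bool TryEnqueue(string message)
            {
                // TryAdd does not block: it returns false right away when the queue is full,
                // so the caller can drop the message, count the overflow, or push back.
                return Queue.TryAdd(message);
            }
        }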

    The Circuit Breaker pattern helps to implement the kind of logic that skips the communication with a remote service when it is not doing well. The idea of the pattern is that calls to a remote service (or any other resource) are decorated with an object that can be in one of two states: closed, when the decorated resource is operating normally, or open, when the decorated resource is unavailable or overloaded.

    If the circuit breaker is in the open state, no calls to the decorated resource are made; instead, an error is returned immediately. The transition to the open state can occur after, say, N consecutive errors, and the transition back to the closed state occurs when the resource is operating normally again (to detect this, the circuit breaker, while open, still makes periodic probe calls to the decorated resource to find out whether it is available again).
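
    A minimal, single-threaded sketch of the idea (the threshold and the retry interval are arbitrary; real implementations also deal with thread safety and limit the number of probe calls):

        using System;

        sealed class CircuitBreaker
        {
            private readonly int _failureThreshold;
            private readonly TimeSpan _retryInterval;
            private int _consecutiveFailures;
            private DateTime _openUntil = DateTime.MinValue;

            public CircuitBreaker(int failureThreshold = 5, TimeSpan? retryInterval = null)
            {
                _failureThreshold = failureThreshold;
                _retryInterval = retryInterval ?? TimeSpan.FromSeconds(30);
            }

            public T Execute<T>(Func<T> call)
            {
                // Open: fail immediately instead of hammering a resource that is already down.
                // Once the retry interval has passed, the next call acts as a probe.
                if (DateTime.UtcNow < _openUntil)
                    throw new InvalidOperationException("Circuit is open; the resource is unavailable.");

                try
                {
                    T result = call();
                    _consecutiveFailures = 0; // success closes the circuit again
                    return result;
                }
                catch
                {
                    // N consecutive errors open the circuit for the retry interval.
                    if (++_consecutiveFailures >= _failureThreshold)
                        _openUntil = DateTime.UtcNow + _retryInterval;
                    throw;
                }
            }
        }

    Wrapping every call to a flaky dependency in Execute turns a long chain of slow failures into an immediate, well-defined error.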

  4. Use caching responsibly

    When I was working on AppInsights, our colleagues added a caching mechanism. That was good. What was bad was that the caching mechanism was not transparent: instead of caching the results of real requests, the cache was warmed up during start-up and all requests were served from the cache only. As a result, every start of the service overloaded the backend, and a bug in the synchronization logic led to the cache always being stale for new users.

    Every cache should have a proper invalidation strategy. Without one it just becomes a huge memory leak.

    In addition to that, caching should never lead to correctness problems (like in the case above). Broken caching should result in worse performance, and that’s it.
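
    One shape of a cache that cannot break correctness is a read-through cache: a miss, an expired entry or a wiped cache simply means the next request goes to the backend again. A minimal sketch (the five-minute TTL is arbitrary, and a real cache would also evict old entries to bound its size):

        using System;
        using System.Collections.Concurrent;
        using System.Threading.Tasks;

        sealed class ReadThroughCache<TKey, TValue>
        {
            private readonly ConcurrentDictionary<TKey, (TValue Value, DateTime ExpiresAt)> _entries =
                new ConcurrentDictionary<TKey, (TValue Value, DateTime ExpiresAt)>();
            private readonly TimeSpan _ttl;

            // The TTL is the invalidation strategy; without one the cache is just a memory leak.
            public ReadThroughCache(TimeSpan? ttl = null) => _ttl = ttl ?? TimeSpan.FromMinutes(5);

            // The cache only sits in front of the real request; it is never the source of truth.
            public async Task<TValue> GetAsync(TKey key, Func<TKey, Task<TValue>> fetch)
            {
                if (_entries.TryGetValue(key, out var entry) && entry.ExpiresAt > DateTime.UtcNow)
                    return entry.Value; // hit: just a shortcut for the real call

                TValue value = await fetch(key); // miss or stale: ask the backend
                _entries[key] = (value, DateTime.UtcNow + _ttl);
                return value;
            }
        }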

  5. Log all remote communications

    There are certain things that will be helpful during debugging. For example, it is important to know which part of the system is failing; how much time every stage of communication takes (DNS resolution, authorization, request processing, etc.); what exactly happens on failed requests to a service; and whether caching is working and how many items are cached.
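
    Even a simple wrapper that measures and logs every remote call pays for itself during the first incident. A sketch (the log delegate stands in for whatever logging infrastructure you actually use):

        using System;
        using System.Diagnostics;
        using System.Threading.Tasks;

        static class LoggedCalls
        {
            // Wraps a remote call so that its duration and outcome are always recorded.
            public static async Task<T> CallAsync<T>(string operation, Func<Task<T>> call, Action<string> log)
            {
                var stopwatch = Stopwatch.StartNew();
                try
                {
                    T result = await call();
                    log($"{operation} succeeded in {stopwatch.ElapsedMilliseconds} ms");
                    return result;
                }
                catch (Exception e)
                {
                    log($"{operation} failed after {stopwatch.ElapsedMilliseconds} ms: {e.Message}");
                    throw;
                }
            }
        }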

  6. Configure logs right

    There are certain types of system failures that are hard to deal with even in managed systems: exceptions related to running out of memory, stack overflows and running out of disk space.

    A stack overflow manifests itself quite obviously: the application crashes. Running out of memory or disk space, on the other hand, can lead to very different consequences: weird errors while uploading builds, weird bugs during serialization/deserialization, and so on.

    To avoid this, you need to tune your logging: don’t log to the system disk, and set up a proper mechanism for log rotation.
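
    In practice log rotation comes from the logging library (NLog, Serilog and others support rolling files out of the box); the sketch below only illustrates the idea: write outside the system disk and roll the file once it grows past a limit (the path and the 10 MB limit are made up):

        using System;
        using System.IO;

        sealed class RollingFileLog
        {
            private const long MaxSizeBytes = 10 * 1024 * 1024; // roll after ~10 MB
            private readonly string _path;

            // Log to a data disk, never to the system disk.
            public RollingFileLog(string path = @"D:\logs\service.log")
            {
                _path = path;
                Directory.CreateDirectory(Path.GetDirectoryName(_path));
            }

            public void Write(string message)
            {
                var file = new FileInfo(_path);
                if (file.Exists && file.Length > MaxSizeBytes)
                {
                    // Keep exactly one archive; a real setup would keep N of them.
                    File.Copy(_path, _path + ".1", overwrite: true);
                    File.Delete(_path);
                }

                File.AppendAllText(_path, $"{DateTime.UtcNow:O} {message}{Environment.NewLine}");
            }
        }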

  7. Make a gray box out of your system

    There is nothing worse than trying to guess what’s going on without having enough information. There are standard ways to make a service more transparent:

    • Telemetry. Your own, or something like AppInsights, which allows collecting key indicators: usage patterns, average request processing time, number of requests, or standard metrics such as performance counters on Windows.
    • Performance counters. Any system has hundreds of counters, from I/O operations to the number of garbage collections.
    • Monitoring. The counters have to be monitored so that, for example, notifications are sent out when certain metrics leave the “normal” range.
    • Logs. It is important not to pollute the logs with irrelevant information. Instead, try to log messages that help you understand the current situation and take action: for example, overflows of various queues, pools and other important components. A minimal sketch of exposing such internal counters follows this list.
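
    A cheap way to start is a set of in-process counters that important components update, plus a health endpoint or a periodic log line that dumps them. A minimal sketch (the counter names are illustrative):

        using System.Collections.Concurrent;
        using System.Linq;

        static class ServiceCounters
        {
            private static readonly ConcurrentDictionary<string, long> Counters =
                new ConcurrentDictionary<string, long>();

            // Important components bump their counters: queue overflows, cache hits, failed calls...
            public static void Increment(string name) =>
                Counters.AddOrUpdate(name, 1, (_, value) => value + 1);

            // A health endpoint or a periodic log line can dump the snapshot,
            // turning the service from a black box into a gray one.
            public static string Snapshot() =>
                string.Join(", ", Counters.OrderBy(c => c.Key).Select(c => $"{c.Key}={c.Value}"));
        }

        // Usage, e.g. inside the components themselves:
        //   ServiceCounters.Increment("logQueue.overflow");
        //   ServiceCounters.Increment("backend.failedCalls");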

This list is far from being comprehensive but it is something to start with. Thanks for reading.