Binary self contained structured logs for fun and profit
By andre
The number of large software systems in the world is growing steadily. Old first generation solutions get improvements and grow more sophisticated. Completely new areas open up for automation. More systems, more lines of code. More complexity everywhere. Inevitably, more questions about the behaviour - why did it do that??? And, of course, more errors, from bugs in the systems themselves to malfunctioning of external dependencies and processes.
There are various tools that software developers can utilise in order to deal with the complexities of their systems. One of the core ones is known as logging. The sea captains of the old kept meticulous records of their voyages in the ship’s log, allowing those who followed to know more about the state of the currents and the winds, the lands and their natives. In similar fashion, software developers of today add their own log entries at various points of execution to indicate what the program is doing or planning to do, together with other relevant morsels of information. At some later point these records can be examined in order to learn what the system was doing at that particular time.
In practical terms, when zooming in to the level of individual source code lines, one is likely to encounter constructions similar to this:
if (!string.IsNullOrEmpty(rejectReason)) {
logger.info("Reject order {0} because '{1}'", order.id, rejectReason);
order.reject(reason);
}
The “logger.info” line is what creates a log entry. Where and how is the whole point of this article.
In most common scenario the above statement will generate a string, looking something like this:
2018-09-16 20:43:44,653 INFO OrderProcessing.xxx:35 [MainThread] Reject order e3b0c44298fc1 because 'Instrument not found'
This string contains lots of useful information. Aside from the actual message we constructed in the function call, it also has the time stamp, the code location this log entry derives from, the severity level and the invoking thread. All this will be useful when reconstructing the sequence of events later, when there are questions to answer.
Next, this string will be sent to a “sink”. Most often this means a file on a local disk. Files are great for storing reams of data, when it must be persisted and might be searched, but is usually accessed sequentially and not often. And this is indeed how we most often go through the logs - line by line, although all sorts of alternatives scenarios happen all the time.
Writing into files isn’t, unfortunately, free. Nothing good is free in this world, as the reader has already realised through her own experience by now. There are two specific important costs we have to deal with here. First, actually writing a log line into a file slows things down. We don’t need to do a blow by blow analysis of the various caching techniques that the standard libraries and the underlying OS use to make disk I/O more palatable. We won’t even go into the specific performance improvement steps the logging libraries developers employ, such as batching, or writing logs on separate threads. The bottom line is - the more information we want to log, the more it will slow the system down, all other things being equal.
Second, once written, the log entries occupy space. Sometimes a lot of space. To give an example, a typical banking application I used to work on would generate roughly 50GB of log data per day, five days a week excluding holidays. This adds up to many terabytes a year. And this is just one of many many applications the bank had, running daily. Things are similar in other places. True, storage space today is way cheaper and more readily available than ever before. But it isn’t free. And it isn’t infinite. And the more storage you use, the more effort you need to dedicate to managing it.
Thus, we reach our initial conclusion - logs are great, but would be even greater, if producing and storing them would waste fewer valuable resources - namely, execution time and storage space.
This is where my idea of the binary logging comes in.
Writing log entries
The idea isn’t very complicated. Neither is it original, nor mine for that matter. It is being used, as far as I’m aware, in a number of embedded systems, where space and time are at a much higher premium that in many other modern software development domains. In order to understand the idea, let’s update our original code example like this:
if (!string.IsNullOrEmpty(rejectReason)) {
logger.info(new OrderRejectLogMessage(order.id, rejectReason));
order.reject(reason);
}
How is this snippet different? As you can see, we now have a new type called OrderRejectLogMessage. Logically, this type represents the same message as we were writing out explicitly before. Its definition can be imagined to be something like this:
class OrderRejectLogMessage: BaseLogMessage {
static string readonly MessageTemplate = "Reject order {0} because ‘{1}’”;
private string id;
private string message;
public OrderRejectLogMessage(string id, string message) {
this.id = id;
this.message = message;
}
}
Now, when the log entry is created by calling “logger.info”, the program will no longer construct a string as above. Instead, it will create its binary representation, taking up as little space as possible. How little? Let’s estimate.
For the metadata, i.e. everything but the actual message itself, we will need to store the time stamp first of all, that will be an 8 bit number; next - the level of severity, which we can safely fit into one byte; next - thread name - we will do with 2 bytes only by indexing (more on what it means in a moment); next - the location in the source code - by using indexing, 4 bytes. We will also need to store the indicator of which message class this entry belongs to - that will also take 4 bytes through indexing. Overall this gives us 19 bytes. We will also prefix the entry with a record type indicator (keep reading), which will be one byte. 20 bytes in total for the metadata. In our previous version the same data took us 66 bytes. 46 bytes saved.
Next, we will need to write the instance data for OrderRejectLogMessage, that is, the actual log message. Since it has only string fields, the binary output won’t save us much per individual entry. However, if the relevant parameters were of different type, such as int or double, possessing more efficient binary representation, we could save even more here. And we don’t actually store the template, which in our case takes 25 bytes. So, 71 bytes saved on one message.
Let’s explain now the earlier reference to indexing, by starting with thread ids. In a process, there can be only so many threads over its lifetime. Instead of writing the full name of the thread, we can create an index, which will assign a number to each unique name, or system level thread id, and then refer to this number in our log entries. I think that 2 bytes should be enough for a typical application.
Now, how will it work in practice? The logging library will keep a runtime map of all thread names with their assigned shortcut ids. When it sees a new name, it will create a new entry in the map, and it will also create a special index entry in the log file. This index entry will start with a different prefix from the log entries. And it will contain all the necessary information to define the mapping of this new name. The size of this index entry isn’t of great importance, since they will occur very infrequently. The same logic applies to the source code positions, with the caveat that there will be potentially many more of those (but still only one entry per source code log entry creation call), hence 4 bytes allocation.
The last item to address is the encoding of the message type, OrderRejectLogMessage in our case. This will be implementation specific. In this article’s code examples I have used C#-like syntax. Let’s stick with this language for the sake of consistency. A simple way to capture the message type is by using the former indexing technique on the full type name of the log entry. It would be as simple for many other languages that support reflection. Things would be more complicated in C++ or Rust, but there, again, implementation-specific solution can be found.
Reading log entries
So far so good - we created millions of log entries in the new binary format and saved ourselves a lot of bytes in storage as well as many CPU cycles. Now, a support request comes in and it is time to look at some of these entries. You have probably guessed already, we must have a utility, that would be able to read the binary format we generated earlier. Luckily, the log file is self describing (with an important caveat we’ll discuss in a moment) and contains all the information we need to decode it, with a help from a small utility. The metadata is of fixed and known size. Also, remember, we stored each index entry back into the file, so it’s waiting there to be read. For each log entry, as soon as we read the message type, we know how many parameters and of which types in contains, so we can read those in an unambiguous manner.
The main question is, as I have hinted before, is the exact format of the message types, such as OrderRejectLogMessage. How do we describe these? To reiterate, this will be implementation-specific. For example, in a reflection-capable language like C# or Java, we can simply read the class metadata at the first creation of an instance. Our library implementation could apply simple restrictions on the type. For instance, it must derive from a predefined base class and have no other parents or implement any other interfaces. It should have only class member variables of certain types (primitive types + any number of classes we elect to support), as well as a static member variable of type string called Format. This Format plus all the types of the instance member variables will then be dumped into our log file in a special entry type describing our message type. This will be done only once, obviously, and will be part of the message type index.
In other languages, such as C++,or Rust, we will have to do more work to achieve the same result, but it can be done with sufficient forethought and effort. Ideally, the format should be language independent, so that logs written by different implementations of the concept can then be read by a single tool.
Pros and cons
Let’s now analyse the advantages and the disadvantages of the proposed binary logging format.
Pros
The obvious two have already been mentioned - we save computation time and storage space. But this is not all of it! There are at least two other additional useful features in this idea. First of all, instead of having a generic free form log strings, we have moved to structured, named records! This can be invaluable in some environments, that need to deal with many complicated support queries! Structured records are much easier to process, analyse and mine for data automatically. Log entries are, naturally, parsed and analysed already on daily basis. Elasticsearch, a widely popular indexing solution, is often used as the storage and query tool for the resulting data. However, often a small change in the code that generates log lines often results in cascading follow up misalignments down the pipeline of log parsing and handling. Even something as trivial as spelling mistake correction might break a regular expression for logstash parser that someone lovingly built balancing their laptop on the left knee. With structured messages a simple change in the wording of the log message itself won’t affect automatic analysis. Even if we change the fields a bit, this might still be possible to adjust automatically, depending on the type of the change. And the log entries are already classified, naturally! We can, of course, add more metadata to the message types, such as categories, tags etc. And of course, the actual task of parsing and storing the log entries becomes so much easier.
In summary, with regular logging we take structured objects, such as orders, products, users and so on, and convert them into formless strings, only to try and convert them back, using tools or in our heads, back into some structure. The binary logging approach can prevent the loss of structure, making support so much easier.
Second, the usage of message types encourages good habits. For example, the designers of a system that often deals with objects of type Order might insist, that every log message pertaining to an instance of such an object would contain its id. However, with generic strings there is no way of enforcing this. With message types, it’s just a matter of requiring Order argument in the constructor.
Cons
The most obvious drawback of this proposal is the loss of immediate human readability. The traditional log entries can, at least in theory, be read as they are. In practice, of course, to understand what a log line actually means requires prior knowledge of the system and often access to the source code. Still, at least the information can be scanned by the naked eye with nothing more than a text editor or a command line paging utility. The necessity of having an additional tool to actually open and decode the log file does add new complexity.
Another important factor is the increased fragility of the log file itself. Traditional log files are very resilient to local damage. If somehow a few lines in a very large file get corrupted, the reader can simply skip them and resume reading at the next undamaged line. Unless the investigated problem happens to be related to the affected area, this is not a big deal. With binary format, even a single mangled byte can render the rest of the file useless. This probably can be addressed with “smarter” format, containing some error checking/recovery, as well as any number of techniques already in existence. After all, developers of databases and many other solutions, using binary files for storage, have been working on such problems for years.
Conclusion
The binary self contained log writing approach might prove to be a useful tool for an organisation with large volumes of log data and high requirements in supportability. It will reduce the processing times and storage requirements, while imposing better coding practices and providing more structured output than the traditional log libraries. All this is achieved at a manageable cost.