Binary self-contained logs for modern high-throughput systems

The number of large software systems in the world is growing steadily. Old first-generation solutions are improved and made more sophisticated. Completely new areas open up for automation. More systems, more lines of code. More complexity everywhere. And, inevitably, more questions about their behaviour. And, of course, more errors, from bugs in the systems themselves to malfunctions in external dependencies and systems.

There are various tools creators use to deal with the complexities of their systems. One of the fundamental ones is known as logging. Similar to the sea captains of old, who kept meticulous records of their voyages in the ship's log, allowing those who followed to know more about the state of the currents and winds, modern software developers add their own log entries at various points of processing. In practical terms, when zooming in to the level of individual source code lines, one is likely to encounter constructions similar to this:

  if (!string.IsNullOrEmpty(rejectReason)) {
      logger.info("Reject order {0} because '{1}'", order.id, rejectReason);
      order.reject(rejectReason);
  }

The "logger.info" line is what creates a log entry. Where and how that entry is created is the whole point of this article.

In the most common scenario, the call above will generate a string looking something like this:

  2018-09-16 20:43:44,653 INFO OrderProcessing.xxx:35 [MainThread] Reject order e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 because 'Instrument not found'

This string contains lots of useful information. Aside from the actual message we constructed in the function call, it also has the timestamp, the source code location the log entry derives from, the severity level and the invoking thread. All this will be useful later, when there are questions to answer and the sequence of events has to be reconstructed.

Next, this string will be sent to a sink. Most often this means a file. Simply put, nothing beats files for storing reams of data that must be persisted and might be searched, but is usually accessed sequentially. And this is indeed how we most often go through the logs - line by line - although all sorts of alternative scenarios happen all the time.

Storing in files isn't, unfortunately, free. Nothing good is free in this world, as the reader has surely realised through her own experience by now. There are two specific costs we have to deal with here. First, actually writing a log line into a file slows things down. We don't need a blow-by-blow analysis of the various caching techniques that the standard libraries and the underlying OS use to make disk I/O more palatable. We won't even go into the specific performance improvements that logging library developers employ, such as batching, or writing logs on separate threads. The bottom line is: the more information we want to log, the more it will slow the system down, all other things being equal.

Second, once written, the log entries occupy space. Sometimes a lot of space. To give an example, a reasonably typical banking application I used to work on would generate roughly 50GB of log data per day, five days a week excluding holidays. That is many terabytes a year. And this is just one of the many, many applications the bank ran daily. Things are similar in other places. True, storage space today is way cheaper and more readily available than ever before. But it isn't free. And it isn't infinite. And the more storage you use, the more effort you need to dedicate to managing it.

Thus, we reach our initial conclusion - logs are great, but would be even greater if producing and storing them wasted fewer valuable resources, namely execution time and storage space.

This is where my idea of the binary logging comes in.

Writing log entries

The idea isn't very complicated. Neither is it original, or mine for that matter. It is used, as far as I'm aware, in a number of embedded systems, where space and time are at a much higher premium than in many other modern applications. To explain it, let's update our original code example like this:

  if (!string.IsNullOrEmpty(rejectReason)) {
      logger.info(new OrderRejectLogMessage(order.id, rejectReason));
      order.reject(rejectReason);
  }

How is this snippet different? As you can see, we now have a new type called OrderRejectLogMessage. Logically, this type represents the same message we were writing out explicitly before. Its definition can be imagined to be something like this:

  class OrderRejectLogMessage : BaseLogMessage {
      static readonly string MessageTemplate = "Reject order {0} because '{1}'";

      private string id;
      private string message;

      public OrderRejectLogMessage(string id, string message) {
          this.id = id;
          this.message = message;
      }
  }

Now, when the log entry is created by calling "logger.info", we will no longer construct a string as above. Instead, we will create its binary representation, taking up as little space as possible. How little? Let's estimate. For the metadata we will need to store, first of all, the time stamp - an 8-byte number; next, the severity level, which we can safely fit into one byte; next, the thread name, which we can do in only 2 bytes by indexing (more on what that means in a moment); next, the location in the source code - 4 bytes, again by indexing. We will also need to store an indicator of which message class this entry belongs to - that will also take 4 bytes through indexing. Overall this gives us 19 bytes. We will also prefix the entry with a record type indicator (more on that in a moment too), which will be one byte. 20 bytes in total for the metadata. In our previous version the same data took us 66 bytes. 46 bytes saved.
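The fixed-size metadata described above can be sketched concretely. The following Python illustration packs the five fields into exactly 20 bytes; the field order, the 'L' record prefix and the function names are my own illustrative choices, not a prescribed format:

```python
import struct
import time

# Layout (little-endian), matching the estimate in the text:
#   1 byte  record type ('L' marks a log entry)
#   8 bytes timestamp (milliseconds since the Unix epoch)
#   1 byte  severity level
#   2 bytes thread-name index
#   4 bytes source-location index
#   4 bytes message-type index
HEADER = struct.Struct("<cQBHII")

def pack_header(level, thread_idx, location_idx, msg_type_idx, ts_ms=None):
    if ts_ms is None:
        ts_ms = int(time.time() * 1000)
    return HEADER.pack(b"L", ts_ms, level, thread_idx, location_idx, msg_type_idx)

header = pack_header(level=20, thread_idx=1, location_idx=7, msg_type_idx=3)
print(len(header))  # 20 bytes, versus roughly 66 for the textual prefix
```

The struct layout is self-documenting: anyone decoding the file only needs to know these six field widths to read any entry's metadata.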

Next, we will need to write the instance data for OrderRejectLogMessage. Since it has only string fields, the binary output will be the same size as the string representation. However, if the relevant parameters were of different types, such as int or double, which have more efficient binary representations, we could save even more here. And we don't actually store the template, which in our case takes 25 bytes. So, 71 bytes saved on one message.
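The savings on non-string fields are easy to demonstrate. A quick Python comparison of textual versus fixed-width binary encodings (the specific wire format here is just an illustration):

```python
import struct

# A double rendered as text grows with its precision; in binary it is
# always 8 bytes.
price_text = b"12345.6789"                    # 10 bytes as text
price_binary = struct.pack("<d", 12345.6789)  # 8 bytes, always
print(len(price_text), len(price_binary))

# A 32-bit int as text takes up to 10-11 characters; in binary, 4 bytes.
qty_text = b"1000000"                    # 7 bytes as text
qty_binary = struct.pack("<i", 1000000)  # 4 bytes, always
print(len(qty_text), len(qty_binary))
```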

Let's now explain the earlier reference to indexing, starting with thread ids. In a process, there can be only so many threads over its lifetime. Instead of writing the full name of the thread, we can create an index, which will assign a number to each unique name (or system-level thread id), and then refer to this number in our log entries. I think that 2 bytes should be enough for a typical application.

Now, how will it work in practice? The logging library will keep a runtime map of all thread names with their assigned shortcut ids. When it sees a new name, it will create a new entry in the map, and it will also write a special index entry into the log file. This index entry will start with a different prefix from the log entries, and it will contain all the information necessary to define the mapping of the new name. The size of this index entry isn't of great importance, since such entries will occur very infrequently. The same logic applies to source code positions, with the caveat that there will potentially be many more of those (but still only one entry per source code log call site), hence the 4-byte allocation.

The last item to address is the encoding of the message type - OrderRejectLogMessage in our case. This will be implementation specific. In our code examples I used C#-like syntax; let's stick with this language for the sake of the example. A simple way to capture the message type is to apply the same indexing technique to the full type name of the log entry. It would be just as simple in many other languages that support reflection. Things would be more complicated in C++, for example, but there, again, an implementation-specific solution can be found.

Reading log entries

So far so good - we have created millions of log entries in the new binary format and saved ourselves a lot of bytes in storage as well as many CPU cycles. Now, a support request comes in and it is time to look at some of these entries. As you have probably guessed already, we must have a utility able to read the binary format we generated earlier. Luckily, the log file is self-describing (with an important caveat we'll discuss in a moment) and contains all the information we need to decode it. The metadata is of fixed, known size. Also, remember, we stored each index entry back into the file, so it's waiting there to be read. For each log entry, as soon as we read the message type, we know how many parameters it contains and of which types, so we can read those in a predictable manner.

The main remaining question, as has been hinted before, is the exact format of the message types, such as OrderRejectLogMessage. How do we describe these? To reiterate, this will be implementation-specific. For example, in a reflection-capable language like C# or Java, we can simply read the class metadata at the first creation of an instance. Our library implementation could apply simple restrictions on the type. For instance, it must derive from a predefined base class and only from it. It should have only instance member variables of certain types (primitive types plus any number of classes we elect to support), as well as a static member variable of type string called MessageTemplate. This template, plus the types of all the instance member variables, will then be dumped into our log file in a special entry type describing our message type. This will be done only once, obviously, and will be part of the message type index.
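What such a one-off type description might contain can be sketched as follows. This is a Python stand-in (Python for brevity; a C# version would use real reflection over the class's fields), and the explicit `fields` list is my own simplification of what reflection would discover:

```python
# Illustrative message type, mirroring the C#-like example in the text.
class OrderRejectLogMessage:
    MessageTemplate = "Reject order {0} because '{1}'"
    fields = [("id", str), ("message", str)]  # stand-in for reflected fields

    def __init__(self, id, message):
        self.id = id
        self.message = message

def describe(msg_type):
    # Built once per type and written into the log file's type index;
    # a reader uses it to decode every later entry of this type.
    return {
        "name": msg_type.__name__,
        "template": msg_type.MessageTemplate,
        "fields": [(n, t.__name__) for n, t in msg_type.fields],
    }

print(describe(OrderRejectLogMessage))
```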

In other languages, such as C++, this will take more effort, but it can be done. Ideally, the format should be language independent, so that logs written by different implementations of the concept can be read by a single tool.

Pros and cons

Let’s now analyse the advantages and the disadvantages of the proposed binary logging format.

Pros

The obvious two have already been mentioned - we save computation time and storage space. But this is not all! There are at least two other useful features in this idea. First of all, instead of generic free-form log strings, we have moved to structured, named records! These are much easier to analyse and mine for data automatically. Log entries are, naturally, parsed and analysed on a daily basis already. However, a small change in the code that generates them often results in cascading follow-up breakages down the pipeline of log parsing and handling. Even something as trivial as correcting a spelling mistake might break a regex logstash parser someone lovingly built balancing their laptop on their knee. With structured messages, a simple change in the wording of the log message itself won't affect automatic analysis. Even if we change the fields a bit, it might still be possible to accommodate this automatically, depending on the type of the change. And the log entries are already classified, automatically! We can, of course, add more metadata to the message types, such as grouping information and so on.

Second, the usage of message types encourages good habits. For example, the designers of a system that often deals with objects of type Order might insist that every log message pertaining to an instance of such an object contain its id. With generic strings there is no way of enforcing this. With message types, it's a matter of requiring an Order argument in the constructor.
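The enforcement point is worth seeing in code. A minimal sketch (Python, with made-up class shapes): because the message type's constructor demands an Order, the id simply cannot be omitted:

```python
class Order:
    def __init__(self, id):
        self.id = id

class OrderRejectLogMessage:
    # The constructor takes the Order itself, so every instance of this
    # message is guaranteed to carry the order's id - a guarantee no
    # free-form format string can give.
    def __init__(self, order, reason):
        self.order_id = order.id
        self.reason = reason

msg = OrderRejectLogMessage(Order("e3b0c442"), "Instrument not found")
print(msg.order_id)  # e3b0c442
```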

Cons

The most obvious drawback of this proposal is the loss of immediate human readability. Traditional log entries can, at least in theory, be read as they are. In practice, of course, understanding what a log line actually means requires prior knowledge of the system as well as access to the source code. Still, at least the information can be scanned by the naked eye with nothing more than a text editor or a command-line paging utility. The necessity of an additional tool to open and decode the log file does add new complexity.

Another important factor is the increased fragility of the log file itself. Traditional log files are very resilient to local damage. If somehow a few lines in a very large file get corrupted, the reader can simply skip them and resume reading at the next undamaged line. Unless the problem under investigation happens to be related to the affected area, this is no big deal. With a binary format, even a single mangled byte can render the rest of the file useless. This can probably be addressed with a "smarter" format containing some error checking and recovery information, as well as any number of existing techniques. After all, developers of databases and many other solutions that use binary files for storage have been working on such problems for years.
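One well-known technique is to frame every record with a magic byte, a length and a checksum, so a reader can detect a damaged record and resynchronise on the next magic byte. A Python sketch, with a made-up magic value and layout, purely to show the recovery mechanics:

```python
import struct
import zlib

MAGIC = b"\xA5"  # arbitrary illustrative marker byte

def frame(payload):
    # Record = magic (1) + CRC32 (4) + length (2) + payload.
    return MAGIC + struct.pack("<IH", zlib.crc32(payload), len(payload)) + payload

def try_unframe(buf, pos):
    """Returns (payload, next_pos), or (None, pos + 1) to resynchronise."""
    if buf[pos:pos + 1] != MAGIC or pos + 7 > len(buf):
        return None, pos + 1
    crc, length = struct.unpack_from("<IH", buf, pos + 1)
    payload = buf[pos + 7:pos + 7 + length]
    if len(payload) != length or zlib.crc32(payload) != crc:
        return None, pos + 1
    return payload, pos + 7 + length

# A stream with corruption injected between two valid records:
data = frame(b"entry-1") + b"\x00garbage" + frame(b"entry-2")
pos, recovered = 0, []
while pos < len(data):
    payload, pos = try_unframe(data, pos)
    if payload is not None:
        recovered.append(payload)
print(recovered)  # [b'entry-1', b'entry-2']
```

The cost is 7 bytes of framing per record; the benefit is that corruption destroys only the records it touches, restoring the "skip and resume" property of text logs.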

Conclusion

The binary self-contained log writing approach might prove a useful tool for an organisation with large volumes of log data. It reduces processing times and storage requirements, while encouraging better coding practices and providing more structured output than the traditional log libraries. All this is achieved at a manageable cost.
