Logging Operations And Format

Logging is built into ansar. It starts the moment that async objects - such as threads, processes and machines - come into existence. Logs are a deliberate by-product of async execution.

This document covers the lifecycle of a log: how logs are generated, their journey through a process and where they may end up. Along the way there are details about their layout and operational considerations - are they thread-safe, how much resource do they consume and how can they be turned off? Focused effort has also delivered efficient, self-maintaining storage of logs and quick access for offline analysis.

Everything Starts At The Object

Consider the complete async application, zombie.py;

import ansar.create as ar

def zombie(self):
    self.console('Do nothing until interrupted')
    self.select(ar.Stop)
    return ar.Ack()

ar.bind(zombie)

if __name__ == '__main__':
    ar.create_object(zombie)

A default execution looks like this;

$ dist/zombie
^C{
    "value": [
        "ansar.create.lifecycle.Ack",
        {},
        []
    ]
}
$

Except for the expected JSON encoding on stdout, and the printed representation of control-c, there is nothing to see - even after an apparent effort to say something. This is because the default policy is to discard all logs. There must be a deliberate request before logs emerge from within the async machinery;

$ dist/zombie --debug-level=CONSOLE
[01318065] 2020-02-17T07:41:05.007 ^ <00000009>zombie - Do nothing until interrupted
^C{
    "value": [
        "ansar.create.lifecycle.Ack",
        {},
        []
    ]
}

Passing the --debug-level=CONSOLE argument has redirected the flow of logs onto stderr. Along with a few familiar and not-so-familiar items, there is at least a connection between the call to console() in the Python code and the text coaxed out of dist/zombie. For immediate explanation of the not-so-familiar items, look here.

There are 7 methods for generating logs, available to every Point-derived object within an application. These are;

Name        Notes
debug()     Uncurated, development information.
trace()     Reviewed, operational information.
sample()    Machine-readable, runtime values.
console()   Application information.
warning()   Unexpected condition.
test()      Formal verification of condition.
fault()     Something is broken.

The sample() function accepts a set of named values, e.g. sample(x=1.0, y=-1.0), and logs a structured presentation of that information.

self.sample(position=GlobalPosition(x, y, z))
self.sample(health=health_sample())
self.sample(api=UserAuthentication(login, secret))
self.sample(x=63.0, y=1.101)

The last call produces;

2019-06-09T19:51:04.661 & <0000000a>tracker - x=63.0:y=1.101

There is a tacit priority to these methods, from the low debug() through to the high fault(). The test() method is different from the others. It checks the condition parameter and, if that evaluates to false, passes the second parameter to warning(); it is the ansar equivalent of an assert.
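
As a hedged illustration of that equivalence - the checked values and message below are invented for this example - the two fragments behave the same way;

self.test(count == expected, 'count does not match expected')

if count != expected:
    self.warning('count does not match expected')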

All of these methods generate a PointLog object and send it to an internal collection point. They all return as soon as the object is queued; there is no return value. All logs generated by a particular async object are collected in the order they were generated. The collection order of logs originating from multiple different objects is non-deterministic.

Logging Done For You

Requesting all the logs available from zombie looks like this;

$ dist/zombie --debug-level=DEBUG
[01320581] 2020-02-17T09:29:09.142 + <00000008>start_vector - Created by <00000001>
[01320581] 2020-02-17T09:29:09.142 ~ <00000008>start_vector - Executable "/home/dennis/gh/some/repo/dist/zombie" as process (1320581)
[01320581] 2020-02-17T09:29:09.142 ~ <00000008>start_vector - Working folder "/home/dennis/gh/some/repo"
[01320581] 2020-02-17T09:29:09.142 ~ <00000008>start_vector - Running object "__main__.zombie"
[01320581] 2020-02-17T09:29:09.142 ~ <00000008>start_vector - Class threads (1) "retries" (1)
[01320581] 2020-02-17T09:29:09.142 + <00000009>zombie - Created by <00000008>
[01320581] 2020-02-17T09:29:09.142 ^ <00000009>zombie - Do nothing until interrupted
^C[01320581] 2020-02-17T09:29:10.946 < <00000008>start_vector - Received Stop from <00000001>
[01320581] 2020-02-17T09:29:10.946 > <00000008>start_vector - Sent Stop to <00000009>
[01320581] 2020-02-17T09:29:10.946 < <00000009>zombie - Received Stop from <00000008>
[01320581] 2020-02-17T09:29:10.946 X <00000009>zombie - Destroyed
[01320581] 2020-02-17T09:29:10.947 < <00000008>start_vector - Received Completed from <00000009>
[01320581] 2020-02-17T09:29:10.947 X <00000008>start_vector - Destroyed
{
    "value": [
        "ansar.create.lifecycle.Ack",
        {},
        []
    ]
}

The async machinery logs its own actions. Object creation/destruction, message sending/receiving and sub-process initiation/termination - all of these activities generate logs. The very first line records the creation of the very first async object - an instance of the start_vector object. This is the object that exists between the host operating system and the application object eventually created by create_object(); it takes care of issues such as signal handling.

Software that exploits the async model of operation tends towards a style of many small objects, sending and receiving messages. Courtesy of the built-in logging this type of software is almost “self-documenting”; when problems arise there is already good information in the logs to support bug hunts.

Application-level logging should use console(). Logging at this level records the operation of a business model, e.g. CRUD operations on a database and RESTful API requests. Logging of lower-level detail about files, devices and sockets should use debug() and trace(). The former is intended for internal use, e.g. it may not receive the QA reviews enjoyed by the latter.
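
A hedged sketch of that layering, using an invented handler and invented messages;

def on_order(self, message):
    # Application level - business activity.
    self.console('Accepted order')
    # Support level - reviewed, operational detail.
    self.trace('Order written to the database')
    # Developer level - uncurated internals.
    self.debug('Raw message %r' % (message,))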

Internal Collection Of Logs

All logs are received at a single async object - an internal part of the async machinery. This runs on its own dedicated thread with its own dedicated message queue. That queue is quite large to give the downstream processing of logs time to catch up with “burst-y” activity.

To preserve the integrity of the log stream, the queue is set to block-on-full, i.e. when the collection point eventually becomes flooded it “pushes back” on the sources. Those sources include every other async object within the host process. Most messaging within ansar is based on fall-on-the-floor behaviour - messages sent to a full queue are quietly discarded. In the world of multi-threading, blocking is both good and bad. Feedback loops are generally good - slowing down the rate of application activity gives the rest of the system more resources for its recovery. Blocking also creates a potential for deadlocks. For this reason all logging in any downstream processing of logs is disabled, i.e. the logging subsystem cannot generate logs about itself.
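
The two behaviours can be pictured with a standard bounded queue. This is only an analogy - the queue size and functions are invented and this is not the actual ansar machinery;

import queue

collection = queue.Queue(maxsize=4096)

def send_log(point_log):
    # Blocks while the queue is full - the "push back" described above.
    collection.put(point_log)

def send_message(message):
    # Ordinary fall-on-the-floor behaviour - quietly discard when full.
    try:
        collection.put_nowait(message)
    except queue.Full:
        pass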

Trash, Stderr Or Rolling Log

There are 3 defined destinations for ansar logging - the trash, stderr or an internal rolling log. A selection is made inside create_object(), based on special command line arguments. An application in tool mode will either toss it all in the trash, or filter it and route it to stderr, as directed by the presence/absence (and value) of a --debug-level argument. An application running in component mode routes all logs to either stderr or the rolling log, depending on whether a run or start command is being used (refer to this).

The rolling log is a self-maintaining FIFO of logs. Processes can generate days, months or years of logs without requiring any attention. It will never create a disk-full condition, months later at some inconvenient moment. This comes at the expense of data loss - a rolling log discards its oldest records as necessary to keep disk consumption below a defined maximum.
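
A minimal sketch of that self-maintenance, assuming the FIFO is a folder of chronologically-named files and a configured byte limit - the names and limit are invented for illustration;

import os

def trim_to_limit(folder, limit_bytes):
    # Oldest-first list of log files; names are assumed to sort chronologically.
    files = sorted(os.path.join(folder, f) for f in os.listdir(folder))
    total = sum(os.path.getsize(f) for f in files)
    # Discard the oldest files until consumption drops below the limit.
    for f in files:
        if total <= limit_bytes:
            break
        total -= os.path.getsize(f)
        os.remove(f)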

Querying for logs for analysis purposes is based around the entry of start and end times. Scanning from the oldest to the latest, the query looks for the first timestamp equal to or greater than the start time. It then prints the logs while also looking for the end time. Several optimizations are used in the initial search for the start time.
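
A hedged sketch of that scan, assuming each line begins with its ISO timestamp (as in ansar log output) and ignoring the optimizations mentioned above;

def query(lines, start, end):
    # Scan oldest to latest for the first timestamp equal to or greater
    # than the start, then print until the end time is passed.
    printing = False
    for line in lines:
        stamp = line.split(None, 1)[0]
        if not printing and stamp >= start:
            printing = True
        if printing:
            if stamp > end:
                break
            print(line, end='')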

There is a small chance that a query will land on a start time at the same moment that the rolling log discards a block of its oldest records, including that landing point. At such a moment, filesystem errors are the most likely result. The query should simply be re-submitted.

Logs Look Like This

Consider the initial output from zombie;

[01320581] 2020-02-17T09:29:09.142 + <00000008>start_vector - Created by <00000001>
[01320581] 2020-02-17T09:29:09.142 ~ <00000008>start_vector - Executable "/home/dennis/gh/some/repo/dist/zombie" as process (1320581)
[01320581] 2020-02-17T09:29:09.142 ~ <00000008>start_vector - Working folder "/home/dennis/gh/some/repo"
[01320581] 2020-02-17T09:29:09.142 ~ <00000008>start_vector - Running object "__main__.zombie"
[01320581] 2020-02-17T09:29:09.142 ~ <00000008>start_vector - Class threads (1) "retries" (1)
[01320581] 2020-02-17T09:29:09.142 + <00000009>zombie - Created by <00000008>

There are 6 columns in the output. Presentation includes visual structuring to assist with long periods of viewing and also to allow for concise machine processing. Up to the object name the columns appear to be fixed-width - technically they should be considered variable. Process IDs are zero-padded to 8 digits but hosts may assign values beyond the associated maximum. The ISO format is fixed at 23 characters; all the sub-fields are zero-padded, do not overflow, and precision is always to the millisecond. Object IDs are serial integers, printed in hex and zero-padded to 8 digits. Both the object name and the text supplied by the caller are variable width.

The columns are;

Name          Index   Type            Notes
Process ID    0       integer         Platform identity, unique within a host.
Time          1       ISO date-time   Instant at the point of origin.
Tag           2       character       Identity, action prompting the log.
Object ID     3       integer         Async identity, unique within a process.
Object Name   4       string          Name of the function or class.
Text          5       string          Free format text, ASCII with backslash replacements.

  • logs from an ansar run include the process ID as multiple processes are expected.

  • logs from an ansar log command do not include a process ID.

  • logs from an ansar log can seamlessly span the activities of multiple consecutive processes.

  • the time from an ansar log may be a local time - denoted by a lowercase ‘t’.

  • process IDs are not guaranteed to be unique over time.

  • object IDs are serial integers printed in hex (lowercase alphas).

  • tags are an enumeration of action and level, refer to following section.

  • object names come from the actual Python function or class.

  • text passed to logging is post-processed to preserve structure of logs.
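
The layout is regular enough for simple machine processing. A hedged sketch of splitting a run-style line (i.e. one that includes the process ID) into its 6 columns - the field names are invented for illustration;

def split_log(line):
    # [process] time tag <id>name - text
    process, time, tag, rest = line.split(None, 3)
    origin, text = rest.split(' - ', 1)
    object_id, object_name = origin[1:].split('>', 1)
    return {
        'process': process.strip('[]'),
        'time': time,
        'tag': tag,
        'object_id': object_id,
        'object_name': object_name,
        'text': text.rstrip('\n'),
    }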

Log Levels And Tags

A single line of logging looks like this;

[01320581] 2020-02-17T09:29:09.142 + <00000008>start_vector - Created by <00000001>

Positioned between the timestamp and an object ID sits a plus sign (i.e. +). This is a logging tag - a single character from a fixed set of possible characters. The tag imparts 2 items of information - it enumerates an action and also indicates a priority, or level;

Tag   Action      Level     Notes
_     Debug       DEBUG     Information for developers
~     Trace       TRACE     Information for technical support
&     Sample      CONSOLE   Record of named values
+     Created     OBJECT    Async information, object created
X     Destroyed   OBJECT    Async information, object destroyed
>     Sent        OBJECT    Async information, object sent a message
<     Received    OBJECT    Async information, object received a message
(     Started     OBJECT    Async information, sub-process started
)     Ended       OBJECT    Async information, sub-process terminated
^     Console     CONSOLE   Activity at application level
?     Warning     WARNING   There is a problem that may compromise operations
=     Check       WARNING   An ansar assert has failed, a warning
!     Fault       FAULT     There is a problem that has or will compromise operations

Selection of a level on commands such as ansar run (e.g. --debug-level=CONSOLE) declares an interest in the information at the specified level and above. Everything at a lower level is discarded. The levels and the associated application methods appear below;

Level     Method
DEBUG     debug()
TRACE     trace(), sample()
OBJECT    none
CONSOLE   console()
WARNING   warning(), test()
FAULT     fault()
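
A minimal sketch of the level-and-above filtering, using an illustrative ordering of the levels - these definitions are not the actual ansar internals;

from enum import IntEnum

class Level(IntEnum):
    # Lowest to highest, following the table above.
    DEBUG = 1
    TRACE = 2
    OBJECT = 3
    CONSOLE = 4
    WARNING = 5
    FAULT = 6

def wanted(selected, level):
    # Keep logs at the selected level and above; discard the rest.
    return level >= selected

# e.g. --debug-level=CONSOLE keeps CONSOLE, WARNING and FAULT.
assert wanted(Level.CONSOLE, Level.FAULT)
assert not wanted(Level.CONSOLE, Level.TRACE)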