Multi-Processing Is More Than Forking
Multi-threading and multi-processing are two techniques for tackling challenges such as concurrency. Within the Python ecosystem there is extra motivation to consider multi-processing, due to the internal architecture of the interpreter (i.e. the GIL).
However, multi-processing comes with obvious and less obvious difficulties. Creation and management of processes (e.g. detection of termination) is certainly different to the creation and management of threads, and any transfer of information now has to cross an inter-process boundary. But even after solving these non-trivial problems there is a much larger one to consider.
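The GIL motivation can be made concrete with the standard multiprocessing module: threads cannot run CPU-bound Python code in parallel, but separate processes can. A minimal sketch (the workload and pool size are illustrative only):

```python
import multiprocessing as mp

def burn(n):
    # CPU-bound work; under the GIL, threads running this cannot
    # execute Python bytecode in parallel, but processes can.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    # Four worker processes, each with its own interpreter and GIL.
    with mp.Pool(4) as pool:
        results = pool.map(burn, [200_000] * 4)
    print(len(results))  # 4 - one result per worker task
```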
Note
Source files appearing in this section can be downloaded from here.
The repo Makefile contains the setup needed for this guide. Related background information can be
found here.
Beyond Forking
Adoption of async programming techniques and the Process object type provide
the means to develop sophisticated multi-processing software: programmatic, parent-process
control over a set of child processes. A custom piece of software creates and manages a specific,
logically associated set of processes.
There are also a significant number of scenarios where a more generic tool would be good enough. A tool that can load a description of the processes needed would satisfy a lot of development - and possibly operational - requirements. Not having to write that custom supervisor process each time is a compelling thought. A single command would start all the processes in the description and a single command would stop them.
A difficulty hidden within this generally good idea is that most substantial applications require runtime resources such as disk space for configuration files, logs and perhaps a database. Network ports are also an issue. These resource requirements are problematic because groups of processes will often include multiple instances of a common executable. Imagine an executable that controls a robotic arm. It would be entirely plausible to include two instances of the executable with different configurations such that the 2 processes behave like left and right arms. The 2 instances need distinct runtime environments. By default, copies of the same executable are not good at sharing.
This is the problem that containers address. Products like Docker use facilities in the underlying operating system (e.g. cgroups) to tackle the same essential problem. Containers can also be seen as a light form of virtualization, a technology delivered by products such as VirtualBox. These products solve a common problem - how to run multiple copies of software that is not otherwise capable of being in the same space.
Processes With Dual Modes
An executable based on create_object() has 2 distinct runtime modes. By default it
runs as a process within the host operating system, on behalf of the current user. In this mode it is
said to be running as a tool or utility. It typically loads its configuration from a file under
the current user’s $HOME folder. Multiple instances of the same executable running for the same user
will load the same configuration.
Passing a few special arguments to that same executable at start time enables the second
runtime mode. Those special arguments include a location and a name, enough information
for create_object() to recover a disk management context. In this mode it is said
to be running as a component. That disk management context contains private disk resources
such as configuration files, database areas and space for logging. The effect is that processes
running as components can happily co-exist in a group. Configuration can be modified on a per-process
basis and each process has its own logs.
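The two modes can be pictured as a single path-resolution decision made at start-up. The sketch below is purely illustrative - resolve_home() and its parameters are hypothetical stand-ins, not the actual create_object() interface:

```python
import os
from pathlib import Path

def resolve_home(role=None, location=None):
    # Component mode: the special start-time arguments (a location and
    # a name) select a private area within the disk management context.
    if role is not None and location is not None:
        return Path(location) / role
    # Tool mode: default to shared, per-user configuration under $HOME.
    return Path(os.path.expanduser("~")) / ".demo-tool"

# Two component-mode instances of the same executable get distinct areas.
left = resolve_home("left-arm", "/tmp/toy-robot")
right = resolve_home("right-arm", "/tmp/toy-robot")
print(left != right)  # True - no contention over configuration or logs
```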
Creation of the contexts and passing the correct, associated arguments to executables is not something that should be attempted manually. This is a space intended for programmatic automation. Manual initiation of processes is intended for “tool” mode.
A nice by-product of this disk management is that there can be many groups of processes on the same host - just at different locations (i.e. folders). It’s even possible to have multiple copies of the same group running on the same host, but in different folders. It should be noted that this does not address the wider issue of global resources. Management of resources such as network ports is outside the scope of this document.
A Tool For Practical Multi-Processing
The generic tool that brings everything together is the ansar command-line utility that
comes with the ansar-create library. This single tool allows for the persistent description
of a group of processes. Entries can be added, the contents of the group can be listed, members
of the group can be updated and entries can be deleted.
Technically those CRUD operations are manipulating descriptive information - there are no
platform processes being created or terminated during those operations. Quite separately the
ansar tool also provides the ability to “start the group” or even a subset of the group.
The current status of the group can be listed (i.e. print a table of currently running processes)
and there is a “stop the group” operation. All processes started by the ansar tool are
running in component mode within a disk management context. Lock files are used to prevent
multiple copies of group processes from running at the same time.
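The lock-file technique can be sketched with an atomic, exclusive file creation. This is a simplified illustration of the general idea, not ansar's actual mechanism:

```python
import os
import tempfile

def acquire_lock(path):
    # O_EXCL makes creation atomic: if the file already exists, another
    # instance holds the lock and this attempt fails cleanly.
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.write(fd, str(os.getpid()).encode())   # record the owner
    os.close(fd)
    return True

lock = os.path.join(tempfile.mkdtemp(), "role.lock")
print(acquire_lock(lock))  # True - the first instance takes the lock
print(acquire_lock(lock))  # False - a second copy is refused
```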
A Quick Tour
A small collection of processes is used to demonstrate the use of the ansar tool. These are;
- noop: does nothing and exits immediately
- snooze: waits for a configured amount of time
- zombie: does nothing until interrupted
- factorial: calculates factorial(n) using recursive processes
- busy: starts a tree of sub-processes
- server: a very basic, sockets-based network server
- client: a very basic, sockets-based network client
- a custom test analysis
An Ansar Command Line
The general layout of an ansar command appears below;
$ ansar [--<ansar-setting>=<value> ..] <sub-command> [--<sub-setting>=<value> ..] [word ..]
Each command involves a sub-command, optional settings and words. Settings appearing before the sub-command are more general and associated with ansar, while those appearing after are associated with the specific sub-command. The optional list of words is also associated with the sub-command.
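The same two-level layout can be sketched with argparse sub-parsers. The sketch reuses setting names that appear in this guide, but the parser itself is purely illustrative, not ansar's implementation:

```python
import argparse

parser = argparse.ArgumentParser(prog="ansar-like")
parser.add_argument("--force", action="store_true")   # a tool-level setting
sub = parser.add_subparsers(dest="sub_command")

create = sub.add_parser("create")
create.add_argument("--redirect-bin")                 # a sub-command setting
create.add_argument("words", nargs="*")               # trailing words

args = parser.parse_args(["--force", "create", "--redirect-bin=dist", "extra"])
print(args.sub_command, args.force, args.redirect_bin, args.words)
# create True dist ['extra']
```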
A Sub-Command Example
Consider the following description of the create sub-command;
$ ansar create [<home-path>] [--redirect-<name>=<path> ..]
The sub-command creates a home. It accepts an optional home-path as the location of the new home and an optional list of redirection settings. The home is subsequently configured with a list of processes to execute, and provides areas where each of those processes can store operational materials, e.g. logs.
$ ansar create --redirect-bin=dist
$ ls -la
total 220
drwxrwxr-x 10 dennis dennis 4096 Mar 25 04:02 .
drwxrwxr-x 12 dennis dennis 4096 Mar 21 14:32 ..
drwxrwxr-x 10 dennis dennis 4096 Mar 25 04:02 .ansar-home
drwxrwxr-x 2 dennis dennis 4096 Mar 24 16:34 dist
-rw-rw-r-- 1 dennis dennis 1420 Mar 15 13:52 noop.py
..
In the absence of an explicit home-path, the path for this home has defaulted to .ansar-home in the current folder. A
redirection of bin to the ./dist folder (i.e. the default output folder for pyinstaller executables) has been recorded
within the new home. All executables named in subsequent ansar commands will be expected to exist in ./dist.
Note
Without the redirection of the bin folder, executables are expected to exist in a dedicated home sub-folder. In
these scenarios executables are transferred to that dedicated folder using the deploy sub-command. Refer to
later sections for further details.
Basic Behaviour
The commands used to populate a home with process definitions and create an operational set of those processes are covered in the following sections.
Add A Process And Run It
$ ansar add noop
$ ansar list
noop-0
$ ansar run --debug-level=CONSOLE
19:40:25.847 ^ <00000009>noop - Log this and exit
{
"value": [
"ansar.command.ansar_command.Run",
{
"completed": [
[
"noop-0",
[
"ansar.create.lifecycle.Ack",
{},
[]
]
]
],
"home": ".ansar-home"
},
[]
]
}
$
The add sub-command is used to add an instance of an executable to the default home. Technically, it adds a
description of an instance, i.e. there are no new platform processes created by this command. The current set of
descriptions are listed to confirm the new entry and then the entire list (i.e. the single instance of noop) is
executed, using the run sub-command.
The requested logging (--debug-level=CONSOLE) is placed on stderr and the output from the run
command is placed on stdout. The noop process logged its efforts and returned an Ack
object to the run command. A full transcript of the console-level logging is included (minus the process id and
full timestamps) for this first example; subsequent examples will omit logging that is not relevant to the
demonstration.
Each instance of an executable is known by a role - a short name that describes the part the instance plays within the collection. In the above command both the role and the home have assumed default values. The role defaults to the name of the executable with a small suffix appended to it; the reasons for this behaviour will become clear in later sections.
A more explicit use of the add command looks like this:
$ ansar add robot-arm left-arm toy-robot --rotation=-90.0
This command says to add the left-arm instance of the robot-arm executable to the toy-robot home. The
rotation setting for the new instance is initialized to -90.0.
Add A Process With Persistent Settings
$ ansar add snooze
$ ansar list
noop-0
snooze-0
$ ansar run --debug-level=DEBUG
01:33:10.063 + <00000007>lock_and_hold - Created by <00000001>
01:33:10.064 > <00000007>lock_and_hold - Sent Ready to <00000001>
01:33:10.064 + <00000007>lock_and_hold - Created by <00000001>
01:33:10.064 + <00000008>start_vector - Created by <00000001>
01:33:10.064 > <00000007>lock_and_hold - Sent Ready to <00000001>
01:33:10.064 ~ <00000008>start_vector - .. "../dist/snooze" ..
01:33:10.064 ~ <00000008>start_vector - Working folder ..
01:33:10.064 ~ <00000008>start_vector - .. "__main__.snooze"
01:33:10.064 ~ <00000008>start_vector - Class threads ..
01:33:10.064 + <00000009>snooze - Created by <00000008>
01:33:10.065 ^ <00000009>snooze - Do nothing for 2.0 seconds
01:33:10.065 > <00000009>snooze - Sent StartTimer to <00000003>
01:33:10.065 + <00000008>start_vector - Created by <00000001>
01:33:10.065 ~ <00000008>start_vector - .. "../dist/noop" ..
01:33:10.065 ~ <00000008>start_vector - Working folder ..
01:33:10.065 ~ <00000008>start_vector - .. "__main__.noop"
01:33:10.065 ~ <00000008>start_vector - Class threads ..
01:33:10.065 + <00000009>noop - Created by <00000008>
01:33:10.065 ^ <00000009>noop - Log this and exit
01:33:10.065 X <00000009>noop - Destroyed
01:33:10.065 < <00000008>start_vector - Received Completed ..
01:33:10.065 X <00000008>start_vector - Destroyed
01:33:10.165 < <00000007>lock_and_hold - Received Stop ..
01:33:10.170 X <00000007>lock_and_hold - Destroyed
01:33:12.067 < <00000009>snooze - Received T1 from <00000003>
01:33:12.067 X <00000009>snooze - Destroyed
01:33:12.067 < <00000008>start_vector - Received Completed ..
01:33:12.067 X <00000008>start_vector - Destroyed
01:33:12.068 < <00000007>lock_and_hold - Received Stop ..
01:33:12.071 X <00000007>lock_and_hold - Destroyed
$
A second role is created with the snooze executable. The home now has 2 entries. The run command - by
default - starts all the processes described in the collection and waits for them all to complete. With the
addition of a snooze, that completion now takes a few seconds. Logging still shows the immediate exit of
the noop command. Both start at around the 10.064 mark and the noop process terminates a hundredth of a second
later. The timer for snooze is not received until 12.067, about 2 seconds after the object was created.
$ ansar update snooze-0 --seconds=5.0
$ ansar run --debug-level=CONSOLE
19:49:35.903 ^ <00000009>snooze - Do nothing for 5.0 seconds
19:49:35.903 ^ <00000009>noop - Log this and exit
..
The seconds setting for the snooze-0 instance is assigned a longer value. Another run shows the snooze
command behaving accordingly.
Add A Process That Never Wants To Terminate
$ ansar add zombie
$ ansar list
noop-0
snooze-0
zombie-0
$ ansar run --debug-level=CONSOLE
19:53:30.562 ^ <00000009>snooze - Do nothing for 5.0 seconds
19:53:30.563 ^ <00000009>noop - Log this and exit
19:53:30.563 ^ <00000009>zombie - Do nothing until interrupted
^C{
..
$
Adding an instance of zombie changes an essential behaviour of the collection; it no longer self-terminates.
Both noop and snooze eventually terminate if given enough time, but user intervention is required to
terminate zombie and the run inherits that requirement.
Note
Control-c is the standard command-line mechanism for terminating long-running processes.
In a standard async process, a control-c is caught by create_object() and converted
to a Stop message. In response, every async process is expected to terminate
gracefully. A control-c is caught by the ansar command and propagated to all its children,
resulting in the shutdown of the run. The action also injects a ^C into the
terminal output, disrupting the logging. Logs redirected to a file will not include
that disruption.
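The conversion of a control-c into a graceful stop can be sketched with the standard signal module; the Event below stands in for the Stop message, and the whole arrangement is an illustration of the pattern rather than ansar's code:

```python
import signal
import threading

stop = threading.Event()

def on_interrupt(signum, frame):
    # Translate the control-c into a cooperative stop request rather
    # than letting KeyboardInterrupt tear the process down.
    stop.set()

signal.signal(signal.SIGINT, on_interrupt)

def worker():
    # A long-running loop that checks for the stop request between
    # slices of work and terminates gracefully.
    while not stop.wait(timeout=0.05):
        pass

t = threading.Thread(target=worker)
t.start()
signal.raise_signal(signal.SIGINT)   # simulate the user's control-c
t.join(timeout=2.0)
print(stop.is_set())  # True - the worker was asked to stop, and did
```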
$ ansar delete zombie-0
$ ansar list
noop-0
snooze-0
$
A delete command is used to remove the zombie-0 role from the home. This both
demonstrates the command and restores a more convenient behaviour for the purposes of
this tour.
Add A Process That Expects Input
$ ansar add factorial
$ ansar run --debug-level=CONSOLE
[00438217] 2023-04-06T01:52:17.317 ^ <00000009>snooze - Do nothing ..
[00438216] 2023-04-06T01:52:17.317 ^ <00000009>fact - factorial(5)
[00438218] 2023-04-06T01:52:17.317 ^ <00000009>noop - Log this and exit
[00438245] 2023-04-06T01:52:17.418 ^ <00000008>fact - factorial(4)
[00438255] 2023-04-06T01:52:17.520 ^ <00000008>fact - factorial(3)
[00438265] 2023-04-06T01:52:17.621 ^ <00000008>fact - factorial(2)
[00438275] 2023-04-06T01:52:17.722 ^ <00000008>fact - factorial(1)
[00438285] 2023-04-06T01:52:17.823 ^ <00000008>fact - factorial(0)
{
"value": [
"ansar.command.ansar_command.Run",
{
"completed": [
..
[
"factorial-0",
[
"lib.factorial_if.FactorialReturned",
{
"value": 120
},
[]
]
],
..
],
"home": ".ansar-home"
},
[]
]
}
$
An instance of factorial is added. This is a different executable in that it’s the first demonstration
executable to create sub-processes. It uses the ansar ability to “call” a process as if it were a function,
as a basis for a recursive implementation of the factorial function. The chain of processes can be seen in
the logs - note the [00438216] process id on the <00000009>fact - factorial(5) log and how that id and log
change as the chain extends.
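The call-a-process-as-if-it-were-a-function idea can be imitated with the standard multiprocessing module. This sketch is not the ansar implementation, just the same recursive shape - each level delegates factorial(n - 1) to a child process and waits for the answer:

```python
import multiprocessing as mp

def fact_worker(n, conn):
    if n <= 1:
        conn.send(1)          # base case ends the chain of processes
        conn.close()
        return
    # Delegate the sub-problem to a child process and wait on a pipe,
    # treating the process like a function call.
    parent, child = mp.Pipe()
    p = mp.Process(target=fact_worker, args=(n - 1, child))
    p.start()
    sub = parent.recv()
    p.join()
    conn.send(n * sub)
    conn.close()

def factorial(n):
    parent, child = mp.Pipe()
    p = mp.Process(target=fact_worker, args=(n, child))
    p.start()
    result = parent.recv()
    p.join()
    return result

if __name__ == "__main__":
    print(factorial(5))  # 120, computed across a chain of processes
```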
$ ansar input factorial-0
{
"value": 5
}
$
The input command can be used to view the initial input for the named role. This is a default encoding
created during the add factorial command.
$ cat factorial-7
{
"value": 7
}
$ ansar input factorial-0 --set-file=factorial-7
$ ansar run
{
"value": [
"ansar.command.ansar_command.Run",
{
"completed": [
..
[
"factorial-0",
[
"lib.factorial_if.FactorialReturned",
{
"value": 5040
},
[]
]
],
..
],
"home": ".ansar-home"
},
[]
]
}
$
The input command can also be used to modify the initial input for the named role. Use the --set-file
parameter to store new initial input for a role.
Redefining The Settings For A Process
$ cat short-snooze
{
"value": {
"seconds": 1.0
}
}
$ ansar settings snooze-0 --set-file=short-snooze
$ ansar run --debug-level=CONSOLE
14:45:55.859 ^ <00000009>noop - Log this and exit
14:45:55.859 ^ <00000009>factorial - factorial(5)
14:45:55.859 ^ <00000009>snooze - Do nothing for 1.0 seconds
14:45:55.961 ^ <00000008>factorial - factorial(4)
14:45:56.062 ^ <00000008>factorial - factorial(3)
14:45:56.162 ^ <00000008>factorial - factorial(2)
14:45:56.263 ^ <00000008>factorial - factorial(1)
14:45:56.363 ^ <00000008>factorial - factorial(0)
{
"value": [
"ansar.command.ansar_command.Run",
{
"completed": [
..
[
"snooze-0",
[
"ansar.create.lifecycle.Ack",
{},
[]
]
],
..
],
"home": ".ansar-home"
},
[]
]
}
$
Persistent settings associated with processes can be modified using the update command or
the settings command. By accepting complete encodings the settings command provides for the
full expression of ansar encodings, specifically including graphs.
Adding Some Workload
$ ansar add busy
$ cat busy-input
{
"value": {
"duties": [
"noop",
"snooze",
"factorial"
],
"management_levels": 5,
"managers": 3
}
}
$ ansar input busy-0 --set-file=busy-input
$ ansar run --debug-level=CONSOLE
22:46:31.177 ^ <00000009>snooze - Do nothing for 2.0 seconds
22:46:31.177 ^ <00000009>factorial - factorial(5)
22:46:31.178 ^ <00000009>noop - Log this and exit
22:46:31.295 ^ <00000008>noop - Log this and exit
22:46:31.295 ^ <00000008>factorial - factorial(4)
22:46:31.295 ^ <00000008>snooze - Do nothing for 2.0 seconds
22:46:31.296 ^ <00000008>factorial - factorial(5)
22:46:31.490 ^ <00000008>snooze - Do nothing for 2.0 seconds
22:46:31.499 ^ <00000008>factorial - factorial(4)
..
22:46:46.327 ^ <00000008>factorial - factorial(0)
22:46:46.339 ^ <00000008>factorial - factorial(0)
22:46:46.423 ^ <00000008>factorial - factorial(0)
{
"value": [
"ansar.command.ansar_command.Run",
{
"completed": [
..
[
"factorial-0",
[
"lib.factorial_if.FactorialReturned",
{
"value": 120
},
[]
]
],
[
"busy-0",
[
"lib.job_if.JobReturned",
{
"processes": 484
},
[]
]
]
],
"home": ".ansar-home"
},
[]
]
}
$
Recursion is again used to create a tree of busy processes that has a defined number of
levels (i.e. management_levels) and a defined number of branches (i.e. managers).
The run command results in the creation of 484 processes plus those entries previously added
alongside the busy-0 entry.
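One plausible accounting for the 484, assuming every busy node in the tree launches the three configured duties as child processes alongside its managers:

```python
# With management_levels=5 and managers=3, the busy tree has
# 1 + 3 + 9 + 27 + 81 = 121 nodes. If every node also launches the
# three duties (noop, snooze, factorial), the totals work out as:
levels, managers, duties = 5, 3, 3
nodes = sum(managers ** i for i in range(levels))   # 121 busy processes
total = nodes * (1 + duties)                        # each node plus its duties
print(total)  # 484
```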
Note
Interruption of complex and dynamic collections of processes is likely to catch some
processes in the early stage of their lives. A control-c can interrupt the Python interpreter
as it is performing import operations, long before the __name__ == "__main__" test has
even been reached. The consequence is that signals may be processed by the default handlers
inside the interpreter; tracebacks will appear on stderr. The Process machine
catches this event and converts it into an Aborted message, which is returned
to the parent async object, preserving operational integrity.
More Advanced Use
Use of the run command results in immediate feedback. Logging from all the related sub-processes
is placed on stderr for viewing, or saved to a file for off-line analysis. These are valuable
development procedures.
The start command is similar to run except that after starting the related set of processes,
control is immediately returned to the command-line, leaving the processes to continue in the background.
Logging no longer appears on stderr but is instead appended to a per-process storage area. Further
ansar commands provide access to those logs, as well as administration of the background processes.
Starting Processes In The Background
$ ansar update snooze-0 --seconds=10.0
$ ansar add zombie
$ ansar list
factorial-0
noop-0
snooze-0
zombie-0
$ ansar start
$ ansar status
snooze-0
zombie-0
Logging no longer appears on the terminal. The status command shows which roles within a collection are
currently running. In this example, the use of status must have followed the start quickly enough
to catch snooze-0 before it self-terminated.
$ ansar start
ansar: cannot perform "start", "(all)" currently running as - 589299
$ ansar status -l
zombie-0 <589299> 2m25.6s
$ ansar stop
ansar: cannot perform "stop", "(all)" not currently running - busy-0, factorial-0, noop-0, snooze-0
$ ansar stop zombie-0
$ ansar start
$
Attempts to run multiple instances of a role are detected and reported. In this case it’s the zombie-0 role,
verified by the matching process IDs in the error message and the (long) status output.
Commands that involve a role - e.g. run, start, stop and others - accept a role-search as a
parameter. Omitting the parameter is assumed to mean “match everything”. Where the command encounters any
form of mismatch between the intentions of the command and the current set of processes, it terminates with
an error message. In the above case, the intentions of the command were to run a new set of all the processes
in the group. Instead, it detected an operational member of the group and terminated. The --force ansar flag
can be used to override that cautionary behaviour. This would cause the command to kill the operational instance
of zombie-0 before creating the new set, including the new instance of zombie-0.
Reviewing Background Activity
$ ansar log snooze-0
$ ansar log snooze-0 --last=WEEK
00:41:03.329 + <00000007>lock_and_hold - Created by <00000001>
00:41:03.329 > <00000007>lock_and_hold - Sent Ready to <00000001>
00:41:03.330 + <00000008>start_vector - Created by <00000001>
00:41:03.330 ~ <00000008>start_vector - Executable "/home/brad/somewhere/dist/snooze" as process (369676)
00:41:03.330 ~ <00000008>start_vector - Working folder "/"
00:41:03.330 ~ <00000008>start_vector - Running object "__main__.snooze"
00:41:03.330 ~ <00000008>start_vector - Class threads (1) "retries" (1)
00:41:03.330 + <00000009>snooze - Created by <00000008>
00:41:03.330 ^ <00000009>snooze - Do nothing for 2.0 seconds
00:41:03.330 > <00000009>snooze - Sent StartTimer to <00000003>
00:41:05.332 < <00000009>snooze - Received T1 from <00000003>
00:41:05.332 X <00000009>snooze - Destroyed
00:41:05.332 < <00000008>start_vector - Received Completed from <00000009>
00:41:05.332 X <00000008>start_vector - Destroyed
00:41:05.333 < <00000007>lock_and_hold - Received Stop from <00000001>
00:41:05.336 X <00000007>lock_and_hold - Destroyed
Logs produced by foreground processes (i.e. using run) are presented on stderr and then - without
deliberate action - lost. Logs produced by background processes are directed into persistent storage and
subsequently recovered with the ansar log command.
The first log command above fails as the default behaviour is to query for logs generated within the
last 5 minutes; snooze-0 has been idle since it terminated at 00:41:05.332. The second command uses one
of the log parameters to extend the query to the start of the current week - 12:00am on Monday. This matches
everything from that moment onward and the first entry happens to be at 00:41:03.329. Note that the time
values on logs are in full ISO 8601 format but appear truncated here for brevity.
As well as WEEK there are also MONTH, DAY, HOUR, MINUTE, HALF, QUARTER, TEN
and FIVE enumerations where HALF and QUARTER refer to portions of an hour and TEN and FIVE
refer to numbers of minutes. In all cases, the enumeration describes the start of a fixed time period rather
than a time span, e.g. --last=HOUR will list logs starting at the most recent hourly mark. To look
back 60 minutes use --back=1h.
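The fixed-period behaviour amounts to flooring the current time to a boundary. A sketch of that calculation for a few of the enumerations (the function is illustrative, not part of ansar):

```python
from datetime import datetime, timedelta

def start_of(period, now):
    # Floor `now` to the start of the named fixed period, the way
    # --last=HOUR selects the most recent hourly mark.
    if period == "HOUR":
        return now.replace(minute=0, second=0, microsecond=0)
    if period == "DAY":
        return now.replace(hour=0, minute=0, second=0, microsecond=0)
    if period == "WEEK":
        monday = now - timedelta(days=now.weekday())   # back to Monday
        return monday.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(period)

now = datetime(2023, 4, 6, 1, 52, 17)        # a Thursday
print(start_of("WEEK", now))  # 2023-04-03 00:00:00 - 12:00am Monday
```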
The ansar log command also accepts the following parameters;
- use local time for both input and output
- start in ISO time format
- start as index into start-stop history (--start)
- start as negative offset from current time (--back)
- end in ISO time format
- positive offset from evaluated start
- end as a number of logs (--count)
$ ansar log snooze-0 --last=MONTH --count=20
This command will list the first 20 logs generated by snooze-0, since the first of the current month.
Log storage is self-maintaining. A FIFO approach ensures that when the total storage of logs reaches a
configured maximum, the arrival of further logs causes the deletion of the oldest.
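The FIFO policy can be sketched with a deque and a running byte total. A real implementation works over files on disk, but the eviction rule is the same:

```python
from collections import deque

class CappedLog:
    """Keep appended entries until a byte cap is hit, then drop the oldest."""
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.entries = deque()
        self.size = 0

    def append(self, line):
        self.entries.append(line)
        self.size += len(line)
        # FIFO: evict from the front until back under the cap.
        while self.size > self.max_bytes:
            self.size -= len(self.entries.popleft())

log = CappedLog(10)
for line in [b"aaaa", b"bbbb", b"cccc"]:
    log.append(line)
print(list(log.entries))  # [b'bbbb', b'cccc'] - the oldest entry was dropped
```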
$ ansar status -l
zombie-0 <377524> 12.7s
For a detailed view of operational processes use the --long-listing parameter (or the -l shorthand
flag). This view includes the process ID and the time since the process was created.
Behaviour Of Background Processes Over Time
$ ansar history zombie-0
[0] 7h30m ago ... 1m5.4s (Aborted)
[1] 4m44.6s ago ... 4m34.3s (Aborted)
A role is a name for an instance of an executable which may be started and stopped many times within the
lifetime of a home. Logs are seamless with respect to these ups and downs though it is fairly easy to
infer the boundaries from the contents of individual logs. Ansar also keeps an explicit record of when
processes are started, when they are stopped and the value returned by the main object. Use the history
command to print a table of start times and run durations. The printed indexes can be used in the start
parameter in the log command to select logs from the start of a particular execution.
$ ansar returned zombie-0 --start=0
{
"value": [
"ansar.create.lifecycle.Aborted",
{},
[]
]
}
Those same indexes can also be used in the returned command to select which return value to
print. If the command is used to access the results of the latest execution (e.g. no --start
is specified) and that execution has not yet completed, the command will wait until the information
is available.
Development Automation
Combining the nature of create_object()-based applications with the ansar command-line tool goes some way
towards the engineering of multi-processing solutions. This section considers the potential to streamline the standard
edit-build-test-debug loop, relieving developers of as many repetitive, error-prone responsibilities as possible,
within that multi-processing context.
A huge array of tools are available in this space, especially if cloud deployment is the end goal. The arrangements of tools and procedures suggested here are deliberately simple in the hope that any potential to integrate with your own development toolsets is as clear as possible.
The Standard Loop With Multi-Processing
Multi-processing complicates the standard development loop. Source code changes may occur across one executable or many. Test runs require the presence of multiple distinct processes that properly represent the current codebase: which source files have changed? Which executables need building? Do executables need to be copied from the build areas? Are there running processes that need to be replaced? How are unit tests executed and collated across multiple processes? And what about those processes that need supporting data files?
Defining The Set Of Processes
This part of the tour involves a new home;
$ ansar -f destroy
$ ansar create --redirect-bin=dist
$ ansar add server
$ ansar add client
$ ansar list
client-0
server-0
The destroy command is used to delete the default home. Passing the -f flag ensures that any process
associated with the old home is properly terminated. The create command prepares the new home folder for the
subsequent add commands. A redirect is again used to link the new home with a build folder. Instances
of the server and client executables are added and assume the default names, server-0 and client-0.
In this composition of processes the server is the component under development. The client is a test
client - it connects to the server, submits requests and expects responses. A standard ansar method - available
to every async object - is used to verify the details of each request-response pair. This creates a sequence
of pass/fail records that go on to form the basis of a test report.
The server implements a very basic word mapping service. Words are sent across a connection and mapped inside
the server to a stored alternative. The mapped alternative is sent back across the connection as a response. If
no entry is found for a submitted word, the same word is echoed back to the client. Such a service might form
the basis of a “hint” facility for a spelling checker.
Note
Implementation of networking within the server and client components is for demonstration purposes
only. There are several reasons why the approach should not be used for production quality software
including scalability, the use of blocking sockets and the lack of message encoding/decoding.
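The service's behaviour reduces to a dictionary lookup with an echo fallback. A toy version over a local socket appears below; the word map is invented here for illustration and may differ from the real demo's contents:

```python
import socket
import threading

# Hypothetical word map; any unmapped word is echoed back unchanged.
HINTS = {b"eager": b"fervent", b"explain": b"define", b"fly": b"droll"}

def serve(listener, requests):
    # Handle a fixed number of single-word requests, then exit.
    for _ in range(requests):
        conn, _ = listener.accept()
        with conn:
            word = conn.recv(1024)
            conn.sendall(HINTS.get(word, word))

def lookup(addr, word):
    with socket.create_connection(addr) as s:
        s.sendall(word)
        return s.recv(1024)

listener = socket.socket()
listener.bind(("127.0.0.1", 0))            # let the OS pick a free port
listener.listen()
addr = listener.getsockname()
t = threading.Thread(target=serve, args=(listener, 2))
t.start()
print(lookup(addr, b"eager"))   # b'fervent' - a mapped word
print(lookup(addr, b"zebra"))   # b'zebra'   - unmapped words echo back
t.join()
listener.close()
```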
When Clients Are Started Before Servers
An initial run of the new set of processes hits a bump. All the processes are effectively started at the same time, and starting a client before the server has had a chance to establish itself inevitably leads to problems;
$ ansar run --debug-level=DEBUG
07:29:26.188 + <00000007>start_vector - Created by <00000001>
07:29:26.188 ~ <00000007>start_vector - Executable "/home/dennis/gh/multi-processing-is-more-than-forking/.dev/bin/ansar" as process (698548)
07:29:26.188 ~ <00000007>start_vector - Working folder "/home/dennis/gh/multi-processing-is-more-than-forking"
07:29:26.188 ~ <00000007>start_vector - Running object "ansar.command.ansar_command.ansar"
07:29:26.188 ~ <00000007>start_vector - Class threads (1) "retries" (1)
07:29:26.189 + <00000008>ansar - Created by <00000007>
07:29:26.189 ~ <00000008>ansar - Call the sub-command function
07:29:26.189 ^ <00000008>ansar - Detect status of associated roles (server-0, client-0, zombie-0)
07:29:26.189 + <00000009>lock_and_hold - Created by <00000008>
..
07:29:26.304 ~ <00000008>start_vector - Running object "__main__.zombie"
07:29:26.304 ~ <00000008>start_vector - Class threads (1) "retries" (1)
07:29:26.304 ? <00000009>client - Session error - "[Errno 111] Connection refused"
07:29:26.304 X <00000009>client - Destroyed
07:29:26.304 < <00000008>start_vector - Received Completed from <00000009>
A connection has been attempted by the client and “refused”.
Without some kind of orchestration of activity at the networking level there is no way to advise the client of the appropriate moment to initiate connection. Ansar provides an alternative solution. Any process that is part of a home can be configured with a retry strategy.
The effect is that client-0 is performed repeatedly until a goal is reached or the strategy is exhausted.
In this case the goal is to establish a valid connection. All that the client needs to do is return certain
values that either keep the retries active or cause termination.
Please Repeat That
A single ansar command sets up the handling of connection failures;
$ cat client-retry
{
"value": {
"first_steps": [1.0, 2.0, 4.0]
}
}
$ ansar set retry client-0 --encoding-file=client-retry
$ ansar run --debug-level=DEBUG
05:32:15.384 + <00000007>lock_and_hold - Created by <00000001>
..
05:32:15.384 + <00000008>start_vector - Created by <00000001>
05:32:15.385 ~ <00000008>start_vector - Executable "/home/dennis/gh/multi-processing-is-more-than-forking/dist/server" as process (729010)
05:32:15.385 ~ <00000008>start_vector - Executable "/home/dennis/gh/multi-processing-is-more-than-forking/dist/client" as process (729009)
05:32:15.385 ~ <00000008>start_vector - Working folder "/home/dennis/gh/multi-processing-is-more-than-forking"
05:32:15.385 ~ <00000008>start_vector - Working folder "/home/dennis/gh/multi-processing-is-more-than-forking"
05:32:15.385 ~ <00000008>start_vector - Running object "__main__.client"
05:32:15.385 ~ <00000008>start_vector - Class threads (1) "retries" (1)
05:32:15.385 ~ <00000008>start_vector - Running object "__main__.server"
05:32:15.385 + <00000009>Retry[INITIAL] - Created by <00000008>
05:32:15.385 ~ <00000008>start_vector - Class threads (1) "retries" (1)
05:32:15.385 < <00000009>Retry[INITIAL] - Received Start from <00000008>
05:32:15.385 + <00000009>server - Created by <00000008>
05:32:15.385 + <0000000a>client - Created by <00000009>
05:32:15.385 + <0000000a>listen - Created by <00000009>
05:32:15.385 ? <0000000a>client - Session error - "[Errno 111] Connection refused"
05:32:15.385 X <0000000a>client - Destroyed
05:32:15.385 < <00000009>Retry[ATTEMPTING] - Received Completed from <0000000a>
05:32:15.385 ^ <00000009>Retry[ATTEMPTING] - Pausing for 1.000000 seconds
05:32:15.385 > <00000009>Retry[ATTEMPTING] - Sent StartTimer to <00000003>
05:32:16.385 < <00000009>Retry[PAUSING] - Received T1 from <00000003>
05:32:16.386 + <0000000b>client - Created by <00000009>
05:32:16.387 ^ <0000000b>client - Connected to ('127.0.0.1', 65432)
05:32:16.387 + <0000000b>accepted - Created by <0000000a>
05:32:16.387 ^ <0000000b>accepted - Accepted on ('127.0.0.1', 52278)
05:32:16.394 = <0000000b>client - Expected b'fervent' for b'eager', got b'eager' (client.py:51)
05:32:16.394 = <0000000b>client - Expected b'define' for b'explain', got b'explain' (client.py:56)
05:32:16.395 = <0000000b>client - Expected b'droll' for b'fly', got b'fly' (client.py:61)
05:32:16.395 X <0000000b>accepted - Destroyed
05:32:16.395 X <0000000b>client - Destroyed
05:32:16.395 < <00000009>Retry[ATTEMPTING] - Received Completed from <0000000b>
05:32:16.395 X <00000009>Retry[ATTEMPTING] - Destroyed
05:32:16.395 < <00000008>start_vector - Received Completed from <00000009>
05:32:16.395 X <00000008>start_vector - Destroyed
..
The set command is used to update a small set of properties associated with each home entry and one such
property is retry. Setting this value activates retries inside the create_object() function. Running
the application object (e.g. client()) is subsequently considered to be an attempt and the value returned by
each attempt influences what happens next;
| Returned value | Effect |
|---|---|
| Maybe | not successful, try again later |
| | abandon, bad request or environmental problem |
| * | any other message indicates success |
Where a repeat attempt is indicated (Maybe) the retry machinery consults the retry property for a time
delay and uses the T1 timer to impose the “down time”. In the given example the delay is 1.0s - the first
value from the first_steps list. The retry property provides the following values;
| Property | Value |
|---|---|
| first_steps | list of float, the initial time delays |
| regular_steps | float, repeating delay |
| step_limit | int, maximum number of delays |
| randomized | float, time slices for backoff |
| truncation | float, reduce the scale of backoff |
The first 3 values can be used to describe a series of float values while the latter 2 enable a secondary adjustment of those values with the goal of avoiding the “everyone retrying at the same moment” phenomenon.
| first_steps | regular_steps | step_limit | randomized | truncation | sequence |
|---|---|---|---|---|---|
| [1.0, 2.0, 4.0] | None | None | None | None | [1.0, 2.0, 4.0] |
| [] | 1.0 | None | None | None | [1.0, 1.0, 1.0, 1.0 …] |
| [] | 1.0 | 4 | None | None | [1.0, 1.0, 1.0, 1.0] |
| [1.0, 2.0, 4.0] | 8.0 | 6 | None | None | [1.0, 2.0, 4.0, 8.0, 8.0, 8.0] |
| [1.0, 2.0, 4.0] | 8.0 | 6 | 0.25 | 0.5 | [1.25, 2.5, 6.0, 11.0, 9.5, 10.5] |
Any set of values involving a non-None regular_steps and a None step_limit describes an endless sequence.
Combining first_steps with a value for randomized can produce a form of exponential backoff. The latter value
is used to slice up the latest time delay into available slots and one of those slots is selected randomly. The truncation
value reduces the portion of the time delay that is available for slicing, e.g. a value of 0.25 limits the
potential adjustment to a quarter. The adjustment is additive.
It is the combination of the retry property and the conditions met by each attempt that determines the final behaviour of the process.
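The sequence construction described above can be sketched in a few lines of Python. This is an illustrative reconstruction inferred from the table and the prose, not the actual ansar machinery; only the field names (first_steps, regular_steps, step_limit, randomized, truncation) come from the retry property itself.

```python
import random

def retry_intervals(first_steps, regular_steps=None, step_limit=None,
                    randomized=None, truncation=None):
    """Yield the sequence of retry delays described by a retry property.

    An illustrative reconstruction of the documented behaviour, not
    the actual ansar implementation.
    """
    def base():
        # The initial delays, then the optional repeating delay, the
        # whole series capped by the optional step_limit.
        produced = 0
        for d in first_steps:
            if step_limit is not None and produced >= step_limit:
                return
            produced += 1
            yield d
        while regular_steps is not None:
            if step_limit is not None and produced >= step_limit:
                return
            produced += 1
            yield regular_steps

    for d in base():
        if randomized is not None:
            # Slice a portion of the delay into slots of `randomized`
            # seconds and add a randomly chosen slot (additive jitter).
            available = d * truncation if truncation is not None else d
            slots = int(available / randomized)
            d += random.randint(0, slots) * randomized
        yield d
```

For example, list(retry_intervals([1.0, 2.0, 4.0], 8.0, 6)) reproduces the fourth row of the table; adding randomized=0.25 and truncation=0.5 produces jittered values of the kind shown in the final row. Endless sequences (a regular_steps with no step_limit) need to be consumed with something like itertools.islice.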
Unit Tests In Multi-Process Solutions
Tests are implemented using the test() method (see client.py);
word, expect = b'eager', b'fervent'
s.sendall(word)
reply = s.recv(1024)
self.test(reply == expect, f"Expected {expect} for {word}, got {reply}")
A word is sent over a socket and the response is compared against an expected value. This fragment of code meets all the requirements of an ansar test.
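Stripped of the ansar wrapping, the exchange reduces to a plain request/reply over TCP. The sketch below is self-contained; the word pairs and default address are taken from the logs in this guide, and failures are returned as strings rather than passed to test().

```python
import socket

# Word pairs taken from the logs in this guide; the server is expected
# to map the left-hand word to the right-hand word.
PAIRS = [(b'eager', b'fervent'), (b'explain', b'define'), (b'fly', b'droll')]

def run_checks(host='127.0.0.1', port=65432):
    """Send each word and compare the reply against the expected mapping."""
    failures = []
    with socket.create_connection((host, port)) as s:
        for word, expect in PAIRS:
            s.sendall(word)          # send the word to the server
            reply = s.recv(1024)     # read the mapped reply
            if reply != expect:
                failures.append(f"Expected {expect} for {word}, got {reply}")
    return failures
```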
Output occurs in two ways. Firstly, all failed tests (i.e. where the conditional evaluates to false) generate a log at
the WARNING level. Secondly, all test results are collected in a background async object. Test applications such
as the client request that information at the end of an execution and return the results in the form of a TestReport.
ar.test_enquiry(self)
report = self.select(ar.TestReport)
return report
A standalone execution of the client demonstrates this activity;
$ ansar list
client-0
server-0
$ ansar status
$ ansar start server-0
$ dist/client --debug-level=OBJECT
19:35:35.299 + <00000008>start_vector - Created by <00000001>
19:35:35.300 + <00000009>client - Created by <00000008>
19:35:35.300 ^ <00000009>client - Connected to ('127.0.0.1', 65432)
19:35:35.303 = <00000009>client - Expected b'fervent' for b'eager', got b'eager' (client.py:51)
19:35:35.303 = <00000009>client - Expected b'define' for b'explain', got b'explain' (client.py:56)
19:35:35.303 = <00000009>client - Expected b'droll' for b'fly', got b'fly' (client.py:61)
19:35:35.303 > <00000009>client - Sent Enquiry to <00000004>
19:35:35.304 < <00000009>client - Received TestReport from <00000004>
19:35:35.304 X <00000009>client - Destroyed
19:35:35.304 < <00000008>start_vector - Received Completed from <00000009>
19:35:35.304 X <00000008>start_vector - Destroyed
{
"value": [
"ansar.create.test.TestReport",
{
"failed": 3,
"passed": 0,
"tested": [
{
"condition": false,
"line": 51,
"name": "client",
"source": "client.py",
"stamp": "2023-05-30T19:35:35.302931",
"text": "Expected b'fervent' for b'eager', got b'eager'"
},
{
"condition": false,
"line": 56,
"name": "client",
"source": "client.py",
"stamp": "2023-05-30T19:35:35.303186",
"text": "Expected b'define' for b'explain', got b'explain'"
},
{
"condition": false,
"line": 61,
"name": "client",
"source": "client.py",
"stamp": "2023-05-30T19:35:35.3034",
"text": "Expected b'droll' for b'fly', got b'fly'"
}
]
},
[]
]
}
Failed tests appear in the logging stream and all tests appear in the final JSON output. The details retained
for each test can be seen in the tested list. As well as the more obvious inclusion of the condition
and text values, ansar augments the results with the name of the module that performed the test and also
the line number within that module. These values are critical to a good edit-run-debug loop, as
demonstrated in the following sections.
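Because the TestReport is plain JSON, pulling out editor-friendly file:line locations is straightforward. A minimal sketch, assuming only the report shape shown above:

```python
import json

def failed_locations(report_text):
    """Extract 'source:line - text' strings for every failed test in a
    TestReport JSON document of the shape shown above."""
    body = json.loads(report_text)["value"][1]
    return [f"{t['source']}:{t['line']} - {t['text']}"
            for t in body["tested"] if not t["condition"]]
```

Feeding it the output above yields lines like client.py:51 - Expected b'fervent' for b'eager', got b'eager', ready to hand to an editor command such as vi +51 client.py.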
Note
It is worth mentioning that test() has value in an application, independent of whether that application decides to return a TestReport. It is effectively a shorthand for "if condition is false, log this warning". Background collection of test results will eventually reach a maximum and, at that point, begin discarding results. The maximum number retained is kept fairly small (a few hundred) for practical reasons; the collection cannot have infinite size and, in a high-velocity development loop, ever larger numbers of failed tests have decreasing value.
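The "log on failure, collect up to a maximum" behaviour can be sketched with a bounded buffer. The names here (TestCollector, MAX_RETAINED) are hypothetical; this shows the pattern, not ansar's code.

```python
import logging
from collections import deque
from datetime import datetime

MAX_RETAINED = 200  # illustrative cap - the guide says "a few hundred"

class TestCollector:
    def __init__(self):
        # deque(maxlen=...) silently discards the oldest entries once
        # the maximum is reached.
        self.tested = deque(maxlen=MAX_RETAINED)

    def test(self, condition, text, source='client.py', line=0):
        if not condition:
            logging.warning(text)   # failed tests also reach the log stream
        self.tested.append({
            'condition': bool(condition),
            'text': text, 'source': source, 'line': line,
            'stamp': datetime.now().isoformat(),
        })
        return condition
```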
There can be any number of processes like client-0 within a home, performing tests and generating
TestReports. The results from these individual test processes can be inspected with commands such
as ansar log, ansar history and ansar returned.
Custom Handling Of Test Results
If the "test run" information is not sophisticated enough or there is potential for better local integration, the ansar run command also supports passing the TestSuite to a designated executable.
$ ansar run --code-path=. --test-analyzer=analyzer --debug-level=OBJECT
08:40:16.100 + <00000008>lock_and_hold - Created by <00000001>
08:40:16.101 > <00000008>lock_and_hold - Sent Ready to <00000001>
..
08:40:16.102 ^ <0000000b>client - Connected to ('127.0.0.1', 65432)
08:40:16.102 + <0000000c>accepted - Created by <0000000b>
08:40:16.102 ^ <0000000c>accepted - Accepted on ('127.0.0.1', 37696)
08:40:16.105 = <0000000b>client - Expected b'fervent' for b'eager', got b'eager' (client.py:51)
08:40:16.106 = <0000000b>client - Expected b'define' for b'explain', got b'explain' (client.py:56)
08:40:16.106 = <0000000b>client - Expected b'droll' for b'fly', got b'fly' (client.py:61)
..
08:40:16.106 X <00000009>start_vector - Destroyed
08:40:16.203 < <00000008>lock_and_hold - Received Stop from <00000001>
08:40:16.212 X <00000008>lock_and_hold - Destroyed
^C08:40:18.506 < <00000009>start_vector - Received Stop from <00000001>
08:40:18.506 > <00000009>start_vector - Sent Stop to <0000000a>
08:40:18.506 < <0000000a>server - Received Stop from <00000009>
..
08:40:18.806 + <00000008>start_vector - Created by <00000001>
08:40:18.807 + <00000009>analyzer - Created by <00000008>
08:40:18.807 ^ <00000009>analyzer - Analyzed client-0
08:40:18.807 X <00000009>analyzer - Destroyed
08:40:18.807 < <00000008>start_vector - Received Completed from <00000009>
08:40:18.807 X <00000008>start_vector - Destroyed
{
"value": [
"ansar.create.lifecycle.Ack",
{},
[]
]
}
Test results are now passed to the analyzer executable. The analyzer logs the names of roles that
supplied the results and terminates with an Ack. The ansar command assumes that the return
value from the analyzer should be returned as the result for the run itself.
Deployment Of Supporting File Materials
Getting back to the development loop, there are still the 3 failed tests to consider;
08:40:16.105 = <0000000b>client - Expected b'fervent' for b'eager', got b'eager' (client.py:51)
08:40:16.106 = <0000000b>client - Expected b'define' for b'explain', got b'explain' (client.py:56)
08:40:16.106 = <0000000b>client - Expected b'droll' for b'fly', got b'fly' (client.py:61)
There is now a means to quickly navigate through the source code relating to the failed tests. To actually fix
those failures there are two possibilities - either the test needs to change or the server needs to change.
Changing the former is a simple matter of editing the offending line of test code and running the loop again;
$ vi +51 client.py
$ cat client.py
..
word, expect = b'eager', b'eager'
s.sendall(word)
reply = s.recv(1024)
self.test(reply == expect, f"Expected {expect} for {word}, got {reply}")
..
$ pyinstaller --onefile --log-level ERROR -p . client.py
$ ansar run --code-path=. --test-run
role "client-0" (pass/fail): 1/2
/home/dennis/gh/multi-processing-is-more-than-forking/client.py:56 - Expected b'define' for b'explain', got b'explain'
/home/dennis/gh/multi-processing-is-more-than-forking/client.py:61 - Expected b'droll' for b'fly', got b'fly'
Note
Use of the vi command is a means to demonstrate the workflow. As mentioned previously, the editing
process would normally occur within the local IDE.
A test run now reports that there is no longer an issue with the test of eager. In those cases where
the problem lies in the server it is similarly easy, though a small phase of setup is required;
$ ansar -f snapshot testing
$ find testing
testing
testing/settings-by-role
testing/settings-by-role/server-0.json
testing/settings-by-role/client-0.json
testing/resource-by-executable
testing/resource-by-executable/server
testing/resource-by-executable/client
testing/model-by-role
testing/model-by-role/server-0
testing/model-by-role/client-0
The snapshot command takes a snapshot of the current disk storage areas of the default home and arranges them under the folder name provided. Adding the -f flag ensures that any active roles are shut down for the duration of the snapshot. With a safe and up-to-date image of all the operational files required by the home, it is now possible to modify that image and then deploy it to the home.
The first modification is to provide a database of word mappings. The server is nice enough to operate in the absence
of that information, but full operation requires the data to be in place. Loading the server-0 role with an initial set
of mappings looks like this;
$ ansar start server-0
$ cat word-map.json
{
"value": [
["explain", "define"]
]
}
$ cp word-map.json testing/model-by-role/server-0
$ ansar --debug-level=OBJECT -f deploy --storage-path=testing
23:29:19.507 + <00000008>start_vector - Created by <00000001>
23:29:19.507 + <00000009>ansar - Created by <00000008>
23:29:19.508 ^ <00000009>ansar - Detected 1 model changes for "server-0"
23:29:19.509 ^ <00000009>ansar - Detect status of associated roles (server-0)
23:29:19.509 + <0000000a>lock_and_hold - Created by <00000009>
23:29:19.509 X <0000000a>lock_and_hold - Destroyed
23:29:19.509 < <00000009>ansar - Received Completed from <0000000a>
23:29:19.509 ^ <00000009>ansar - Stop roles (server-0)
..
23:29:20.768 < <00000009>ansar - Received Completed from <0000000b>
23:29:20.768 ^ <00000009>ansar - Starting transfer of materials
23:29:20.768 + <0000000c>FolderTransfer[INITIAL] - Created by <00000009>
23:29:20.768 < <0000000c>FolderTransfer[INITIAL] - Received Start from <00000009>
23:29:20.768 + <0000000d>folder_transfer - Created by <0000000c>
23:29:20.769 ^ <0000000d>folder_transfer - File transfer (1 deltas) to /home/dennis/gh/multi-processing-is-more-than-forking/.ansar-home/model/server-0
23:29:20.769 ^ <0000000d>folder_transfer - Move 1 aliases to targets
23:29:20.770 ^ <0000000d>folder_transfer - Clear 0 aliases
23:29:20.770 X <0000000d>folder_transfer - Destroyed
23:29:20.770 < <0000000c>FolderTransfer[RUNNING] - Received Completed from <0000000d>
23:29:20.770 X <0000000c>FolderTransfer[RUNNING] - Destroyed
23:29:20.770 < <00000009>ansar - Received Completed from <0000000c>
23:29:20.770 ^ <00000009>ansar - Completed transfer to "/home/dennis/gh/multi-processing-is-more-than-forking/.ansar-home/model/server-0"
23:29:20.770 ^ <00000009>ansar - Restoring 1 stopped roles
23:29:20.773 + <0000000e>Process[INITIAL] - Created by <00000009>
23:29:20.773 < <0000000e>Process[INITIAL] - Received Start from <00000009>
..
23:29:20.884 < <00000008>start_vector - Received Completed from <00000009>
23:29:20.884 X <00000008>start_vector - Destroyed
$ ansar status
server-0
$ ansar run client-0 --code-path=. --test-run
role "client-0" (pass/fail): 2/1
/home/dennis/gh/multi-processing-is-more-than-forking/client.py:61 - Expected b'droll' for b'fly', got b'fly'
There is now a single failed test - “explain” is successfully being mapped to “define”. Copying a JSON encoding to
the testing/model-by-role/server-0 folder and running the ansar deploy command installs the word-map.json
database within the operational server-0.
For this demonstration the server-0 role was started before the deploy and test run commands. This was to highlight
the awareness of the current operational state and the degree of automation. Internal phases include;
- the detection of changes within the snapshot,
- evaluation of affected roles,
- detection of those affected roles that are also operational,
- terminations as necessary,
- copying of changes,
- and restoring any terminated roles.
Deployment of materials from a snapshot such as testing into an active home is optimized to perform the least I/O possible.
Source and destination areas are compared, resulting in a sequence of delta operations. Once any active roles are
terminated the deltas are executed, bringing the home in-sync with the external snapshot.
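The delta calculation can be sketched by hashing both trees and diffing the results. This is an illustrative approach under assumed semantics, not the actual ansar transfer code.

```python
import hashlib
from pathlib import Path

def folder_delta(source, destination):
    """Return (copy, remove): files needing transfer to bring the
    destination in sync with the source, and files to delete."""
    def digest(folder):
        # Map each file's relative path to a content hash.
        folder = Path(folder)
        return {str(p.relative_to(folder)): hashlib.sha256(p.read_bytes()).hexdigest()
                for p in folder.rglob('*') if p.is_file()}
    src, dst = digest(source), digest(destination)
    copy = sorted(p for p, h in src.items() if dst.get(p) != h)
    remove = sorted(p for p in dst if p not in src)
    return copy, remove
```

Files with matching hashes generate no operations at all, which is why an unchanged home produces the "Nothing to deploy" log seen later in this guide.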
With all the pieces in their correct places, fixing the last remaining failed test is simple;
$ vi +61 testing/model-by-role/server-0/word-map.json
$ cat testing/model-by-role/server-0/word-map.json
{
"value": [
["explain", "define"],
["fly", "droll"]
]
}
$ ansar -f deploy --storage-path=testing
$ ansar run client-0 --code-path=. --test-run
role "client-0" (pass/fail): 3/0
$
A Streamlined, Multi-Process Development Loop
The previous section was a tour through the commands used for the development of a multi-process solution. It also introduced a variation in that basic workflow by starting server-0 and then running the test-client separately.
This section elaborates on that variation to maximize the benefits of ansar deploy and packages the related
commands into a standard makefile.
Start With Nothing
$ make clean
rm -f dist/analyzer dist/busy dist/client dist/factorial dist/noop dist/server dist/snooze dist/zombie
rm -rf testing
ansar -f destroy
The clean target deletes build artefacts and the extracted snapshot and, lastly, removes the composition of processes from the filesystem. Included in that last operation is the termination of any lingering processes, i.e. the non-testing, operational roles.
Create The Multi-Process Configuration
$ make home
pyinstaller --onefile --log-level ERROR -p . analyzer.py
pyinstaller --onefile --log-level ERROR -p . busy.py
..
ansar create
ansar deploy dist
ansar add server
ansar add client test-client
ansar add zombie
ansar set retry test-client --encoding-file=client-retry
ansar snapshot testing
The home target arranges everything such that roles are ready to be executed. This time around
creation does not involve the redirect of the bin folder. Instead, executables are deployed to the home
with the ansar deploy dist command. Copying executables from one folder to another might seem like a burden, but in practice the runtime overhead does not intrude heavily on the workflow and the copying can be significantly minimized: given two folders of executables (a source and a destination) it is possible to calculate a delta and perform optimal updates. The arrangement also separates the build chain from the operational home.
Full discussion of build chains, software pipelines and repo management (mono-repo vs poly-repo) is beyond
the scope of this document. Having the 2 approaches available (i.e. --redirect-bin and ansar deploy) improves the chances of integration with an existing build arrangement.
Establish An Operational State
$ make start
ansar -f start "test-.*" --invert-search
The start target initiates those roles that are being tested, rather than those roles that
perform the testing. The home is now ready for test runs.
Begin Development - What Needs Doing
$ make
ansar --force --debug-level=CONSOLE deploy dist testing
00:05:01.532 ^ <00000009>ansar - Nothing to deploy
ansar run "test-.*" --code-path=. --test-run
role "test-client" (pass/fail): 0/3
/home/dennis/gh/multi-processing-is-more-than-forking/client.py:51 - Expected b'eager' for b'fervent', got b'fervent'
/home/dennis/gh/multi-processing-is-more-than-forking/client.py:56 - Expected b'define' for b'explain', got b'explain'
/home/dennis/gh/multi-processing-is-more-than-forking/client.py:61 - Expected b'droll' for b'fly', got b'fly'
make: *** [Makefile:64: test] Error 1
Omitting the target is a synonym for make test. An ansar deploy command checks build artefacts and the file materials under testing for any necessary updates - the answer in this case is no. It then performs a test run including only those roles that generate test reports. There are 3 familiar failed tests and the make itself terminates with an error, i.e. --test-run affects the exit code of the ansar command.
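The summary lines and exit-code behaviour of --test-run amount to a simple reduction over the collected reports. The sketch below is hypothetical - the input shape (role name mapped to pass/fail counts) is assumed, not ansar's API - but the output format mirrors the transcripts in this guide.

```python
def summarize(reports):
    """reports maps role name -> (passed, failed). Returns the summary
    lines in the --test-run style and the resulting exit code."""
    lines = [f'role "{role}" (pass/fail): {p}/{f}'
             for role, (p, f) in reports.items()]
    exit_code = 0 if all(f == 0 for _, f in reports.values()) else 1
    return lines, exit_code
```

A non-zero exit code is what allows make to stop the loop with an error, exactly as shown above.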
Deploying An Operational File
$ cp word-map.json testing/model-by-role/server-0
$ make
ansar --force --debug-level=CONSOLE deploy dist testing
00:05:23.002 ^ <00000009>ansar - Detected 1 model changes for "server-0"
00:05:23.003 ^ <00000009>ansar - Detect status of associated roles (server-0)
00:05:23.003 ^ <00000009>ansar - Stop roles (server-0)
00:05:23.003 ^ <00000009>ansar - Poll for termination
00:05:24.253 ^ <00000009>ansar - Detect status of associated roles (server-0)
00:05:24.261 ^ <00000009>ansar - Starting transfer of materials
00:05:24.262 ^ <0000000d>folder_transfer - File transfer (1 deltas) to /home/dennis/gh/multi-processing-is-more-than-forking/.ansar-home/model/server-0
00:05:24.263 ^ <0000000d>folder_transfer - Move 1 aliases to targets
00:05:24.264 ^ <0000000d>folder_transfer - Clear 0 aliases
00:05:24.264 ^ <00000009>ansar - Completed transfer to "/home/dennis/gh/multi-processing-is-more-than-forking/.ansar-home/model/server-0"
00:05:24.264 ^ <00000009>ansar - Restoring 1 stopped roles
ansar run "test-.*" --code-path=. --test-run
role "test-client" (pass/fail): 1/2
/home/dennis/gh/multi-processing-is-more-than-forking/client.py:51 - Expected b'eager' for b'fervent', got b'fervent'
/home/dennis/gh/multi-processing-is-more-than-forking/client.py:61 - Expected b'droll' for b'fly', got b'fly'
make: *** [Makefile:64: test] Error 1
Installing the word-map.json clears one of the failing tests.
Editing Source Code
$ vi +51 client.py
$ make
pyinstaller --onefile --log-level ERROR -p . client.py
ansar --force --debug-level=CONSOLE deploy dist testing
00:07:21.597 ^ <00000009>ansar - Detected 1 changing executables (added 1 associated roles)
00:07:21.597 ^ <00000009>ansar - Detect status of associated roles (test-client)
00:07:21.606 ^ <00000009>ansar - Starting transfer of materials
00:07:21.606 ^ <0000000c>folder_transfer - File transfer (1 deltas) to /home/dennis/gh/multi-processing-is-more-than-forking/.ansar-home/bin
00:07:21.613 ^ <0000000c>folder_transfer - Move 1 aliases to targets
00:07:21.614 ^ <0000000c>folder_transfer - Clear 0 aliases
00:07:21.615 ^ <00000009>ansar - Completed transfer to "/home/dennis/gh/multi-processing-is-more-than-forking/.ansar-home/bin"
ansar run "test-.*" --code-path=. --test-run
role "test-client" (pass/fail): 2/1
/home/dennis/gh/multi-processing-is-more-than-forking/client.py:61 - Expected b'droll' for b'fly', got b'fly'
make: *** [Makefile:64: test] Error 1
Editing the test client clears the second failing test.
Modifying An Operational File
$ vi testing/model-by-role/server-0/word-map.json
$ make
ansar --force --debug-level=CONSOLE deploy dist testing
00:08:04.145 ^ <00000009>ansar - Detected 1 model changes for "server-0"
00:08:04.146 ^ <00000009>ansar - Detect status of associated roles (server-0)
00:08:04.146 ^ <00000009>ansar - Stop roles (server-0)
00:08:04.146 ^ <00000009>ansar - Poll for termination
00:08:05.396 ^ <00000009>ansar - Detect status of associated roles (server-0)
00:08:05.427 ^ <00000009>ansar - Starting transfer of materials
00:08:05.428 ^ <0000000d>folder_transfer - File transfer (1 deltas) to /home/dennis/gh/multi-processing-is-more-than-forking/.ansar-home/model/server-0
00:08:05.429 ^ <0000000d>folder_transfer - Move 1 aliases to targets
00:08:05.430 ^ <0000000d>folder_transfer - Clear 0 aliases
00:08:05.430 ^ <00000009>ansar - Completed transfer to "/home/dennis/gh/multi-processing-is-more-than-forking/.ansar-home/model/server-0"
00:08:05.430 ^ <00000009>ansar - Restoring 1 stopped roles
ansar run "test-.*" --code-path=. --test-run
$ make
ansar --force --debug-level=CONSOLE deploy dist testing
00:08:26.570 ^ <00000009>ansar - Nothing to deploy
ansar run "test-.*" --code-path=. --test-run
$ ansar status -l
server-0 <960962> 17.2s
zombie-0 <960776> 1m46.3s
Adding another word mapping to word-map.json clears the last failing test. The make command now completes without error, i.e. the test run is producing a zero exit code. Repeating the make command shows the ansar deploy command detecting that nothing significant has changed. Lastly, the ansar status command is used to show that zombie-0 has been left undisturbed - its runtime is significantly longer than the server-0 runtime - through all the changes to code and application files.
Summary Of The Multi-Process Development Loop
Development activity is reduced to making the necessary changes and running the make command. This inherently allows the developer to focus on the problems and potential fixes, without the distraction of correctly propagating changes on every iteration of the loop - propagation that can involve multiple source files, the building of executables, the deployment of application files and the management of operational processes.
Adding Some Workload
Simulation of operational workloads can be difficult; in the case of network servers, multiple client processes will immediately be needed.
Ansar supports the creation of multiple processes with a single command. More accurately, this is the creation of multiple roles for one common executable.
$ ansar add client 'test-client-{number}' --start=10 --count=40
$ ansar set retry 'test-client-\d\d' --encoding-file=client-retry
$ ansar list
server-0
test-client
test-client-10
test-client-11
test-client-12
test-client-13
test-client-14
test-client-15
..
test-client-49
zombie-0
$ ansar get retry test-client-49
{
"value": {
"first_steps": [
1.0,
2.0,
4.0
]
}
}
When adding a role, the role name is optional and often omitted, e.g. adding a role for the client executable results in the default name client-0. This is because the role name is treated as a template to be expanded into the final name; the default template is {executable}-{number}. Templates are permitted to reference the variables executable and number, where the former expands to the executable being added and the latter expands to a runtime integer. Internally, the add command performs a loop based on the start and count variables, which default to 0 and 1 respectively. For each iteration it expands the template, passing the current loop index as number, before populating the home with the new definition. By setting the start and count values on the command line, a single add command can create multiple roles.
The ansar add command above adds 40 new roles, from test-client-10 through to test-client-49. The ansar set command caters to these situations by accepting a search pattern and applying the change to each match. Setting the retry property is probably more important in this scenario - multiple clients will be clamouring for the attention of the lone server.
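The template expansion performed by the add command can be sketched as a plain loop. This reproduces the naming behaviour described above but is illustrative, not ansar's code.

```python
def expand_roles(executable, template=None, start=0, count=1):
    """Expand a role-name template, as described for the add command:
    one role per loop iteration, the index substituted as `number`."""
    template = template or '{executable}-{number}'
    return [template.format(executable=executable, number=start + i)
            for i in range(count)]
```

For example, expand_roles('client') gives ['client-0'], while expand_roles('client', 'test-client-{number}', start=10, count=40) produces test-client-10 through test-client-49.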
$ make
ansar --force --debug-level=CONSOLE deploy dist testing
02:28:17.017 ^ <00000009>ansar - Nothing to deploy
ansar --debug-level=DEBUG run "test-.*" --code-path=. --test-run
02:28:17.317 + <00000008>start_vector - Created by <00000001>
02:28:17.317 ~ <00000008>start_vector - Executable "/home/dennis/gh/multi-processing-is-more-than-forking/.dev/bin/ansar" as process (964752)
02:28:17.317 ~ <00000008>start_vector - Working folder "/home/dennis/gh/multi-processing-is-more-than-forking"
02:28:17.317 ~ <00000008>start_vector - Running object "ansar.command.ansar_command.ansar"
02:28:17.317 ~ <00000008>start_vector - Class threads (1) "retries" (1)
02:28:17.317 + <00000009>ansar - Created by <00000008>
02:28:17.317 ~ <00000009>ansar - Call the sub-command function
02:28:17.318 ^ <00000009>ansar - Detect status of associated roles (test-client-27, test-client-15, test-client-38, test-client-24, test-client-23, test-client-36, test-client-18, test-client-31, test-client-32, test-client-43, test-client-35, test-client-19, test-client-41, test-client-49, test-client-10, test-client-28, test-client-11, test-client-48, test-client-39, test-client-44, test-client-40, test-client-22, test-client-45, test-client-21, test-client-30, test-client-46, test-client-26, test-client-20, test-client-37, test-client-47, test-client-25, test-client-17, test-client-33, test-client-13, test-client-34, test-client, test-client-12, test-client-16, test-client-42, test-client-14, test-client-29)
02:28:17.318 + <0000000a>lock_and_hold - Created by <00000009>
02:28:17.318 > <0000000a>lock_and_hold - Sent Ready to <00000009>
02:28:17.318 + <0000000b>lock_and_hold - Created by <00000009>
02:28:17.319 + <0000000c>lock_and_hold - Created by <00000009>
..
02:28:17.372 X <00000014>lock_and_hold - Destroyed
02:28:17.373 < <00000009>ansar - Received Completed from <00000014>
02:28:17.373 X <0000000f>lock_and_hold - Destroyed
02:28:17.373 < <00000009>ansar - Received Completed from <0000000f>
02:28:17.373 ^ <00000009>ansar - Running "test-.*" (.ansar-home)
02:28:17.373 + <00000033>Process[INITIAL] - Created by <00000009>
02:28:17.373 < <00000033>Process[INITIAL] - Received Start from <00000009>
02:28:17.373 ~ <00000033>Process[INITIAL] - Execute /home/dennis/gh/multi-processing-is-more-than-forking/.ansar-home/bin/client --call-signature=o --point-of-origin=1 --home-path=/home/dennis/gh/multi-processing-is-more-than-forking/.ansar-home --role-name=test-client-27
02:28:17.373 ( <00000033>Process[INITIAL] - Started process (964802)
02:28:17.373 + <00000034>wait - Created by <00000033>
02:28:17.374 + <00000035>Process[INITIAL] - Created by <00000009>
02:28:17.374 < <00000035>Process[INITIAL] - Received Start from <00000009>
02:28:17.374 ~ <00000035>Process[INITIAL] - Execute /home/dennis/gh/multi-processing-is-more-than-forking/.ansar-home/bin/client --call-signature=o --point-of-origin=1 --home-path=/home/dennis/gh/multi-processing-is-more-than-forking/.ansar-home --role-name=test-client-15
02:28:17.374 ( <00000035>Process[INITIAL] - Started process (964804)
02:28:17.374 + <00000036>wait - Created by <00000035>
02:28:17.374 + <00000037>Process[INITIAL] - Created by <00000009>
02:28:17.374 < <00000037>Process[INITIAL] - Received Start from <00000009>
02:28:17.374 ~ <00000037>Process[INITIAL] - Execute /home/dennis/gh/multi-processing-is-more-than-forking/.ansar-home/bin/client --call-signature=o --point-of-origin=1 --home-path=/home/dennis/gh/multi-processing-is-more-than-forking/.ansar-home --role-name=test-client-38
02:28:17.374 ( <00000037>Process[INITIAL] - Started process (964806)
..
02:28:18.208 < <00000069>Process[EXECUTING] - Received Completed from <00000083>
02:28:18.208 ) <00000069>Process[EXECUTING] - Process (965116) ended with 0
02:28:18.208 X <00000069>Process[EXECUTING] - Destroyed
02:28:18.208 < <00000009>ansar - Received Completed from <00000069>
02:28:18.208 ^ <00000009>ansar - Completion for "test-client-14" (<ansar.create.test.TestReport object at 0x7fce7c430d00>)
02:28:18.210 X <00000084>wait - Destroyed
02:28:18.211 < <0000006a>Process[EXECUTING] - Received Completed from <00000084>
02:28:18.211 ) <0000006a>Process[EXECUTING] - Process (965123) ended with 0
02:28:18.211 X <0000006a>Process[EXECUTING] - Destroyed
02:28:18.211 < <00000009>ansar - Received Completed from <0000006a>
02:28:18.211 ^ <00000009>ansar - Completion for "test-client-29" (<ansar.create.test.TestReport object at 0x7fce7c433580>)
role "test-client-15" (pass/fail): 3/0
role "test-client-35" (pass/fail): 3/0
role "test-client-19" (pass/fail): 3/0
role "test-client-24" (pass/fail): 3/0
role "test-client-43" (pass/fail): 3/0
role "test-client-31" (pass/fail): 3/0
role "test-client-36" (pass/fail): 3/0
role "test-client-27" (pass/fail): 3/0
role "test-client-23" (pass/fail): 3/0
role "test-client-41" (pass/fail): 3/0
role "test-client-10" (pass/fail): 3/0
role "test-client-38" (pass/fail): 3/0
..
role "test-client" (pass/fail): 3/0
role "test-client-13" (pass/fail): 3/0
role "test-client-12" (pass/fail): 3/0
role "test-client-16" (pass/fail): 3/0
role "test-client-42" (pass/fail): 3/0
role "test-client-14" (pass/fail): 3/0
role "test-client-29" (pass/fail): 3/0
02:28:18.272 X <00000009>ansar - Destroyed
02:28:18.272 < <00000008>start_vector - Received Completed from <00000009>
02:28:18.272 X <00000008>start_vector - Destroyed
The same make command now runs a much more ambitious test workload. All the test clients completed successfully, passing their respective TestReports back to the run command. Those reports indicated that the new test run encountered zero failures. This is the expected outcome, given that all test failures had been cleared and every instance of test-client-xx is a replica of a test process that was already passing.