Toward improving correctness when running on a single core system ...
I recognize that most real computer systems now are equipped with multiple CPUs, usually with each CPU having multiple cores or hardware threads. But when one obtains a computer system from a service provider for a small expected load, it is likely that the virtual computer system provided will have only a single core. Yes, one can obtain a multi-core virtual system, maybe for only twice the cost; but it would seem wasteful to make that choice solely to support starting and stopping the Mailman3 program, when at all other times the system load is satisfied by a single core.
From my observations, mailman 3.3.8 is unreliable when running on a single core system. When 'mailman start' is issued, there is a moderately high probability that one or more of the thirteen runner processes will not survive this startup phase. Depending upon the current usage of the mailman installation, one may or may not notice this fact. In either case, 'mailman status' provides the same information, with no hint of missing or damaged capabilities when one or more of the runner processes are stillborn. [1]
One potential improvement would be to increase the lifetime of the locks used by each of the runner processes during 'mailman start'. [2] Based on a small sample of runs of 'mailman start', this change may provide correctness (i.e. I have yet to see a runner process go MIA). But it still seems unsatisfactory for availability / performance, as the elapsed wall clock time for the system to settle down and be functional and responsive is approximately 13 minutes. While this is an improvement over the 20 minute or greater delays observed before making this change, it still differs greatly from the nearly instantaneous startup time experienced on multicore / multiprocessor systems.
The best "performance" I have yet obtained on this single core system is WITHOUT the change in [2] below. Instead, it comes from crossing abstraction boundaries and reducing the amount of time this single CPU spends futilely busy waiting for itself, when it has not provided an opportunity for another process (maybe even the process currently holding the desired lock) to make any progress. [3] With this change, "only" 9 minutes of elapsed wall clock time is wasted (with maybe only the middle 4 minutes at 100% CPU utilization) before the system settles down and is functional and responsive. However, ample room for improvement remains. [4]
Please suggest better ideas or corrections.
Nelson
[1] For a few examples of these failures, see: https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/thread/F...
[2] Here is a diff from the original mailman/database/factory.py (or .../mailman/.local/lib/python3.9/site-packages/mailman/database/factory.py if you wish to apply this to an installed system):

    25a26
    > from datetime import timedelta
    53c54,55
    <     with Lock(os.path.join(config.LOCK_DIR, 'dbcreate.lck')):
    ---
    >     with Lock(os.path.join(config.LOCK_DIR, 'dbcreate.lck'),
    >               timedelta(seconds=30)):

This increases the delay from the default of 15 seconds to 30 seconds before another process will be empowered to break the lock being held by one of mailman's runner processes; breaking the lock leads to the death of that process.

[3] Here is a diff from the original .../mailman/.local/lib/python3.9/site-packages/flufl/lock/_lockfile.py:

    7a8
    > import subprocess
    338a340,344
    >         if sys.platform.startswith('linux') and \
    >                 (len(subprocess.check_output(["lscpu", "-p"]).splitlines()) == 5):
    >             # there is only one processor
    >             # [four lines of comments, and only one line for one CPU]
    >             # let another process run before joining the traffic jam for this lock
    >             os.system('sleep 20s')

For possible similar benefits on other platforms, one should also include variations on this theme using something like e.g.:

    if sys.platform.startswith('win32') and \
            subprocess.check_output('wmic path win32_Processor get NumberOfLogicalProcessors').strip().endswith(b'1'):

I seriously doubt all (or even any?) other users of the flufl.lock package would benefit from this exact change ... but many may benefit from sleeping for e.g. a half-second when on a single core? In the case of Mailman3, sleeping for a fixed duration shorter than 20 seconds retains the original problem of broken locks and runner process death, while sleeping for a fixed duration longer than 20 seconds increases both the wall clock delay and the CPU time consumed during 'mailman start'. I have not experimented with random or staggered delays, which may further reduce the number of simultaneously blocked waiters for this lock.
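For what it's worth, the single-core test and the untried staggered delay could be sketched portably with the standard library alone. Here os.cpu_count() is my substitution for parsing lscpu / wmic output, and the function name and delay values are made up for illustration:

```python
import os
import random
import time

def stagger_if_single_cpu(base=5.0, jitter=5.0):
    """On a single-CPU host, sleep a randomized interval before contending
    for a lock, so simultaneous waiters spread out their attempts.
    A sketch only; base/jitter values are hypothetical."""
    # os.cpu_count() may return None when the count cannot be determined.
    if os.cpu_count() == 1:
        time.sleep(base + random.uniform(0, jitter))
```

Whether a randomized 5-10 second stagger would actually avoid the broken-lock problem is untested; fixed delays under 20 seconds were found insufficient above.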
[4] I have seen 'mailman start' complete in less than 1 second wall clock time on a multicore / multiprocessor system. All of these observations were made with mailman 3.3.8 as installed via pip.
Nelson Strother writes:
From my observations, mailman 3.3.8 is unreliable when running on a single core system.
"A single core system" == "your single core system", right? Have you confirmed this on other single core systems? I can't confirm at the moment; I don't think I have access to Mailman running on a single core system. (I guess Linux provides ways to lock a process group to a single core, but the RDBMS is probably an independent process group, and that probably matters. It will be a while before I can confirm.)
How about for other versions of Mailman 3?
In either case, 'mailman status' provides the same information, with no hint of missing or damaged capabilities when one or more of the runner processes are stillborn.
If so, we should fix 'mailman status'. Presumably there's also a chance for the runner to detect and log the expired lock before it dies.
One potential improvement would be to increase the lifetime of the locks used by each of the runner processes during 'mailman start'.
The resource being locked is the core database. One guess about the difference with multicore systems is that on them the RDBMS runs on a different core from Mailman, but that doesn't explain why runner initialization takes so long that locks expire. If a #cores > 1 system initializes in second(s), a 1 core system "should" initialize in #cores * seconds at most.
So I guess something else is accessing the database and interfering with the runners (or vice versa). It's possible that changing to a different RDBMS (or a different transaction granularity) will alleviate the problem. (This is not a fix because somebody isn't locking when they should, or perhaps using the wrong lock. But it might help.)
Another workaround would be to have the runners start synchronously (ie, have the master process wait on each one to signal initialization complete before starting the next one), assuming the current approach is "fork and forget".
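Sketching that suggestion, under the stated assumption that the current approach is "fork and forget": the master starts each worker, then waits for it to signal readiness before forking the next. The function and names here are hypothetical illustrations, not Mailman's actual master code.

```python
import multiprocessing as mp

def runner(ready, name):
    # ... runner initialization (e.g. taking the database lock) goes here ...
    ready.set()   # tell the master that initialization is complete
    # ... the runner's main loop would follow ...

def start_runners(names, timeout=30):
    """Start one process per name, serializing initialization."""
    procs = []
    for name in names:
        ready = mp.Event()
        proc = mp.Process(target=runner, args=(ready, name), name=name)
        proc.start()
        # Do not fork the next runner until this one reports it is up.
        if not ready.wait(timeout):
            raise RuntimeError('runner %s failed to initialize' % name)
        procs.append(proc)
    return procs
```

On a single core this trades a little parallel startup for far less lock contention, since at most one runner at a time is racing for the initialization lock.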
But it still seems unsatisfactory for availability / performance, as the elapsed wall clock time for the system to settle down and be functional and responsive is approximately 13 minutes. [...] The best "performance" I have yet obtained on this single core system is [by] [3] [which waits 20s before trying to obtain the lock].
Is 20s necessary? There are a lot of runners, you may be able to cut the startup time by 2-3 minutes if you can cut that to 5s or less. That would still be 100x longer than the naive #cores * seconds estimate, but perhaps significant.
Steve
[My apologies for the mangled formatting of the diffs in [2] and [3] above, at least as shown on the web view of this list. I will happily repost or resend by direct email if anyone expresses interest.]
Presumably there's also a chance for the runner to detect and log the expired lock before it dies.
This may require some cleverness, as the delays introduced by recording a log message, e.g. "I'm the nntp runner holding the lock, and it has not expired yet.", are likely to change the results of the lock competition, maybe even allowing all runner processes to continue as intended. It is unclear to me whether a process would ever have an opportunity to record a log message, e.g. "I'm the nntp runner holding the lock which has expired, but I am not dead yet.", before it dies. If such an opportunity exists, it should extend the lock's lifetime via e.g. Lock.refresh(timedelta(seconds=10)).
As I attempted to summarize at the end of [3] above, each time I tried shorter sleep intervals (1, 10, 15 seconds) within .../flufl/lock/_lockfile.py the original problem remains, namely one or more locks are broken and runner processes die.
I'm currently running the entire mailman3 stack, including postgres and postfix, plus my production web server, all on a GCP e2-small instance. e2-small is two virtual CPUs running on a single physical core, restricted to consuming at most 50% of the CPU cycles on that core. The system has 2 GB of memory, 10GB of disk space.
The entire stack runs well, albeit slowly, on this configuration, for which I pay about $13/month. The entire stack is shut down every night so I can take a snapshot while the applications are down. Once the snapshot is complete, the stack is restarted. I have no startup issues in this configuration.
It is hard to imagine a more limited pool of resources, yet everything works.
The key difference between Stephen Daniel's configuration and the one I am running is the amount of memory, 512 MB in my case. Everything is fine (e.g. less than 60% of memory in use) all day long except when Mailman3 is starting and stopping. I see no evidence of the Out Of Memory reaper getting involved, but there is some swapping. E.g. the kswapd0 process will consume 30 seconds to a minute and a half of CPU time during the mailman startup phase. Obviously if a process holding the lock is swapped out, delays and unhappiness will ensue.
I did invest the time for another experiment using a computer I own and to which I have full physical access. In its BIOS, I disabled simultaneous multi-threading mode and reduced the number of compute units to one, so it temporarily behaved as a single-core system. In that mode, 'mailman start' completes promptly with about 10 seconds of 100% CPU utilization. Attempting to reduce the gigabytes of memory on that system with 'chmem --disable ...' down to 1 gigabyte made no appreciable progress in an hour, so I gave up on exploring whether 1 gigabyte or 768 megabytes or ... would be the magic minimum amount of memory to prevent these failures.
I have not noticed any documentation stating minimum resource requirements for running Mailman3. Even if we found accurate answers under each operating system today, that knowledge will expire at some future date as other parts of the system may become bloated, even if Mailman3's appetite for resources does not grow.
Since the lockfile contains the process ID of the runner holding the lock, how painful might it be, when an attempt to claim a held lock fails, to detect whether the owning process is runnable vs. swapped out, across the supported operating system platforms? Then the question would be whether Mailman3 merely informs the user "cannot run here, give me more memory", or whether sending a signal (e.g. SIGCONT ?) to the owning process might accelerate how soon the lock-owning process is swapped in so that constructive progress resumes.
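On Linux, one cheap approximation is to read the state field from /proc/<pid>/stat for the PID found in the lockfile. This is a hypothetical helper, not an existing flufl.lock API, and other platforms would need their own variants:

```python
import os

def lock_owner_state(pid):
    """Return 'gone' if the process no longer exists, else its Linux state
    code ('R' running, 'S' sleeping, 'D' uninterruptible/disk wait, ...),
    or 'unknown' where /proc is unavailable. Sketch only."""
    try:
        os.kill(pid, 0)      # signal 0: existence check, no signal delivered
    except ProcessLookupError:
        return 'gone'
    except PermissionError:
        pass                 # exists, but owned by another user
    try:
        with open('/proc/%d/stat' % pid) as f:
            # The comm field may contain spaces; split after the last ')'.
            return f.read().rsplit(')', 1)[1].split()[0]
    except (FileNotFoundError, IndexError):
        return 'unknown'
```

Note that /proc state codes do not directly say "swapped out"; a swapped process usually just shows as sleeping, so this only narrows the diagnosis.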
One of the simpler ways of experimenting with mailman on a restricted system is using an SBC - something like a Raspberry Pi 2 or 3, or a Beaglebone, or similar. Both have limited memory and CPU, and are relatively slow systems. I would avoid a Pi 4 because it's too powerful in this context!
Logging errors always changes the behaviour, but you can do much by noting things in internal variables and then printing those after the event (e.g. set a variable to count the number of times a try-lock operation found the lock held), and perhaps then doing the logging to a memory filesystem (Linux: tmpfs), which will have much lower latency. Normally a Linux /tmp uses tmpfs.
A multicore system will have different behaviour to a single core one, but the kernel will almost always be preemptive (forced timeslices). So one option might be to experiment with a fast-tick (e.g. 1000Hz or higher) kernel such that the timeslices are shorter. (It used to be the case that Linux ran at 100Hz. I know most kernels are now "tick-less", meaning there is no persistent timer interrupt marking time, but there will still be a unit of timeslice.)
A final option: if the problem is (as it appears to be) one of starvation, then perhaps the system is better run (on that hardware) with fewer worker processes. That is, cut the 13 workers down to e.g. 8. That said, starvation is something that can normally be designed out of systems, though I know far too little about the architecture here to say whether that is the case here.
HTH,
Ruth
-- Software Manager & Engineer Tel: 01223 414180 Blog: http://www.ivimey.org/blog LinkedIn: http://uk.linkedin.com/in/ruthivimeycook/
I have been following this thread with curiosity. Is there any current Linux OS that recommends single core hardware with 500MB RAM for a production server? I could whip up a VM with such specs on my laptop using VirtualBox or VMware, but what's the OS? And is it for a production environment? I don't understand what this thread is trying to prove in the 21st century. I'm sorry to say it's just a waste of time. Perhaps you should try another system for a mailing list.
The just-released Debian 12 is free enough of bloatware that, per https://www.debian.org/releases/bookworm/amd64/ch03s04.en.html, a server without a GUI desktop has a minimum RAM requirement of 256 megabytes and a recommended 512 megabytes. These memory requirements are unchanged from what was documented when Debian 11 was released, and per my observations Debian 11 and Mailman3 play nicely together on such a system, except sometimes during startup.
Why increase the carbon footprint of the hosting provider as they use VMware to slice up their resources (by asking for more than you need)?
Nelson Strother writes:
Presumably there's also a chance for the runner to detect and log the expired lock before it dies.
This may require some cleverness, as the delays introduced by recording a log message e.g. "I'm the nntp runner holding the lock, and it has not expired yet." are likely to change the results of the lock competition, maybe even allowing all runner processes to continue as intended.
Unless the runner is killed with an uncatchable signal, there likely is a point before exit to do this, and no, *all* runner processes won't continue, because this one is already dying in my suggestion. I understand there may be cases where you can't find a place to put the log, but I don't think it's worth treating this as a recoverable error if it happens that after the message is logged the initialization somehow completes successfully -- once you detect an expired lock, give up.
It is unclear to me whether a process would ever have an opportunity to record a log message e.g. "I'm the nntp runner holding the lock which has expired, but I am not dead yet." before it dies.
Agreed. Until it's clear that the opportunity doesn't exist, this idea may provide a faster, simpler way to diagnose this kind of problem.
If such an opportunity exists, it should extend the lock's lifetime via e.g. Lock.refresh(timedelta(seconds=10)).
I disagree. The point of setting an expiration is that you know the process is misbehaving if it holds the lock that long. Setting a longer expiration is a reasonable workaround until we understand and fix the misbehavior. But allowing the misbehaving process to extend possession of the lock is a bad idea.
As I attempted to summarize at the end of [3] above, each time I tried shorter sleep intervals (1, 10, 15 seconds) within .../flufl/lock/_lockfile.py
It wasn't clear to me which experiment that referred to. Thank you for clarifying.
participants (5)
- Nelson Strother
- Odhiambo Washington
- Ruth Ivimey-Cook
- Stephen Daniel
- Stephen J. Turnbull