I recognize that most real computer systems are now equipped with multiple CPUs, usually with each CPU having multiple cores or hardware threads. But when one obtains a computer system from a service provider with a small expected load, it is likely that the virtual computer system will provide only a single core. Yes, one can obtain a multi-core virtual system, perhaps for only twice the cost; but it seems wasteful to make that choice solely to support starting and stopping the Mailman 3 program, when the system load at all other times is handled comfortably by a single core.
From my observations, Mailman 3.3.8 is unreliable when running on a
single-core system. When "mailman start" is issued, there is a
moderately high probability that one or more of the thirteen runner
processes will not survive this startup phase. Depending upon the
current usage of the Mailman installation, one may or may not notice
this fact. In either case, "mailman status" provides the same
information, with no hint of missing or damaged capabilities when one or
more of the runner processes are stillborn. [1]
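For anyone wanting to observe this directly, one hedged way to count the surviving runner processes from a shell is sketched below (the process name "runner" is my assumption about how the pip-installed runners appear in the process table; adjust the pattern for your installation):

```shell
# Count processes whose command line mentions "runner"; the bracket
# trick keeps the grep process itself out of the count.  A healthy
# 3.3.8 installation, as described above, has thirteen runners.
ps -ef | grep '[r]unner' | wc -l
```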
One potential improvement would be to increase the lifetime of the locks
used by each of the runner processes during "mailman start". [2] Based
on a small sample of runs of "mailman start", this change may provide
correctness (i.e. I have yet to see a runner process go MIA). But it
still seems unsatisfactory for availability / performance, as the
elapsed wall clock time for the system to settle down and become functional
and responsive is approximately 13 minutes. While this is an improvement
over the 20-minute-or-greater delays observed before making this change,
it still differs greatly from the nearly instantaneous startup time
experienced on multicore / multiprocessor systems.
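To make the failure mode concrete, here is a toy model of a breakable lock, using only the standard library. It is a deliberate simplification of flufl.lock's actual protocol (which encodes expiry in the lock's link target rather than in file age), but it sketches why a short lifetime is dangerous on a starved single core: a healthy-but-slow lock holder is indistinguishable from a dead one, so a waiter "helpfully" breaks a live process's lock.

```python
import os
import time

def acquire(lockfile: str, lifetime: float = 30.0, poll: float = 0.25) -> None:
    """Toy breakable lock (a simplification of flufl.lock's protocol).

    A waiter that sees a lock file older than `lifetime` assumes the
    holder is dead and breaks the lock.  On a starved single core the
    holder may merely be slow, so the break kills a live process's
    lock -- the failure mode described above.
    """
    while True:
        try:
            fd = os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return
        except FileExistsError:
            try:
                age = time.time() - os.path.getmtime(lockfile)
            except FileNotFoundError:
                continue                  # holder released; retry at once
            if age > lifetime:
                try:
                    os.unlink(lockfile)   # break the "stale" lock
                except FileNotFoundError:
                    pass                  # someone else broke it first
            time.sleep(poll)

def release(lockfile: str) -> None:
    os.unlink(lockfile)
```

In this model, doubling the lifetime (as in [2]) simply gives a slow holder twice as long before a waiter mistakes it for a corpse.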
The best "performance" I have yet obtained on this single-core system is WITHOUT the change in [2] below. Instead, by crossing abstraction boundaries, I reduced the amount of time this single CPU spends futilely busy-waiting for itself, during which it gives no other process (perhaps even the process currently holding the desired lock) an opportunity to make any progress. [3] With this change, "only" 9 minutes of elapsed wall clock time is wasted (with perhaps only the middle 4 minutes at 100% CPU utilization) before the system settles down and is functional and responsive. However, ample room for improvement remains. [4]
Please suggest better ideas or corrections.
Nelson
[1] For a few examples of these failures, see: https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/thread/F...
[2] Here is a diff from the original mailman/database/factory.py (or .../mailman/.local/lib/python3.9/site-packages/mailman/database/factory.py if you wish to apply this to an installed system):

25a26
> from datetime import timedelta
53c54,55
<     with Lock(os.path.join(config.LOCK_DIR, 'dbcreate.lck')):
---
>     with Lock(os.path.join(config.LOCK_DIR, 'dbcreate.lck'),
>               timedelta(seconds=30)):

This increases the delay from the default of 15 seconds to 30 seconds before another process is empowered to break the lock held by one of Mailman's runner processes, which is what leads to the death of a process.
[3] Here is a diff from the original
.../mailman/.local/lib/python3.9/site-packages/flufl/lock/_lockfile.py :

7a8
> import subprocess
338a340,344
>         if sys.platform.startswith('linux') and \
>                 (len(subprocess.check_output(["lscpu", "-p"]).splitlines()) == 5):
>             # there is only one processor [four lines of comments, and only one line for one CPU]
>             # let another process run before joining the traffic jam for this lock
>             os.system('sleep 20s')

For possible similar benefits on other platforms, one should also include variations on this theme using something like e.g.:

if sys.platform.startswith('win32') and \
        subprocess.check_output('wmic path win32_Processor get NumberOfLogicalProcessors').strip().endswith(b'1'):

I seriously doubt all (or even any?) other users of the flufl.lock
package would benefit from this exact change ... but many may benefit
from sleeping for e.g. a half-second when on a single core? In the case
of Mailman3, sleeping for a fixed duration shorter than 20 seconds
retains the original problem of broken locks and runner process death,
while sleeping for a fixed duration longer than 20 seconds leads to an
increase in both the wall clock delay time and the CPU time
consumed during "mailman start". I have not experimented with random or
staggered delays, which may further reduce the number of simultaneously
blocked waiters for this lock.
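For completeness, a randomized delay of the kind speculated about above might look like the following sketch (the 20-second base mirrors the fixed sleep in [3]; the jitter bounds are assumptions I have not measured):

```python
import random
import time

def staggered_sleep(base: float = 20.0, jitter: float = 5.0) -> float:
    # Sleep for base +/- jitter seconds, so that waiters for the lock
    # wake at different moments instead of stampeding it simultaneously.
    delay = base + random.uniform(-jitter, jitter)
    time.sleep(delay)
    return delay
```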
[4] I have seen "mailman start" complete in less than 1 second wall clock
time on a multicore / multiprocessor system. All of these observations
made with Mailman 3.3.8 as installed via pip.
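Finally, as a more portable alternative to the platform-specific lscpu / wmic probes in [3], the standard library can report the logical CPU count directly. A sketch (note that os.cpu_count() can return None when the count is undeterminable; I treat that case as multi-core so the extra sleep is never added where it may not be needed):

```python
import os

def single_core() -> bool:
    # os.cpu_count() returns the number of logical CPUs, or None when
    # it cannot be determined; only an unambiguous count of 1 triggers
    # the single-core workaround.
    return os.cpu_count() == 1
```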