[MM3-users] toward improving correctness when running on single core system ...

June 19, 2023


      Nelson Strother writes:
...
> From my observations, mailman 3.3.8 is unreliable when running on
> a single core system.

"A single core system" == "your single core system", right?  Have you
confirmed this on other single core systems?  I can't confirm at the
moment; I don't think I have access to Mailman running on a single
core system.  (I guess Linux provides ways to lock a process group to
a single core, but the RDBMS is probably an independent process group,
and that probably matters.  It will be a while before I can confirm.)
How about for other versions of Mailman 3?
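
For anyone who wants to try reproducing this on a multicore Linux box, here
is a minimal sketch of the "lock to a single core" idea using Python's
os.sched_setaffinity.  The helper name pin_to_single_core is made up for
illustration, and it only constrains processes started from the pinned
process; an RDBMS running as a separate service would not be affected.

```python
import os

def pin_to_single_core():
    """Restrict this process, and children started later, to one core.

    Linux only: os.sched_getaffinity / os.sched_setaffinity are not
    available on every platform.
    """
    allowed = os.sched_getaffinity(0)   # pid 0 means "the calling process"
    core = min(allowed)                 # pick a core we are allowed to use
    os.sched_setaffinity(0, {core})
    return core
```

Child processes inherit the affinity mask, so whatever is launched after
calling pin_to_single_core() (for example via subprocess) shares that one
core.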
...
> In either case, mailman status provides the same information,
> with no hint of missing or damaged capabilities when one or more of
> the runner processes are stillborn.

If so, we should fix mailman status.  Presumably there's also a
chance for the runner to detect and log the expired lock before it
dies.
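
As a rough illustration of the kind of check a status command could make,
the sketch below tests whether recorded pids still exist.  This is not
Mailman's code: pid_is_alive, missing_runners, and the name-to-pid mapping
are hypothetical, and how the pids would be recorded is left open.

```python
import os

def pid_is_alive(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)           # signal 0: existence check, nothing is sent
    except ProcessLookupError:
        return False              # no such process
    except PermissionError:
        return True               # exists, but belongs to another user
    return True

def missing_runners(expected):
    """expected maps a runner name to the pid recorded when it was started.

    Returns the sorted names whose process no longer exists.
    """
    return sorted(name for name, pid in expected.items()
                  if not pid_is_alive(pid))
```

One caveat: a child that has exited but has not yet been reaped by its
parent still passes this check, so the parent needs to wait on its
children for the result to be meaningful.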
...
> One potential improvement would be to increase the lifetime of the
> locks used by each of the runner processes during mailman start.

The resource being locked is the core database.  One guess about the
difference with multicore systems is that on them the RDBMS runs on a
different core from Mailman, but that doesn't explain why runner
initialization takes so long that locks expire.  If a #cores > 1
system initializes in second(s), a 1 core system "should" initialize
in #cores * seconds at most.
So I guess something else is accessing the database and interfering
with the runners (or vice versa).  It's possible that changing to a
different RDBMS (or a different transaction granularity) will
alleviate the problem.  (This is not a fix, because the underlying bug
would remain: somebody isn't locking when they should, or is perhaps
using the wrong lock.  But it might help.)

Another workaround would be to have the runners start synchronously
(i.e., have the master process wait on each one to signal initialization
complete before starting the next one), assuming the current approach
is "fork and forget".
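
To make the synchronous idea concrete, here is a small self-contained
sketch.  It is not how Mailman's master works: the child is a stand-in
that prints one "ready" line when its initialization is done, and
start_synchronously and CHILD are names invented for the example.

```python
import subprocess
import sys

# Stand-in for a runner: "initializes", then reports readiness by writing
# one line to stdout.  Real runners would need some equivalent signal.
CHILD = "import sys; print('ready', sys.argv[1], flush=True)"

def start_synchronously(names):
    """Start one child per name, waiting for each to report ready
    before starting the next one."""
    procs, order = [], []
    for name in names:
        p = subprocess.Popen([sys.executable, "-c", CHILD, name],
                             stdout=subprocess.PIPE, text=True)
        line = p.stdout.readline()      # blocks until the child reports in
        if not line.startswith("ready"):
            # An empty read means the child exited without signalling.
            raise RuntimeError(f"{name} exited before signalling readiness")
        procs.append(p)
        order.append(line.split()[1])
    return procs, order
```

The sketch has no timeout, so a child that hangs during initialization
would block the loop; a real implementation would want one.  On the plus
side, a child that dies early is noticed immediately instead of silently.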
...
> But it still seems unsatisfactory for availability /
> performance, as the elapsed wall clock time for the system to
> settle down and be functional and responsive is approximately 13
> minutes.
[...]
> The best "performance" I have yet obtained on this single core
> system is [by] [3] [which waits 20s before trying to obtain the
> lock].

Is 20s necessary?  There are a lot of runners, you may be able to cut
the startup time by 2-3 minutes if you can cut that to 5s or less.
That would still be 100x longer than the naive #cores * seconds
estimate, but perhaps significant.
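
The arithmetic behind the 2-3 minute guess, as a sketch: it assumes the
waits are serialized, one per runner, and the runner counts below are
illustrative since the thread does not state how many runners are
configured.

```python
def saving_seconds(n_runners, old_wait_s, new_wait_s):
    """Wall-clock time saved if each runner's pre-lock wait is shortened
    and the waits happen one after another."""
    return n_runners * (old_wait_s - new_wait_s)

# Cutting the wait from 20s to 5s saves 15s per runner.
for n in (8, 12):
    print(n, "runners:", saving_seconds(n, 20, 5) / 60, "minutes saved")
```

Under those assumptions, 8 to 12 runners give a saving of 2 to 3 minutes
out of the reported 13.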
Steve

Stephen J. Turnbull