Nelson Strother writes:
From my observations, mailman 3.3.8 is unreliable when running on a single core system.
"A single core system" == "your single core system", right? Have you confirmed this on other single core systems? I can't confirm at the moment; I don't think I have access to Mailman running on a single core system. (I guess Linux provides ways to lock a process group to a single core, but the RDBMS is probably an independent process group, and that probably matters. It will be a while before I can confirm.)
How about for other versions of Mailman 3?
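For anyone who wants to try reproducing this on a multicore box, here is a minimal sketch of the kind of pinning I mean. It assumes Linux and Python's os.sched_setaffinity; the function name pin_to_one_core is made up for illustration and is not part of Mailman. Child processes started after the call inherit the restriction, but a database server started separately does not, which is the caveat above.

```python
import os

def pin_to_one_core():
    """Restrict this process, and children started afterwards, to one CPU.

    Hypothetical helper for experiments; not part of Mailman.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("CPU affinity control is not available on this platform")
    # Pick a CPU we are already allowed to run on (pid 0 means "this process").
    cpu = min(os.sched_getaffinity(0))
    os.sched_setaffinity(0, {cpu})
    # Return the resulting set of allowed CPUs so the caller can check it.
    return os.sched_getaffinity(0)
```

Calling this at the top of a small wrapper script and then launching the master process from that wrapper should keep the master and its runners on one core. Whether that is enough to reproduce the problem is an open question, since the RDBMS would still be free to use the other cores.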
In either case, mailman status provides the same information, with no hint of missing or damaged capabilities when one or more of the runner processes are stillborn.
If so, we should fix mailman status. Presumably there's also a chance for the runner to detect and log the expired lock before it dies.
One potential improvement would be to increase the lifetime of the locks used by each of the runner processes during mailman start.
The resource being locked is the core database. One guess about the difference with multicore systems is that on them the RDBMS runs on a different core from Mailman, but that doesn't explain why runner initialization takes so long that locks expire. If a #cores > 1 system initializes in second(s), a 1 core system "should" initialize in #cores * seconds at most.
So I guess something else is accessing the database and interfering with the runners (or vice versa). It's possible that changing to a different RDBMS (or a different transaction granularity) will alleviate the problem. (This is not a fix because somebody isn't locking when they should, or perhaps using the wrong lock. But it might help.)
Another workaround would be to have the runners start synchronously (i.e., have the master process wait on each one to signal initialization complete before starting the next one), assuming the current approach is "fork and forget".
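To make the synchronous idea concrete, here is a rough sketch using only the Python standard library. It is not how Mailman's master actually works (I haven't checked); the child script, the "ready" handshake on stdout, and the function name start_synchronously are all invented for illustration.

```python
import subprocess
import sys

# Hypothetical stand-in for a runner: "initialize", announce readiness on
# stdout, then carry on with some work.
CHILD = (
    "import time\n"
    "time.sleep(0.05)             # pretend to initialize\n"
    "print('ready', flush=True)   # tell the master we are up\n"
    "time.sleep(0.2)              # pretend to do work\n"
)

def start_synchronously(names):
    """Start one child per name, waiting for each to report ready."""
    started = []
    for name in names:
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD],
            stdout=subprocess.PIPE,
            text=True,
        )
        # Blocks until the child prints a line or exits (EOF reads as '').
        line = proc.stdout.readline().strip()
        if line != "ready":
            proc.wait()
            raise RuntimeError(f"{name} died during startup")
        started.append((name, proc))
    return started

if __name__ == "__main__":
    # The names are just labels here.
    for name, proc in start_synchronously(["in", "out", "pipeline"]):
        proc.wait()
        print(name, "exit code", proc.returncode)
```

A side benefit of this shape is that a stillborn child is noticed immediately by the master, instead of being discovered later. A real implementation would also want a timeout on the wait, which this sketch omits.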
But it still seems unsatisfactory for availability / performance, as the elapsed wall clock time for the system to settle down and be functional and responsive is approximately 13 minutes. [...] The best "performance" I have yet obtained on this single core system is [by] [3] [which waits 20s before trying to obtain the lock].
Is 20s necessary? There are a lot of runners, so you may be able to cut the startup time by 2-3 minutes if you can cut that wait to 5s or less. That would still be 100x longer than the naive #cores * seconds estimate, but perhaps significant.
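The arithmetic behind that 2-3 minute guess, as a quick sketch. It assumes the waits happen one after another and that there are somewhere between 8 and 12 runners; I haven't counted them on your system, so treat the runner counts as placeholders.

```python
def startup_saving(runners, old_wait=20, new_wait=5):
    """Seconds saved if each runner's pre-lock wait drops from old_wait to new_wait.

    Assumes the waits are serial, one runner after another.
    """
    return runners * (old_wait - new_wait)

# Runner counts are assumptions, not measurements.
for n in (8, 10, 12):
    s = startup_saving(n)
    print(f"{n} runners: {s} s saved ({s / 60:.1f} min)")
```

If the waits overlap rather than run serially, the saving would be smaller than this.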
Steve