Neither .../logs/mailman.log nor /var/log/syslog contains any record
of "Master detected ..." or of process exits, other than at the time
of mailman stop commands. Watching with vmstat during these
transitions shows a decrease in free pages and an increase in
swapping, but no definitive hint of the OOM reaper running.
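
For anyone who wants to repeat this check mechanically, here is a minimal sketch of the kind of scan I mean. The first pattern is modelled on the example message in Mark's reply quoted below. The second is my assumption about how the kernel words an OOM kill in syslog and may differ between kernel versions. The function names are made up for illustration, and the log paths are passed on the command line rather than guessed.

```python
import re
import sys

# Modelled on the example quoted in this thread:
# Master detected subprocess exit (pid: 8617, why: SIGNAL 15, class: in, slice: 1/1)
EXIT_RE = re.compile(
    r"Master detected subprocess exit "
    r"\(pid: (?P<pid>\d+), why: (?P<why>[^,]+), "
    r"class: (?P<runner>[^,]+), slice: (?P<slice>\d+/\d+)\)"
)

# Assumption: the kernel reports OOM kills with wording like
# "Out of memory: Killed process 1234 (name)"; older kernels say "Kill process".
OOM_RE = re.compile(
    r"Out of memory: Kill(?:ed)? process (?P<pid>\d+) \((?P<name>[^)]*)\)"
)


def find_runner_exits(lines):
    """Return one dict per 'Master detected subprocess exit' line."""
    return [m.groupdict() for m in map(EXIT_RE.search, lines) if m]


def find_oom_kills(lines):
    """Return one dict per line that looks like a kernel OOM kill."""
    return [m.groupdict() for m in map(OOM_RE.search, lines) if m]


if __name__ == "__main__":
    # Usage (hypothetical script name): python3 scan_logs.py LOGFILE [LOGFILE ...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            lines = fh.readlines()
        for hit in find_runner_exits(lines):
            print(path, "runner exit:", hit)
        for hit in find_oom_kills(lines):
            print(path, "possible OOM kill:", hit)
```

An empty result from both functions matches what I described above: nothing recorded apart from the exits at mailman stop time.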

I did not intend to imply that mailman start (predictably) consumes
more time than mailman restart. Both are lengthy and variable; I think
one cause for mailman stop consuming more time than usual is when
mailman receives a message as it is trying to stop, which is
understandable.

Interestingly, if I use systemctl start mailman, thus far the results
are:
- the wall clock duration is shorter than when mailman start is issued
  from the command line, and
- either
  - all of the runners remain present once they are started, or
  - I can see in syslog a traceback from each missing runner process,
    starting from .../mailman/.local/bin/runner and ending with
    "flufl.lock._lockfile.NotLockedError: Already unlocked".
I do not yet understand how to make use of these clues, but at least one can see an epitaph from each deceased process.
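
To count these epitaphs instead of reading syslog by eye, something like the following sketch could be used. It assumes the traditional syslog line layout, where a "tag[pid]:" field precedes the message, which may not match every syslog configuration. The function name is hypothetical, and the only string taken from the real logs is the exception name above.

```python
import re
import sys
from collections import Counter

# The exception name that ends each traceback, as seen in syslog.
MARKER = "flufl.lock._lockfile.NotLockedError"

# Assumption: traditional syslog layout "Mon DD HH:MM:SS host tag[pid]: message".
TAG_RE = re.compile(r"\s(?P<tag>[^\s\[\]:]+)\[(?P<pid>\d+)\]:")


def not_locked_errors(lines):
    """Count NotLockedError lines per (syslog tag, pid) pair."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        m = TAG_RE.search(line)
        # Lines without a recognisable tag[pid] field are grouped together.
        key = (m["tag"], int(m["pid"])) if m else ("unknown", 0)
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Usage (hypothetical script name): python3 count_epitaphs.py /var/log/syslog
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for (tag, pid), n in sorted(not_locked_errors(fh).items()):
                print(f"{tag}[{pid}]: {n} NotLockedError line(s)")
```

Each distinct pid in the output should correspond to one deceased runner, which can then be compared against the list of runners that are still present.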

On 6/6/23 14:47, Mark Sapiro wrote:
> On 6/6/23 07:58, Nelson Strother wrote:
>> No errors are being recorded in the mailman log files. This is GNU
>> Mailman 3.3.8 via pip install mailman on Debian 5.10.179-1
>> (2023-05-12) running on a shared system where VMware gives this
>> server enough cycles that mailman start and mailman stop each
>> consume from 20 minutes to an hour of wall clock time, so I do not
>> issue those commands recreationally, attempting to keep the system
>> available for users. What should I do to help understand the cause
>> for these failures?
>
> If a runner has died, its death and a reason should be logged in
> Mailman's var/logs/mailman.log with a message similar to
>
>     Master detected subprocess exit (pid: 8617, why: SIGNAL 15, class: in, slice: 1/1)
>
> This may not be the case if the runner is killed by the OS for an out
> of memory or similar reason. For this, look in syslog.
>
> mailman stop can take a long time because it is waiting for a runner
> to stop. See https://gitlab.com/mailman/mailman/-/issues/255 but that
> issue was fixed long ago. I don't understand why mailman start would
> take more time than mailman restart. In fact, mailman restart
> effectively does stop and start, but only for those runners which are
> running.
>
> Since you seem to frequently have missing runners, I suspect
> something like an OOM condition is causing the OS to kill them.
> Although, I wonder if you are correctly interpreting the logs. While
> the absence of the retry, task, nntp and archive runners might not be
> noticed except for messages not being archived, if either the in or
> pipeline runner is not running, no list posts will be processed.
>
>> Would not it be helpful for this limitation of restart to be
>> included in: mailman restart --help with a suggestion to use
>> mailman stop and mailman start instead?
>
> I have just filed https://gitlab.com/mailman/mailman/-/issues/1082
> for this.