[MM3-users] Re: Mailman sends Welcome emails, but does not send [List] postings emails

June 8, 2023

Neither .../logs/mailman.log nor /var/log/syslog contains any record
of "Master detected ..." or of process exits other than at the time of
mailman stop commands. Watching these transitions with vmstat
shows a decrease in free pages and an increase in swapping, but no
definitive sign of the OOM reaper running.
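
For anyone repeating this check, here is a minimal sketch of how syslog could be searched for OOM-killer activity. The patterns are assumptions based on messages the Linux kernel commonly emits; the exact wording varies between kernel versions, and find_oom_lines is a hypothetical helper, not part of Mailman. It can be fed an open file such as /var/log/syslog or any list of lines.

```python
import re

# Assumed kernel message fragments; wording differs between kernel versions.
OOM_PATTERN = re.compile(
    r"invoked oom-killer|Out of memory: Kill(ed)? process|oom_reaper",
    re.IGNORECASE,
)


def find_oom_lines(lines):
    """Return the log lines that look like OOM-killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]
```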

I did not mean to imply that mailman start predictably takes longer
than mailman restart. Both are lengthy and variable. I think one reason
mailman stop sometimes takes longer than usual is that mailman receives
a message while it is trying to stop, which is understandable.

Interestingly, when I use systemctl start mailman, the results so far are:

-- the wall clock duration is shorter than when mailman start is issued
   from the command line, and
-- either all of the runners remain present once they are started,
   or I can see in syslog a traceback from each missing runner process,
   starting from .../mailman/.local/bin/runner and ending with
   "flufl.lock._lockfile.NotLockedError: Already unlocked".

I do not yet understand how to make use of these clues, but at least one
can see an epitaph from each deceased process.
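
A minimal sketch of how those epitaphs could be tallied, assuming each syslog line carries the usual "tag[pid]:" prefix (an assumption about the local syslog format, not something Mailman guarantees). The helper name pids_with_lock_error is hypothetical. It returns the PIDs of processes whose output mentions the NotLockedError quoted above.

```python
import re

ERROR_MARKER = "flufl.lock._lockfile.NotLockedError"
# Assumes the conventional syslog "tag[pid]:" prefix on each line.
PID_PATTERN = re.compile(r"\[(\d+)\]:")


def pids_with_lock_error(lines):
    """Return the sorted PIDs whose log lines mention NotLockedError."""
    pids = set()
    for line in lines:
        if ERROR_MARKER in line:
            match = PID_PATTERN.search(line)
            if match:  # lines without a recognizable PID are skipped
                pids.add(int(match.group(1)))
    return sorted(pids)
```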

On 6/6/23 14:47, Mark Sapiro wrote:
...
On 6/6/23 07:58, Nelson Strother wrote:
...
No errors are being recorded in the mailman log files. This is GNU
Mailman 3.3.8 via pip install mailman on Debian 5.10.179-1
(2023-05-12) running on a shared system where VMware gives this
server enough cycles that mailman start and mailman stop each
consume from 20 minutes to an hour of wall clock time, so I do not
issue those commands recreationally, attempting to keep the system
available for users. What should I do to help understand the cause
for these failures?
If a runner has died, its death and a reason should be logged in
Mailman's var/logs/mailman.log with a message similar to
    Master detected subprocess exit (pid: 8617, why: SIGNAL 15, class: in, slice: 1/1)
This may not be the case if the runner is killed by the OS for an out
of memory or similar reason. For this, look in syslog.
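
As a rough sketch, a message of the form quoted above could be picked out of mailman.log with a regular expression. The pattern is inferred only from that single example line, so the field layout is an assumption, and parse_runner_exit is a hypothetical helper rather than anything Mailman provides.

```python
import re

# Inferred from the one example message; the real format may differ.
EXIT_PATTERN = re.compile(
    r"Master detected subprocess exit \("
    r"pid: (?P<pid>\d+),\s+"
    r"why: (?P<why>[^,]+),\s+"
    r"class: (?P<runner>[^,]+),\s+"
    r"slice: (?P<slice>\d+/\d+)\)"
)


def parse_runner_exit(line):
    """Return a dict describing a runner exit, or None if the line does not match."""
    match = EXIT_PATTERN.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["pid"] = int(info["pid"])
    return info
```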
mailman stop can take a long time because it is waiting for a runner
to stop. See https://gitlab.com/mailman/mailman/-/issues/255 but that
issue was fixed long ago. I don't understand why mailman start would
take more time than mailman restart. In fact, mailman restart
effectively does stop and start, but only for those runners which
are running.
Since you seem to frequently have missing runners, I suspect something
like an OOM condition is causing the OS to kill them. Although, I
wonder if you are correctly interpreting the logs. While the absence
of the retry, task, nntp and archive runners might not be
noticed except for messages not being archived, if either the in or
pipeline runner is not running, no list posts will be processed.
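
One way to check which runners are actually present is to look at the process list. The sketch below assumes that each runner process shows a "--runner=NAME:SLICE:COUNT" argument on its command line, which should be verified against the local ps output. The runner names are only those mentioned in this message, not the complete set, and missing_runners is a hypothetical helper. The text to pass in can be captured with, for example, subprocess.run(["ps", "-eo", "args"], capture_output=True, text=True).stdout.

```python
import re

# Assumed form of the runner argument on the process command line.
RUNNER_ARG = re.compile(r"--runner=([a-z]+):\d+:\d+")

# Only the runners named in this thread; a full installation has more.
NAMED_RUNNERS = ("in", "pipeline", "retry", "task", "nntp", "archive")


def missing_runners(ps_output, expected=NAMED_RUNNERS):
    """Return the expected runner names that do not appear in the ps output."""
    running = set(RUNNER_ARG.findall(ps_output))
    return [name for name in expected if name not in running]
```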
...
Would it not be helpful for this limitation of restart to be included
in:
mailman restart --help
with a suggestion to use mailman stop and mailman start instead?
I have just filed https://gitlab.com/mailman/mailman/-/issues/1082 for
this.