Outgoing runner lockup? + "slice inspector" tool

Feb. 1, 2025

Hi,

I'm advising a user on a system which experiences occasional lockups
in an outgoing runner.  The runner contacts a remote smarthost, and
occasionally locks up with the smtp connection still open according to
"lsof -i :25".  This state seems to be permanent: the backlog for that
slice starts to grow without bound, for at least an hour.  When this
was happening once a month, they would just restart Mailman core, and
the backlog would clear in minutes.  But recently it's been happening
daily, which is distracting their staff and worries me.

Another thing that is strange about this site is that it should be
possible to hit that runner with a SIGUSR1 and restart it.  This works
for me, but on that system the stuck runner exits, but does not
restart.

Since normally it only happens to one runner of several, I wanted to
identify the runner and process.  I attach the tool I developed, for
anyone who might have configured multiple runners and is interested to
see the distribution across runners.

Has anybody seen an outgoing runner lock up with an open smtp session?
Any ideas on why?
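
For anyone who wants to check for the same symptom, here is a minimal
sketch (separate from the attached tool) of a scriptable stand-in for
"lsof -i :25".  It assumes psutil is importable in Mailman's
environment and that the script has enough privilege to see other
processes' sockets.  The names smtp_connections and report are
made up for illustration.

```python
def smtp_connections(conns, port=25):
    """Return (pid, status, remote_ip) for connections whose remote port matches.

    `conns` is any iterable of objects shaped like the entries returned by
    psutil.net_connections(): they need .pid, .status and .raddr attributes.
    """
    hits = []
    for c in conns:
        # psutil reports raddr as an empty tuple for sockets with no peer
        # (for example, listening sockets), so skip those.
        if c.raddr and c.raddr.port == port:
            hits.append((c.pid, c.status, c.raddr.ip))
    return hits


def report(port=25):
    """Print one line per matching connection, with the owning command line."""
    import psutil  # third-party; imported lazily so the filter above has no dependency

    for pid, status, ip in smtp_connections(psutil.net_connections(kind="tcp"), port):
        try:
            # pid can be None when the caller lacks permission to see the owner.
            cmd = " ".join(psutil.Process(pid).cmdline()) if pid else "?"
        except psutil.Error:
            cmd = "?"
        print(f"{pid}\t{status}\t{ip}\t{cmd}")
```

Calling report() while a runner is stuck should show which pid still
holds the connection to the smarthost, and in what TCP state.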
My analysis so far:

- Because a restart works every time, I'm pretty sure it doesn't have
  anything to do with message content.
- I believe both the Mailman host and the outgoing smarthost are
  Linodes, in the same datacenter.  The problematic Mailman system is
  Mailman 3.3.6, Python 3.10 on Ubuntu 16.xx LTS.  I believe the
  smarthost is a more recent Ubuntu LTS, probably running Postfix 3.7
  as the MTA.  (Yes, they have sufficiently paranoid security and QA
  teams. ;-)
- We're using smtplib, which as far as I can tell basically has a 60s
  timeout for each command.  Thus you'd think it would time out.  I
  guess it could be inflooping on timeout, retry, timeout, retry, but
  I don't know how to check that.  Maybe ss(8) will serve?
- My system where SIGUSR1 works as documented is Python 3.11.2 on a
  Digital Ocean droplet with Debian 12.9, Linux 6.1.0.
- I haven't tried to reproduce the Python 3.10 + Mailman 3.3.6
  configuration and test SIGUSR1 in that configuration yet.  Seems
  unlikely, waiting for the proverbial "round tuit".
- It has occurred to us to use the local Postfix as the relay MTA.
  I'm waiting on a report whether that alleviates the problem.  Even
  if so, I want to fix the underlying defect if possible.
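
One way to test the timeout assumption outside Mailman is a small
probe that opens an SMTP session with an explicit timeout and reports
how long it took to succeed or fail.  This is only a sketch: the
function name probe_smtp and the 10s default are made up, and it does
not reproduce how Mailman itself configures its connection.

```python
import smtplib
import time


def probe_smtp(host, port=25, timeout=10.0):
    """Open an SMTP session, send NOOP, and return (ok, detail, elapsed_seconds).

    The timeout is passed to smtplib.SMTP explicitly so that a peer which
    accepts the connection but never answers cannot block forever.
    """
    start = time.monotonic()
    try:
        with smtplib.SMTP(host, port, timeout=timeout) as conn:
            code, msg = conn.noop()
            detail = f"{code} {msg.decode(errors='replace')}"
            return True, detail, time.monotonic() - start
    except OSError as exc:
        # smtplib's exceptions derive from OSError, so this also covers
        # SMTPServerDisconnected, which smtplib raises when a read times out.
        return False, f"{type(exc).__name__}: {exc}", time.monotonic() - start
```

Run against the smarthost from the Mailman host, for example with
probe_smtp("smarthost.example.org") (a placeholder name), an elapsed
time far beyond the timeout would suggest the hang is below smtplib.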

Any ideas would be welcome, including general debugging advice.

Here's the promised slice_inspector tool.  (Patches and suggestions
welcome, though I don't promise to implement any time soon.)  The tool
imports some files from Mailman core, click, and psutil (not psutils!).
I believe the latter modules are required by Mailman, so you should be
able to run it as is under the 'mailman' user (or perhaps 'list' on
Debian) in the environment Mailman itself uses.  It defaults to
assuming that mailman.cfg is /etc/mailman3/mailman.cfg.  You'll
probably need to fix the shebang if you want to chmod +x.  The
'--help' should be pretty self-explanatory.

Steve

Stephen J. Turnbull
