Outgoing runner lockup? + "slice inspector" tool
Hi,
I'm advising a user on a system which experiences occasional lockups in an outgoing runner. The runner contacts a remote smarthost, and occasionally locks up with the SMTP connection still open according to "lsof -i :25". This state seems to be permanent: the backlog for that slice keeps growing for at least an hour, with no sign of recovery. When this was happening once a month, they would just restart Mailman core, and the backlog would clear in minutes. But recently it's been happening daily, which is distracting their staff and worries me.
Another thing that is strange about this site is that it should be possible to hit that runner with a SIGUSR1 and restart it. This works for me, but on that system the stuck runner exits, but does not restart.
Since normally it only happens to one runner of several, I wanted to identify the runner and process. I attach the tool I developed, for anyone who might have configured multiple runners and is interested to see the distribution across runners.
Has anybody seen an outgoing runner lock up with an open smtp session? Any ideas on why?
My analysis so far:
- Because a restart works every time, I'm pretty sure it doesn't have anything to do with message content.
- I believe both the Mailman host and the outgoing smarthost are Linodes, in the same datacenter. The problematic Mailman system is Mailman 3.3.6, Python 3.10 on Ubuntu 16.xx LTS. I believe the smarthost is a more recent Ubuntu LTS, probably running Postfix 3.7 as the MTA. (Yes, they have sufficiently paranoid security and QA teams. ;-)
- We're using smtplib, which as far as I can tell basically has a 60s timeout for each command. Thus you'd think it would time out. I guess it could be inflooping on timeout, retry, timeout, retry, but I don't know how to check that. Maybe ss(8) will serve?
- My system where SIGUSR1 works as documented is Python 3.11.2 on a Digital Ocean droplet with Debian 12.9, Linux 6.1.0.
- I haven't tried to reproduce the Python 3.10 + Mailman 3.3.6 configuration and test SIGUSR1 in that configuration yet. Seems unlikely, waiting for the proverbial "round tuit".
- It has occurred to us to use the local Postfix as the relay MTA. I'm waiting on a report whether that alleviates the problem. Even if so, I want to fix the underlying defect if possible.
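One way to check the timeout behavior independently of Mailman is a small probe against the smarthost. This is only a sketch: probe_smtp is a hypothetical helper name, the host name in the usage note is a placeholder, and it passes its own timeout to smtplib rather than whatever value Mailman actually configures. Note that the timeout applies to each blocking socket operation, not to the session as a whole, so a peer that keeps sending a little data could in principle hold a session open much longer than the nominal timeout.

```python
import smtplib
import time


def probe_smtp(host, port=25, timeout=10.0):
    """Connect, read the banner, send NOOP; return (ok, detail, seconds).

    Hypothetical helper for debugging, not part of Mailman.
    """
    start = time.monotonic()
    try:
        # The timeout is applied to every blocking socket operation
        # (connect, each read), not to the whole session.
        with smtplib.SMTP(host, port, timeout=timeout) as smtp:
            code, _msg = smtp.noop()
            return True, f"NOOP -> {code}", time.monotonic() - start
    except OSError as exc:
        # smtplib's exceptions (e.g. SMTPServerDisconnected) and socket
        # timeouts are all subclasses of OSError.
        detail = f"{type(exc).__name__}: {exc}"
        return False, detail, time.monotonic() - start
```

Running something like probe_smtp("smarthost.example.org", timeout=10) from the Mailman host while a runner is stuck would at least show whether new connections to the smarthost still get a banner and a reply within the timeout.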
Any ideas would be welcome, including general debugging advice.
Here's the promised slice_inspector tool. (Patches and suggestions welcome, though I don't promise to implement any time soon.) The tool imports some files from Mailman core, click, and psutil (not psutils!). I believe the latter two modules are required by Mailman, so you should be able to run it as is under the 'mailman' user (or perhaps 'list' on Debian) in the environment Mailman itself uses. It defaults to assuming that mailman.cfg is /etc/mailman3/mailman.cfg. You'll probably need to fix the shebang if you want to chmod +x. The '--help' output should be pretty self-explanatory.
Steve
On 2025-02-01 15:23:31 +0000 (+0000), Stephen J. Turnbull wrote: [...]
Another thing that is strange about this site is that it should be possible to hit that runner with a SIGUSR1 and restart it. This works for me, but on that system the stuck runner exits, but does not restart. [...]
- I haven't tried to reproduce the Python 3.10 + Mailman 3.3.6 configuration and test SIGUSR1 in that configuration yet. Seems unlikely, waiting for the proverbial "round tuit". [...] Any ideas would be welcome, including general debugging advice. [...]
It's a bit of hacking, but in other Python-based daemons I work on we've implemented a debug signal handler that can produce tracebacks or thread dumps from the running process. Finding out what the interpreter is executing at a particular point in time can provide some useful insight into where the underlying bug might be hiding.
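As a rough illustration of that idea using only the standard library, the faulthandler module can dump the Python stack of every thread when a signal arrives. This is a minimal sketch, not what those other daemons actually do: install_traceback_dump is a hypothetical name, the choice of SIGUSR2 assumes the process does not already use that signal, and where to call it in a Mailman runner is left open. It is not available on Windows.

```python
import faulthandler
import signal
import sys


def install_traceback_dump(signum=signal.SIGUSR2, stream=sys.stderr):
    """On signum, write the Python stack of every thread to stream.

    Hypothetical helper. SIGUSR2 is an assumption; pick a signal the
    daemon does not already handle. The stream must stay open for the
    life of the process, because faulthandler keeps its file descriptor.
    """
    faulthandler.register(signum, file=stream, all_threads=True, chain=False)
```

With this installed at startup, "kill -USR2 <pid>" on a stuck process should write one traceback per thread to the chosen stream, which would show whether the runner is blocked in a socket read, looping in retry logic, or somewhere else.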
-- Jeremy Stanley