
I see Mark answered some of the same questions already, but it would be really painstaking to avoid duplication (it took more than an hour to write this :-), so I'm just gonna scan it quickly and then send.
Stephan Krinetzki writes:
My queues:
(Omitting the empty ones.)
/opt/mailman/var/queue/out:
total 2868
drwxrwx--- 2 mailman mailman 4096 Jul 31 13:26 .
drwxr-xr-x 14 mailman mailman 165 Jun 27 2024 ..
-rw-rw---- 1 mailman mailman 221708 Jul 30 13:16 1753874180.0845337+0cc7849043859a79dc3678a0d8b63c1c66df0c66.pck.tmp
-rw-rw---- 1 mailman mailman 733425 Jul 31 00:00 1753912841.8847518+da55f8789ae41b75d19a02b595e3fd6d45983ade.pck.tmp
-rw-rw---- 1 mailman mailman 17758 Jul 31 13:26 1753961185.0481033+6adf966467266567275f1146f5054c95e4365c13.pck
Those two .pck.tmp files are bad news. They indicate that Mailman was trying to do something with those messages, and the process was interrupted. You should check whether there was a Mailman restart at those times (although it should shut down gracefully and not leave .tmp files behind), or whether a runner crashed.
The .tmp files *may* be deliverable, but you'd have to look at them to be sure that they are complete. It's possible that they have been delivered already and the .tmp files just need to be removed. You can look at them with "mailman qfile" same as always; "qfile" doesn't check the filename extension. If they haven't been delivered and a careful check shows they're intact, just renaming them without the .tmp will cause them to go to the head of the queue.
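To find every leftover .tmp file across the queues in one pass, something like the following sketch might help. It only lists candidates for inspection with "mailman qfile"; it does not rename or delete anything. The function name, the one-hour threshold, and the assumption that the queues are direct subdirectories of a single root (as in /opt/mailman/var/queue above) are mine.

```python
import time
from pathlib import Path


def find_stale_tmp(queue_root, min_age_seconds=3600, now=None):
    """Return (path, age_seconds, size_bytes) for each old *.tmp queue file.

    Assumes every queue (out, shunt, virgin, ...) is a direct
    subdirectory of queue_root. Nothing is modified.
    """
    now = time.time() if now is None else now
    stale = []
    for path in sorted(Path(queue_root).glob("*/*.tmp")):
        st = path.stat()
        age = now - st.st_mtime
        # A .tmp file that is seconds old may just be a write in progress;
        # only report the ones that have clearly been abandoned.
        if age >= min_age_seconds:
            stale.append((path, age, st.st_size))
    return stale
```

Calling find_stale_tmp("/opt/mailman/var/queue") and printing the result should show the files that deserve a look before anything is renamed or removed.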
It's also odd that the .pck above precedes the .bak below (unless you have multiple slices for the out queue?)
The rest of the queue looks normal, except that it seems rather long. I only see queues that long when the outgoing MTA is borked. (Although my experience with high-traffic systems is restricted to helping a couple of folks for whom there was zero cost to adding CPUs and memory to their VMs, I feel better that Mark picked up on this too.) You might want to reconfigure Mailman to use more out slices, but that depends on what else your MTA should be using its bandwidth for. Because of the way the slicing algorithm works, the number of slices needs to be a power of 2, so the number of simultaneous connections Mailman makes to the MTA will double. (I don't think there's any point to more than 4 slices unless you're doing more than one incoming post/second.)
-rw-rw---- 1 mailman mailman 31122 Jul 31 13:26 1753961185.1064982+b1f502e2af56b9b11680135c6de5fcc5285d967e.bak
This .bak file is currently being processed by Mailman, it's normal. The rest of the .pck files are also normal, just waiting. (Omitting the rest of the out queue listing.)
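On the slicing question, here is a rough sketch of how splitting the hash space into equal ranges could assign queue files to slices. Both the 160-bit digest width (the names above carry 40 hex digits after the "+") and the equal-range split are my assumptions about how slicing works; this is an illustration, not Mailman's actual code, and slice_for is a made-up name.

```python
HASH_BITS = 160  # assumption: 40 hex digits after the "+" in the file name


def slice_for(filename, numslices):
    """Return the slice index a queue file would fall into.

    Assumes slice k owns the k-th of numslices equal-width ranges of
    the digest space. Illustration only.
    """
    digest = filename.split("+", 1)[1].split(".", 1)[0]
    # floor(digest * numslices / 2**HASH_BITS)
    return (int(digest, 16) * numslices) >> HASH_BITS


pck = "1753961185.0481033+6adf966467266567275f1146f5054c95e4365c13.pck"
bak = "1753961185.1064982+b1f502e2af56b9b11680135c6de5fcc5285d967e.bak"
print(slice_for(pck, 2), slice_for(bak, 2))
```

Under those assumptions the waiting .pck and the in-progress .bak would belong to different slices as soon as there are two or more of them, which would account for the ordering. With a single slice both land in slice 0 and the ordering stays odd.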
/opt/mailman/var/queue/shunt:
total 3304
drwxrwx--- 2 mailman mailman 4096 Jul 31 11:44 .
drwxr-xr-x 14 mailman mailman 165 Jun 27 2024 ..
-rw-rw---- 1 mailman mailman 451 Jul 31 00:00 1753912822.2651796+28eceef7e18eb70393377b88dc7117af8f9362a0.pck
-rw-rw---- 1 mailman mailman 490 Jul 31 00:00 1753912838.4197352+ea531cf0262c1faa58b1679b907fee92bc16822c.pck
-rw-rw---- 1 mailman mailman 1407870 Jul 31 00:00 1753912841.9177196+ccea15bdefce3a54301281c8eddf86e8230244a6.pck
-rw-rw---- 1 mailman mailman 86108 Jul 31 00:00 1753912841.9197443+7dcef4febc71e44c6d9309a24a08b08753e1ff42.pck
-rw-rw---- 1 mailman mailman 1407668 Jul 31 00:00 1753912841.9849963+a3f1869b750060c97262ece38737480d91652828.pck
-rw-rw---- 1 mailman mailman 38992 Jul 31 00:01 1753912860.5167956+940584c4f361cbd8c29e390b2f60590558effe40.pck
-rw-rw---- 1 mailman mailman 440 Jul 31 00:01 1753912860.6972685+635c065bac8dff5f9d562275d707001d773b84c1.pck
-rw-rw---- 1 mailman mailman 445 Jul 31 00:01 1753912868.7903054+befa066f254d7a3529a8555a6c942a554715d837.pck
-rw-rw---- 1 mailman mailman 33494 Jul 31 00:01 1753912878.7562895+82b724fd93260ab9a2bb49709d3a42a2f32f2c80.pck
-rw-rw---- 1 mailman mailman 217073 Jul 31 00:01 1753912878.9337828+de73a65d9c6febfa80275853921f4b53fd1d9e2a.pck
-rw-rw---- 1 mailman mailman 85888 Jul 31 00:02 1753912950.303359+8a87bce0be63ac1df8493c6b1ad6ae154fcedba7.pck
-rw-rw---- 1 mailman mailman 50244 Jul 31 00:02 1753912950.4970112+d44d493912bb3024547b8a5112f86f035dcb352f.pck
-rw-rw---- 1 mailman mailman 12887 Jul 31 00:02 1753912970.038427+31bcbd7fb2ebdf81f6de24b7283b50bcda6ded21.pck.tmp
-rw-rw---- 1 mailman mailman 443 Jul 31 11:44 1753955094.900898+67bc76525412da66a7c76363f65f583989716305.pck
/opt/mailman/var/queue/virgin:
total 32
drwxrwx--- 2 mailman mailman 81 Jul 31 13:17 .
drwxr-xr-x 14 mailman mailman 165 Jun 27 2024 ..
-rw-rw---- 1 mailman mailman 32013 Jan 11 2025 1736550035.6163204+472f81ece5e45a2651a4499bef418f611b43c619.pck.tmp
Nothing special there (shunt should be checked, but not in correlation with my mail).
I tend to disagree, as the first series of shunt files ends with a .tmp. There's another one of those .tmp files in virgin, and it's 6 months old. Hmmm, that one is *also* on the hour. You got lots of cron jobs that run on the hour, maybe?
You're probably right that there's no correlation, but you can't trust the dates from ls -l or stat because when running "mailman unshunt" all of the queue files in shunt will get "touched" if they're not sent. (If I recall correctly.) The fact that a spate of timestamps occurs right at 00:00 means either there's a cron job running unshunt then, or you have a spammer or similar sending a bunch of broken mail to you on the hour. (I say you're probably right because the timestamp in the name decodes to the same time, and I don't think that changes when unshunt is run.) And again, you have a stale .tmp file there, which means something bad happened, most likely not under Mailman's control.
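For anyone who wants to repeat the comparison, the part of the file name before the "+" reads as seconds since the Unix epoch, so it can be decoded independently of the file's mtime. A minimal sketch (queued_at is just a name I made up for this):

```python
from datetime import datetime, timezone


def queued_at(filename):
    """Decode the leading epoch timestamp of a queue file name as UTC."""
    stamp, _, _digest = filename.partition("+")
    return datetime.fromtimestamp(float(stamp), tz=timezone.utc)


name = "1753912841.9177196+ccea15bdefce3a54301281c8eddf86e8230244a6.pck"
print(queued_at(name).isoformat())
```

For that shunt entry the result is about 22:00:41 UTC on 30 July 2025, which would be 00:00:41 on 31 July if the server runs in a UTC+2 timezone (a guess on my part), agreeing with the "Jul 31 00:00" that ls shows.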
There is (maybe was, by now?) a bug in the logging such that logs did not get properly rotated. Many sites dealt with this by restarting Mailman with the same period as the log rotation. Do you do that? (I'm just fishing, I don't know how it could cause the main issue you are seeing.)
mailq is empty, so my postfix works as expected.
Hm. At those "high traffic" sites I mentioned, it was the other way around: with 4 (or 8) "out" slices, the out queue would be clear >80% of the time (according to "while true; do ls -l $OUTQUEUE; sleep 5; done", nothing sophisticated). But the MTA's mail queue would typically backlog many minutes. As I said, the Mailman hosts at those sites were insanely overpowered, so your mileage will vary.
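If you'd rather have counts than a raw listing, the same kind of polling can be sketched in Python. The function names, the 5-second interval, and the grouping by file name ending are my choices, not anything Mailman provides.

```python
import time
from collections import Counter
from pathlib import Path


def queue_summary(queue_dir):
    """Count the files in one queue directory by file name ending."""
    counts = Counter()
    for entry in Path(queue_dir).iterdir():
        if not entry.is_file():
            continue
        # Check ".pck.tmp" before ".pck" so stale temp files get their own bucket.
        for ending in (".pck.tmp", ".pck", ".bak"):
            if entry.name.endswith(ending):
                counts[ending] += 1
                break
        else:
            counts["other"] += 1
    return counts


def watch(queue_dir, interval=5, rounds=12):
    """Print a one-line summary every `interval` seconds, `rounds` times."""
    for _ in range(rounds):
        print(time.strftime("%H:%M:%S"), dict(queue_summary(queue_dir)))
        time.sleep(interval)
```

Running watch("/opt/mailman/var/queue/out") for a while should show whether the .pck count drains to zero between posts or keeps growing.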
I have to think that there is a problem in the handoff between Mailman and the MTA. Why Mailman is not preserving the queue file, or alternatively logging a successful delivery to the MTA, I don't have any idea offhand. I have to think the queue runner is crashing, but that doesn't explain why this happens only to certain lists.
Is Postfix delivering to the final destination itself, or does it pass on the messages to a smarthost? Is Mailman talking to the local MTA, or is it possibly talking to an MTA on a different node? I did have to diagnose a problem once where a system was misconfigured, and Mailman was talking not to the local Postfix but to a Postfix in a datacenter a megameter or so away! (That didn't lose any mail, but the connection would occasionally freeze and not time out, leading to a huge build up in the out queue.) Anyway, if Mailman isn't talking to an MTA with a <50ms ping time, you could try changing the configuration so it does.
I don't put much stock in any of the above ideas. I hope that you or somebody come up with better ones!
-- GNU Mailman consultant (installation, migration, customization) Sirius Open Source https://www.siriusopensource.com/ Software systems consulting in Europe, North America, and Japan