out queue backlog and delivery rate
Our upstream mail protection gateway had an outage for a few hours. As a result, our Mailman system was inundated with hundreds of messages within half an hour.
Mail delivery is continuing at a snail's pace and is now 4 hours behind.
I noticed that the system isn't overloaded on either CPU or memory. It's just slowly "working on it".
Are there rate limits that can be tuned for Mailman queues such as the "out" queue? Could this be I/O related somehow? I'd expect a CPU spike in that case.
Please advise. Thanks in advance.
Dan Caballero
Systems Administrator
Academic Computing Solutions
IMSS - Caltech
https://imss.caltech.edu
On 3/25/22 16:17, dancab@caltech.edu wrote:
Mail delivery is continuing at a snail's pace and is now 4 hours behind.
I noticed that the system isn't overloaded on either CPU or memory. It's just slowly "working on it".
Are there rate limits that can be tuned for Mailman queues such as the "out" queue? Could this be I/O related somehow? I'd expect a CPU spike in that case.
There are things that can be done. One big performance killer is MTA recipient checks during SMTP from Mailman. Look too at the various hits at <https://wiki.list.org/?action=fullsearch&value=performance&titlesearch=Titles>. These are written for older Mailman, but the MTA advice is relevant.
One thing you can do is enable an alternate SMTPD port in the MTA with minimal checking and configure Mailman to use that port.
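One way to set this up (a sketch, assuming Postfix; the port number 8025 and the exact restriction choices are assumptions, not from the thread) is to define an extra smtpd service in master.cf that skips milters and content filtering, and point Mailman's [mta] section at it:

```ini
# master.cf: alternate smtpd on localhost:8025 with minimal checking,
# intended only for submissions from Mailman on the same host
127.0.0.1:8025 inet n - y - - smtpd
  -o smtpd_recipient_restrictions=permit_mynetworks,reject
  -o smtpd_client_restrictions=
  -o smtpd_helo_restrictions=
  -o smtpd_milters=
  -o content_filter=

# mailman.cfg: deliver via the alternate port instead of port 25
[mta]
smtp_host: 127.0.0.1
smtp_port: 8025
```

Restricting the service to 127.0.0.1 and rejecting anything outside mynetworks matters here, since this listener deliberately bypasses the usual checks.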
--
Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
"The highway is for gamblers, better use your sense" - B. Dylan
Thanks again, Mark. I'll look into tuning Postfix to speed things up a bit.
Mark, we've been testing some adjustments to Postfix. We're especially overwhelmed when several large lists are being processed within the space of 10-30 minutes. You can see the added delay in the log output below when 2 distinct messages arrive within a couple of minutes of each other. The second message ended up being delayed by over 20 minutes. We have multiple lists with over 500 members, so we've had scenarios where the backlog quickly increases to an hour.
Any thoughts?
Apr 05 20:09:14 2311185cccaa postfix/cleanup[889]: 6EC481C0EA7: message-id=<17b327d5-f77b-edf1-fdbc-49367a94b511@caltech.edu>
Apr 05 20:17:17 2022 (269) <17b327d5-f77b-edf1-fdbc-49367a94b511@caltech.edu> smtp to marketplace@caltech.edu for 1615 recips, completed in 477.0642318725586 seconds

Apr 05 20:11:42 2311185cccaa postfix/cleanup[889]: E8DF61C0EA7: message-id=<606c4be2-1b74-0d00-77b9-49bf07e696d3@caltech.edu>
Apr 05 20:42:10 2022 (269) <606c4be2-1b74-0d00-77b9-49bf07e696d3@caltech.edu> smtp to marketplace@caltech.edu for 1615 recips, completed in 1487.8184821605682 seconds
On 4/6/22 16:14, dancab@caltech.edu wrote:
Mark, we've been testing some adjustments to Postfix. We're especially overwhelmed when several large lists are being processed within the space of 10-30 minutes. You can see the added delay in the log output below when 2 distinct messages arrive within a couple of minutes of each other. The second message ended up being delayed by over 20 minutes. We have multiple lists with over 500 members, so we've had scenarios where the backlog quickly increases to an hour.
Any thoughts?
Slice the out queue: at least 4 slices.
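In Mailman 3 this is done in mailman.cfg by raising the outgoing runner's instance count (a sketch; the value 4 follows the suggestion above, and note that Mailman expects the count to be a power of 2, since slicing divides the hash space):

```ini
# mailman.cfg: run 4 outgoing runner slices instead of the default 1
[runner.out]
class: mailman.runners.outgoing.OutgoingRunner
instances: 4
```

A restart of Mailman is needed for the new runner processes to be started.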
Mark,
We're trying to alleviate these surges by increasing the number of in and out runners. Let me know what you think.
root@2311185cccaa:/var/www# ps aux |egrep runner.'(in|out)'
mailman  26718  0.9  0.4  95080 73980 ?  S  14:21  0:03 /opt/mailmanve/bin/python3 /opt/mailmanve/bin/runner -C /opt/mailmanve/lib/python3.9/site-packages/mailman/config/mailman.cfg --runner=in:0:2
mailman  26719  0.7  0.4  92300 71516 ?  S  14:21  0:02 /opt/mailmanve/bin/python3 /opt/mailmanve/bin/runner -C /opt/mailmanve/lib/python3.9/site-packages/mailman/config/mailman.cfg --runner=in:1:2
mailman  26722  8.6  0.4  99840 78904 ?  S  14:21  0:28 /opt/mailmanve/bin/python3 /opt/mailmanve/bin/runner -C /opt/mailmanve/lib/python3.9/site-packages/mailman/config/mailman.cfg --runner=out:0:2
mailman  26723  1.4  0.4  94628 75072 ?  S  14:21  0:04 /opt/mailmanve/bin/python3 /opt/mailmanve/bin/runner -C /opt/mailmanve/lib/python3.9/site-packages/mailman/config/mailman.cfg --runner=out:1:2
For anyone coming across this thread: it seems increasing the runners did the job for us. The smtp.log now shows smaller lists being completely processed while larger lists are still in the process of sending to members.
We edited mailman.cfg to include the following, then ran mailman stop; mailman start:
[runner.in]
class: mailman.runners.incoming.IncomingRunner
instances: 2

[runner.out]
class: mailman.runners.outgoing.OutgoingRunner
instances: 2
On 4/8/22 09:47, dancab@caltech.edu wrote:
For anyone coming across this thread: it seems increasing the runners did the job for us. The smtp.log now shows smaller lists being completely processed while larger lists are still in the process of sending to members.
We edited mailman.cfg to include the following, then ran mailman stop; mailman start:
[runner.in]
class: mailman.runners.incoming.IncomingRunner
instances: 2
There's probably no issue that would be helped by increasing in runner instances.
[runner.out]
class: mailman.runners.outgoing.OutgoingRunner
instances: 2
I would suggest a larger value: 4, or even 8.
This is not a single queue/multiple server setup. It is multiple queue/multiple server. I.e., the queue entries are named based on a hash of the message, and each instance processes its own slice of the hash space. So, with 2 slices and a large-recipient message being processed, there's a 50% chance that the next message will wind up waiting behind it.
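The slicing behavior described above can be illustrated with a small sketch (illustrative only; `slice_for` is a hypothetical helper, and Mailman's real scheme partitions the hash space by leading bits rather than taking a modulus, but the effect is the same: each queue entry lands deterministically in exactly one slice):

```python
import hashlib

def slice_for(message_id: str, num_slices: int) -> int:
    # Assign a queue entry to a runner slice by hashing its message-id
    # and reducing the hash into the range [0, num_slices).
    digest = hashlib.sha1(message_id.encode()).hexdigest()
    return int(digest, 16) % num_slices

# With 2 slices, roughly half of all messages share a slice with any
# given large-recipient message and must wait behind it; with 8 slices
# that drops to about 1 in 8.
ids = [f"<msg-{i}@example.com>" for i in range(1000)]
share = sum(1 for i in ids if slice_for(i, 2) == 0) / len(ids)
print(share)  # close to 0.5 with 2 slices
```

This is why adding slices helps even when the host is not CPU-bound: a long-running delivery only blocks the messages that hash into its own slice.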
participants (2):
- dancab@caltech.edu
- Mark Sapiro