parallel processing of out queue messages
Hello,
I have tried increasing the number of runners to alleviate a backlog of messages in the "out" queue. I increased the out runners from 32 to 64.
However, I'm not seeing a significant increase in the processing of "out" messages.
I've seen at most 12 ".bak" files in the out queue after increasing the runner instances.
Is there another configuration needed beyond the number of runner instances to get more messages processed at once?
Thanks as always for your help!
-- Dan
Dan Caballero writes:
I have tried increasing the number of runners to alleviate a backlog of messages in the "out" queue. I increased the out runners from 32 to 64.
How many messages per day are you processing?! How many CPUs do you have? How much memory?
I know a site with 2 in runners and 8 out runners on something basically the same as an 8-CPU, 16GB premium Linode. It processes over 100,000 messages/day. The out queue occasionally gets into double digits, but 99% of the time it's back at 0/1 in 3 seconds. Load average is usually around 5, CPU utilization 50-80%. I'm not sure it needs 2 in runners or 8 out runners, but it definitely needs at least 4 out runners. (We haven't tested a 4-runner configuration since resolving the "runner stall" issue described below. Virtual CPUs are not at a premium at that client.)
Is there another configuration needed beyond the number of runner instances to get more messages processed at once?
No; the slicing algorithm is trivial, and the only upper bound on the number of runners (if you have enough memory ;-) is 2^160.
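For reference, the only knob involved is the per-runner instance count in mailman.cfg. A minimal sketch, assuming Mailman 3's [runner.out] section (the value is illustrative):

    # mailman.cfg -- illustrative value only
    [runner.out]
    instances: 8

Restart Mailman after changing it so the new runner processes actually get started.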
One thing we ran into on the system above was some (still unidentified) issue between Mailman and its SMTP out gateway that caused an out runner to stall for extended periods (sometimes it would restart, often not, nothing interesting in the log). Unfortunately I don't have the monitoring tool we used, but it's easy to create one.
The slicing algorithm is based on the file name, and divides the space of SHA1 hash values into N contiguous regions of equal length.[1] Get a directory listing of the out queue, count each slice's length. If you see one (rarely more) that just keeps increasing, that's it.
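A minimal sketch of such a check, assuming Mailman 3's "<timestamp>+<sha1-hex>.pck" queue file naming and an out queue at /opt/mailman/var/queue/out (adjust both for your installation):

    # Hypothetical monitoring sketch: count out-queue entries per runner slice.
    # Assumes file names of the form "<timestamp>+<40-hex-sha1>.pck" and that
    # N runners split the SHA1 space into N equal contiguous ranges.
    import os
    from collections import Counter

    QUEUE_DIR = "/opt/mailman/var/queue/out"   # adjust to your installation
    NUM_SLICES = 8                             # number of out runners configured
    SHA_MAX = 2 ** 160

    counts = Counter()
    for name in os.listdir(QUEUE_DIR):
        base, ext = os.path.splitext(name)
        if ext not in (".pck", ".bak"):
            continue
        try:
            digest = int(base.rsplit("+", 1)[1], 16)
        except (IndexError, ValueError):
            continue
        counts[digest * NUM_SLICES // SHA_MAX] += 1

    for slice_no in range(NUM_SLICES):
        print("slice %d: %d file(s)" % (slice_no, counts[slice_no]))

Run it every few seconds (cron, watch, whatever); a slice whose count only ever grows is the stalled one.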
You can also use utilities like lsof to keep tabs on the connections to the outgoing SMTP gateway (connections are identified by the source port). If one lasts more than 5 seconds, that's it.
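A rough sketch of that, too, assuming lsof is installed and the gateway is reached on port 25 (the port and the 5-second threshold are illustrative):

    # Hypothetical sketch: poll lsof and flag outgoing SMTP connections that
    # stay open longer than ~5 seconds -- a sign of a stalled out runner.
    import subprocess
    import time

    THRESHOLD = 5.0          # seconds; illustrative
    seen = {}                # source port -> time first observed

    while True:              # Ctrl-C to stop
        out = subprocess.run(["lsof", "-n", "-i", "tcp:25"],
                             capture_output=True, text=True).stdout
        now = time.time()
        current = set()
        for line in out.splitlines()[1:]:
            conn = next((f for f in line.split() if "->" in f), None)
            if conn is None:
                continue
            port = conn.split("->")[0].rsplit(":", 1)[-1]
            current.add(port)
            first = seen.setdefault(port, now)
            if now - first > THRESHOLD:
                print("source port %s open for %.0fs" % (port, now - first))
        seen = {p: t for p, t in seen.items() if p in current}
        time.sleep(1)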
Footnotes: [1] The slicing algorithm is in the __init__ function in mailman/src/mailman/core/switchboard.py.
Yesterday we processed just over 60,000 messages to recipients. Some of our lists have 2000+ subscribers.
90% of the time the system is handling incoming messages and the backlog of .pck files in the out queue doesn't last more than 10-15 minutes. Even if a single message takes longer to process, others move through via other runners.
However, due to the rather complex relay path between our primary mail domain server and the campus relay, there are occasions when messages get significantly delayed before arriving at Mailman.
When that happens, there's a flood of messages and Mailman is unable to keep up and messages may wait to be processed for up to 2 hours.
We currently run Mailman as a Docker container in AWS on a t3.xlarge instance (4 CPUs, 16GB RAM).
So is there a 1:1 relationship between the number of runners and the maximum number of messages that may be processed with .bak files in the out queue?
Thanks!
-- Dan
I think I've gotten some clarification on this from our SMTP server administrator.
We relay the list messages through an external SMTP host. The administrator tried increasing the limit on the number of client connections as they had set it fairly low.
smtpd_client_connection_count_limit
They've increased the threshold and that seems to have helped, as I immediately saw an increase in both the connections shown by lsof -i:25 and the number of .bak files in the out queue directory.
Thanks!!
Answering previous two messages in one reply.
Dan Caballero writes:
Yesterday we processed just over 60,000 messages to recipients. Some of our lists have 2000+ subscribers. We currently run Mailman as a Docker container in AWS on a t3.xlarge instance (4 CPUs, 16GB RAM).
OK, so that's the same order of magnitude. Interesting that 4 CPUs seem to be enough.
90% of the time the system is handling incoming messages and the backlog of .pck files in the out queue doesn't last more than 10-15 minutes. Even if a single message takes longer to process, others move through via other runners.
Right, but the thing is this is not a single-queue multiserver model. In queueing theory terms, it's a queue-per-slice model. Load balancing is done by pseudorandomizing queue assignment. So if the head of the queue gets stuck, the whole slice is stuck, but the queue manager keeps adding to that slice.
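Concretely, with 8 out runners each slice owns a contiguous 2^157-wide range of the 2^160 SHA1 space, so a queue entry whose digest starts with hex digit 3 always lands in slice 1 (the [2^157, 2^158) range), no matter how backed up that slice already is.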
It does sound like you're not seeing the stalling problem we did. I wonder why not; it looks like you're using Postfix, too.
So is there a 1:1 relationship between the number of runners and the maximum number of messages that may be processed with .bak files in the out queue?
I don't know about the relationship to the .bak files. I don't remember offhand how they get cleaned up. But runners handle only one message at a time. If you have 32 runners, there will be at most 32 messages being processed at a time. However, normally a runner can process several messages per second. So given 2 hours offline or whatever, 60k/12 = 5k messages would show up at once. A single runner should be able to handle that in another two hours, so I think this is indeed explained by outgoing gateway throttling.
Again, Dan Caballero writes:
We relay the list messages through an external SMTP host. The administrator tried increasing the limit on the number of client connections as they had set it fairly low.
smtpd_client_connection_count_limit
Oops, forgot about this. My client on the large site had already tweaked that because they had that setting for the Mailman 2 site we were migrating. He mentioned it but it wasn't top of mind for me because I didn't have access to the SMTP gateway host.
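For reference, that knob lives in Postfix's main.cf on the gateway; the value below is only illustrative (the right limit depends on the gateway's capacity):

    # /etc/postfix/main.cf -- example value only
    smtpd_client_connection_count_limit = 100

followed by a "postfix reload" to pick it up.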
They've increased the threshold and that seems to have helped, as I immediately saw an increase in both the connections shown by lsof -i:25 and the number of .bak files in the out queue directory.
Yay!
Regards, Steve
Dan Caballero writes:
So is there a 1:1 relationship between the number of runners and the maximum number of messages that may be processed with .bak files in the out queue?
Yes. The way this works is the runner gets the first message in its slice from the .pck file and renames the .pck to .bak. It then processes the message and, upon successful completion, removes the .bak. I.e., the .bak file contains the message currently being processed by that runner.
The purpose of the .bak is disaster recovery: if the runner gets killed (say, by a power failure), it recovers the .bak file when it starts again.
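In rough pseudocode, the cycle looks something like this (a simplified sketch, not the actual switchboard code; the link below has the real implementation):

    # Simplified sketch of a runner's dequeue/finish cycle -- not Mailman's code.
    import os
    import pickle

    def process_one(queue_dir, filebase, handle_message):
        pck = os.path.join(queue_dir, filebase + ".pck")
        bak = os.path.join(queue_dir, filebase + ".bak")
        os.rename(pck, bak)          # claim the entry; .bak means "in progress"
        with open(bak, "rb") as fp:
            msg = pickle.load(fp)    # the queued message and its metadata
        handle_message(msg)          # e.g. hand it to the outgoing SMTP gateway
        os.remove(bak)               # success: nothing left to recover
        # If the runner dies before os.remove(), the .bak survives on disk and
        # is recovered into the queue when the runner starts up again.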
See https://gitlab.com/mailman/mailman/-/blob/master/src/mailman/core/switchboar... for the details of how all this, including slicing, works.
-- Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan
participants (3)
- Dan Caballero
- Mark Sapiro
- Stephen J. Turnbull