Dan Caballero writes:
I have tried increasing the number of runners to alleviate a backlog of messages in the "out" queue. I increased the out runners from 32 to 64.
How many messages per day are you processing?! How many CPUs do you have? How much memory?
I know a site with 2 in runners and 8 out runners on something basically the same as an 8-CPU, 16GB premium Linode. It processes over 100,000 messages/day. The out queue occasionally gets into double digits, but 99% of the time it's back at 0 or 1 within 3 seconds. Load average is usually around 5, CPU utilization 50-80%. I'm not sure it needs 2 in runners or 8 out runners, but it definitely needs at least 4 out runners. (We haven't tested a 4-runner configuration since resolving the "runner stall" issue described below. Virtual CPUs are not at a premium at that client.)
Is there another configuration needed beyond the number of runner instances to get more messages processed at once?
No, the slicing algorithm is trivial, the only upper bound (if you have enough memory ;-) is 2^160.
One thing we ran into on the system above was some (still unidentified) issue between Mailman and its SMTP out gateway that caused an out runner to stall for extended periods (sometimes it would restart, often not, nothing interesting in the log). Unfortunately I don't have the monitoring tool we used, but it's easy to create one.
The slicing algorithm is based on the file name, and divides the space of SHA1 hash values into N contiguous regions of equal length.[1] Get a directory listing of the out queue and count the number of files in each slice. If you see one slice (rarely more) whose count just keeps increasing, that's the stalled runner.
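Here is a minimal sketch of such a per-slice counter. It is not the monitoring tool we used, and it rests on assumptions you should check against switchboard.py: that queue file names look like "<timestamp>+<40 hex digits>.pck", and that slice i covers the i-th of N equal, contiguous ranges of the 160-bit SHA1 space. The function names are made up for illustration.

```python
import os

SHA1_SPACE = 1 << 160  # number of possible SHA1 values (2^160)


def slice_of(filename, numslices):
    """Return the slice index for a queue file name, or None if it doesn't parse.

    Assumes names of the form '<timestamp>+<hex sha1 digest><extension>'.
    """
    base, _ext = os.path.splitext(filename)
    try:
        _when, digest = base.split('+', 1)
        value = int(digest, 16)
    except ValueError:
        return None
    if not 0 <= value < SHA1_SPACE:
        return None
    # Equal contiguous regions: slice i holds values in
    # [i * 2^160 / N, (i + 1) * 2^160 / N).
    return value * numslices // SHA1_SPACE


def slice_counts(filenames, numslices):
    """Count how many queue files fall into each of the numslices slices."""
    counts = [0] * numslices
    for name in filenames:
        index = slice_of(name, numslices)
        if index is not None:
            counts[index] += 1
    return counts
```

Call slice_counts(os.listdir(out_queue_dir), 8) every few seconds, where out_queue_dir is wherever your out queue lives and 8 is your number of out runners, and watch for an index whose count only ever goes up.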
You can also use utilities like lsof to keep tabs on the connections to the outgoing SMTP gateway (connections are identified by the source port). If one lasts more than 5 seconds, that's it.
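A rough sketch of the lsof approach follows. It assumes you capture the text output of something like "lsof -n -P -i TCP:25" (numeric addresses and ports) at regular intervals, and that the gateway listens on port 25; adjust both for your site. The function names and the 5-second default are illustrative only.

```python
import re

# Matches the NAME column of "lsof -i" output, e.g. "10.0.0.5:54321->10.0.0.9:25".
CONNECTION = re.compile(r'(\S+):(\d+)->(\S+):(\d+)')


def source_ports(lsof_text, smtp_port=25):
    """Return the set of local source ports connected to smtp_port."""
    ports = set()
    for line in lsof_text.splitlines():
        match = CONNECTION.search(line)
        if match and int(match.group(4)) == smtp_port:
            ports.add(int(match.group(2)))
    return ports


def long_lived(first_seen, ports, now, limit=5.0):
    """Update first_seen (port -> time first observed) and return stale ports.

    A port is reported once it has been observed for more than limit seconds.
    Ports that are no longer present are forgotten.
    """
    for port in list(first_seen):
        if port not in ports:
            del first_seen[port]
    for port in ports:
        first_seen.setdefault(port, now)
    return sorted(p for p, t in first_seen.items() if now - t > limit)
```

Poll once a second or so, pass time.time() as now, and keep one first_seen dict across polls. The age is measured from the first poll that saw the connection, so it underestimates the true lifetime by up to one polling interval.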
Footnotes: [1] The slicing algorithm is in the __init__ function in mailman/src/mailman/core/switchboard.py.