Postorius fails to show list because of a large message being held
Hi,
Recently a member of one of our private lists sent a message with a picture attached. Because of its large size (23 MB), the message was put into the moderation queue.
But since then Postorius fails to open the list's page with this error: "Something went wrong. Mailman REST API not available. Please start Mailman core."
The database logs show that it processed a request related to the held message just before failing:
LOG: statement: SELECT message.id AS message_id_1, message.message_id AS message_message_id, message.message_id_hash AS message_message_id_hash, message.path AS message_path FROM message WHERE message.message_id = '<CAEcCghAdp+20jiG19jNGX06xhJ-iYvZGyyV8b-toW7LqCSv5JQ@mail.gmail.com>' LIMIT 1
Mailman Core logs:
[2020-07-14 10:41:21 +0000] [36] [CRITICAL] WORKER TIMEOUT (pid:49)
[2020-07-14 10:41:21 +0000] [49] [INFO] Worker exiting (pid: 49)
[2020-07-14 10:41:22 +0000] [50] [INFO] Booting worker with pid: 50
How should I proceed to unblock the situation?
Thanks,
_g.
On 7/14/20 6:49 AM, Gilles Filippini wrote:
Mailman Core logs:
[2020-07-14 10:41:21 +0000] [36] [CRITICAL] WORKER TIMEOUT (pid:49)
[2020-07-14 10:41:21 +0000] [49] [INFO] Worker exiting (pid: 49)
[2020-07-14 10:41:22 +0000] [50] [INFO] Booting worker with pid: 50
I am pretty sure that timeout is related to gunicorn. If it is, then increasing the worker timeout setting for gunicorn should work. You tend to get such errors when doing an export of a large membership roster. This is the first time I have seen it related to a held message.
I have this in my setup:
mailman.cfg:
[webservice]
configuration: /opt/mailman/mm/gunicorn.cfg
gunicorn.cfg:
[gunicorn]
workers = 4
timeout = 900
You will need to restart gunicorn after adjusting the timeout setting.
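For reference, one way to confirm which gunicorn config core will actually read is the mailman conf command; the venv path below is an assumption matching the layout used elsewhere in this thread.

# Print the [webservice] section as Mailman core sees it; the
# 'configuration' key should point at your gunicorn.cfg.
/opt/mailman/mm/venv/bin/mailman conf -s webservice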
--
Please let me know if you need further assistance.
Thank you for your business. We appreciate our clients.
Brian Carpenter
EMWD.com
--
EMWD's Knowledgebase: https://clientarea.emwd.com/index.php/knowledgebase
EMWD's Community Forums: http://discourse.emwd.com/
On 7/14/20 3:54 AM, Brian Carpenter wrote:
On 7/14/20 6:49 AM, Gilles Filippini wrote:
Mailman Core logs:
[2020-07-14 10:41:21 +0000] [36] [CRITICAL] WORKER TIMEOUT (pid:49)
[2020-07-14 10:41:21 +0000] [49] [INFO] Worker exiting (pid: 49)
[2020-07-14 10:41:22 +0000] [50] [INFO] Booting worker with pid: 50
I am pretty sure that timeout is related to gunicorn. If it is, then increasing the worker timeout setting for gunicorn should work. You tend to get such errors when doing an export of a large membership roster. This is the first time I have seen it related to a held message.
It seems weird that a gunicorn worker would run for more than the default 30 seconds because of a 23 MB message, but I think Brian is correct.
I have this in my setup:
mailman.cfg:
[webservice]
configuration: /opt/mailman/mm/gunicorn.cfg
gunicorn.cfg:
[gunicorn]
workers = 4
timeout = 900
You will need to restart gunicorn after adjusting the timeout setting.
It's confusing, but after adjusting that setting, you need to restart Mailman core, not gunicorn.
If you are running a gunicorn service, that is what is providing WSGI support for Django (Postorius and/or HyperKitty). You may be using gunicorn for this, or uWSGI, or Apache mod_wsgi, but in any case that is not the gunicorn we are considering here.
Regardless of what you use to provide WSGI support for Django, Mailman core uses gunicorn to support the REST API, and that is the gunicorn affected by the settings in the gunicorn.cfg pointed to by
[webservice]
configuration: /opt/mailman/mm/gunicorn.cfg
If, with the above configuration, you do
ps -fwwA|grep runner=rest
you will see something like:
mailman 20582 20561 0 01:21 ? 00:00:16 /opt/mailman/mm/venv/bin/python /opt/mailman/mm/venv/bin/runner -C /opt/mailman/mm/deployment/mailman.cfg --runner=rest:0:1
mailman 20722 20582 0 01:21 ? 00:01:19 /opt/mailman/mm/venv/bin/python /opt/mailman/mm/venv/bin/runner -C /opt/mailman/mm/deployment/mailman.cfg --runner=rest:0:1
mailman 20725 20582 0 01:21 ? 00:01:18 /opt/mailman/mm/venv/bin/python /opt/mailman/mm/venv/bin/runner -C /opt/mailman/mm/deployment/mailman.cfg --runner=rest:0:1
mailman 20726 20582 0 01:21 ? 00:01:18 /opt/mailman/mm/venv/bin/python /opt/mailman/mm/venv/bin/runner -C /opt/mailman/mm/deployment/mailman.cfg --runner=rest:0:1
mailman 20727 20582 0 01:21 ? 00:01:50 /opt/mailman/mm/venv/bin/python /opt/mailman/mm/venv/bin/runner -C /opt/mailman/mm/deployment/mailman.cfg --runner=rest:0:1
The first of these, pid 20582, is Mailman's actual REST runner. The other 4 with parent pid 20582 are the 4 gunicorn worker processes forked from the REST runner.
It's the REST runner or Mailman core that needs to be restarted to pick up that change.
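A minimal sketch of that restart, assuming the paths from the ps output above and a venv-based install (adjust the user, paths, and service names to your setup):

# Restart Mailman core so the REST runner re-reads gunicorn.cfg.
sudo -u mailman /opt/mailman/mm/venv/bin/mailman -C /opt/mailman/mm/deployment/mailman.cfg restart
# If core runs under a service manager instead, restart that unit, e.g.
# (unit name is hypothetical): sudo systemctl restart mailman3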
--
Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
"The highway is for gamblers, better use your sense" - B. Dylan
On 7/14/20 1:06 PM, Mark Sapiro wrote:
It's confusing, but after adjusting that setting, you need to restart Mailman core, not gunicorn.
Thanks Mark. I knew that once but forgot it.
--
Please let me know if you need further assistance.
Thank you for your business. We appreciate our clients.
Brian Carpenter
EMWD.com
--
EMWD's Knowledgebase: https://clientarea.emwd.com/index.php/knowledgebase
EMWD's Community Forums: http://discourse.emwd.com/
Hi,
On 7/14/2020 at 12:54, Brian Carpenter wrote:
On 7/14/20 6:49 AM, Gilles Filippini wrote:
Mailman Core logs:
[2020-07-14 10:41:21 +0000] [36] [CRITICAL] WORKER TIMEOUT (pid:49)
[2020-07-14 10:41:21 +0000] [49] [INFO] Worker exiting (pid: 49)
[2020-07-14 10:41:22 +0000] [50] [INFO] Booting worker with pid: 50
I am pretty sure that timeout is related to gunicorn. If it is, then increasing the worker timeout setting for gunicorn should work. You tend to get such errors when doing an export of a large membership roster. This is the first time I have seen it related to a held message.
I have this in my setup:
mailman.cfg:
[webservice]
configuration: /opt/mailman/mm/gunicorn.cfg
gunicorn.cfg:
[gunicorn]
workers = 4
timeout = 900
You will need to restart gunicorn after adjusting the timeout setting.
Sorry for the late answer, I was afk the last 24 hours.
I can confirm that this solution works. Thanks a lot! And thanks to Mark for the added details.
But it sounds weird that printing out the number of held messages could time out because of these messages' size.
Best,
_g.
I'm looking to fine-tune the timeout settings here as well. We're seeing that if a list has more than a few held messages, trying to release them results in a timeout: the error "message not found" occurs, and the page then needs to be refreshed to see the actual count of remaining messages.
I'm considering the settings below for gunicorn.cfg
[gunicorn]
workers = 4
# No. of seconds to wait before killing the worker processes.
graceful_timeout = 600
# Timeout for the requests.
timeout = 900
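As an aside, when the Postorius page itself times out, held messages can also be listed and moderated directly against core's REST API. The sketch below uses the documented defaults (restadmin/restpass on localhost:8001) and an example list-id; adjust them to the [webservice] settings in your mailman.cfg.

# List the held messages for a list (the list-id is an example).
curl -u restadmin:restpass http://localhost:8001/3.1/lists/mylist.example.com/held
# Moderate one held message by its request id (action can be accept,
# reject, discard, or defer; the id 42 is a placeholder).
curl -u restadmin:restpass -X POST -d action=accept http://localhost:8001/3.1/lists/mylist.example.com/held/42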
On 12/21/21 4:42 PM, dancab@caltech.edu wrote:
I'm considering the settings below for gunicorn.cfg
[gunicorn]
OK
# No. of seconds to wait before killing the worker processes.
graceful_timeout = 600
I don't think this is relevant. It only affects the delay before killing a worker when gunicorn is restarted, but I suppose it won't hurt.
# Timeout for the requests.
timeout = 900
I wonder if this is really the issue. Do you have tracebacks from these timeouts? I think there may be something else causing this that increasing the timeout won't solve. Maybe a database lock or deadlock, or a similar issue.
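If it helps, something like the following can pull context around the timeouts; the log path is an assumption based on the /opt/mailman/mm layout earlier in the thread, and where core's gunicorn messages actually land depends on your logging setup.

# Grab a couple of lines before and twenty after each worker timeout.
grep -B 2 -A 20 'WORKER TIMEOUT' /opt/mailman/mm/var/logs/mailman.log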
--
Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
"The highway is for gamblers, better use your sense" - B. Dylan
Your instincts were correct. I adjusted the Gunicorn timeouts for Postorius/Django instead, and that has alleviated the problem.
We use supervisord on our containers, so I changed the gunicorn command there.
command = gunicorn --timeout 0 --graceful-timeout 300 --config /var/www/mailman-www/gunicorn_config.py wsgi
I'm also looking to speed up the processing so I'm wondering if adding more Gunicorn "workers" for Postorius would help with that.
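As a rough sketch, the gunicorn docs suggest about (2 x CPU cores) + 1 workers as a starting point, so on a 2-core host the supervisord entry above might become the following (the worker count is an assumption to tune for your hardware):

# Hypothetical supervisord command line with more Django-side workers.
command = gunicorn --workers 5 --timeout 0 --graceful-timeout 300 --config /var/www/mailman-www/gunicorn_config.py wsgi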
I'll do some tests in the new year. Thanks for all your help.
Happy Holidays!
participants (4)
- Brian Carpenter
- dancab@caltech.edu
- Gilles Filippini
- Mark Sapiro