Mailman runners missing - nothing in logs
Hi,
Since migrating from mailman2 to mailman3, the following has happened twice, most recently today:
When sending a mail to a list I get the following reply from postfix after a while:
connect to 127.0.0.1[127.0.0.1]:8024: Connection refused
I searched for that and found some posts that suggest the following shell command to check what is running:
ps -fwwu mailman
And I get this as output:
UID        PID  PPID  C STIME TTY      TIME     CMD
mailman   1166     1  0 Jan07 ?        00:00:01 /opt/mailman/core/venv/bin/python3 /opt/mailman/core/venv/bin/master -C /opt/mailman/core/mailman.cfg
mailman   1180  1166  0 Jan07 ?        00:03:09 /opt/mailman/core/venv/bin/python3 /opt/mailman/core/venv/bin/runner -C /opt/mailman/core/mailman.cfg --runner=rest:0:1
mailman   1212  1180  0 Jan07 ?        00:01:10 /opt/mailman/core/venv/bin/python3 /opt/mailman/core/venv/bin/runner -C /opt/mailman/core/mailman.cfg --runner=rest:0:1
mailman  11039 11036  0 06:25 ?        00:00:03 /usr/bin/uwsgi-core --ini mailman3.ini
mailman  11064 11039  0 06:25 ?        00:00:00 /usr/bin/uwsgi-core --ini mailman3.ini
mailman  11065 11039  0 06:25 ?        00:00:01 /usr/bin/uwsgi-core --ini mailman3.ini
mailman  17257  1180  0 Jan12 ?        00:00:37 /opt/mailman/core/venv/bin/python3 /opt/mailman/core/venv/bin/runner -C /opt/mailman/core/mailman.cfg --runner=rest:0:1
Lots of expected runners are missing. After restarting the service the output of the same command is:
UID        PID  PPID  C STIME TTY      TIME     CMD
mailman  11039 11036  0 06:25 ?        00:00:03 /usr/bin/uwsgi-core --ini mailman3.ini
mailman  11064 11039  0 06:25 ?        00:00:00 /usr/bin/uwsgi-core --ini mailman3.ini
mailman  11065 11039  0 06:25 ?        00:00:01 /usr/bin/uwsgi-core --ini mailman3.ini
mailman  14430     1 13 14:50 ?        00:00:01 /opt/mailman/core/venv/bin/python3 /opt/mailman/core/venv/bin/master -C /opt/mailman/core/mailman.cfg
mailman  14434 14430 18 14:50 ?        00:00:01 /opt/mailman/core/venv/bin/python3 /opt/mailman/core/venv/bin/runner -C /opt/mailman/core/mailman.cfg --runner=archive:0:1
mailman  14435 14430 20 14:50 ?        00:00:01 /opt/mailman/core/venv/bin/python3 /opt/mailman/core/venv/bin/runner -C /opt/mailman/core/mailman.cfg --runner=bounces:0:1
mailman  14436 14430 18 14:50 ?        00:00:01 /opt/mailman/core/venv/bin/python3 /opt/mailman/core/venv/bin/runner -C /opt/mailman/core/mailman.cfg --runner=command:0:1
mailman  14437 14430 17 14:50 ?        00:00:01 /opt/mailman/core/venv/bin/python3 /opt/mailman/core/venv/bin/runner -C /opt/mailman/core/mailman.cfg --runner=in:0:1
mailman  14438 14430 17 14:50 ?        00:00:01 /opt/mailman/core/venv/bin/python3 /opt/mailman/core/venv/bin/runner -C /opt/mailman/core/mailman.cfg --runner=lmtp:0:1
mailman  14439 14430 17 14:50 ?        00:00:01 /opt/mailman/core/venv/bin/python3 /opt/mailman/core/venv/bin/runner -C /opt/mailman/core/mailman.cfg --runner=nntp:0:1
mailman  14440 14430 19 14:50 ?        00:00:01 /opt/mailman/core/venv/bin/python3 /opt/mailman/core/venv/bin/runner -C /opt/mailman/core/mailman.cfg --runner=out:0:1
mailman  14441 14430 20 14:50 ?        00:00:01 /opt/mailman/core/venv/bin/python3 /opt/mailman/core/venv/bin/runner -C /opt/mailman/core/mailman.cfg --runner=pipeline:0:1
mailman  14442 14430 20 14:50 ?        00:00:01 /opt/mailman/core/venv/bin/python3 /opt/mailman/core/venv/bin/runner -C /opt/mailman/core/mailman.cfg --runner=rest:0:1
mailman  14443 14430 17 14:50 ?        00:00:01 /opt/mailman/core/venv/bin/python3 /opt/mailman/core/venv/bin/runner -C /opt/mailman/core/mailman.cfg --runner=retry:0:1
mailman  14444 14430 17 14:50 ?        00:00:01 /opt/mailman/core/venv/bin/python3 /opt/mailman/core/venv/bin/runner -C /opt/mailman/core/mailman.cfg --runner=virgin:0:1
mailman  14445 14430 18 14:50 ?        00:00:01 /opt/mailman/core/venv/bin/python3 /opt/mailman/core/venv/bin/runner -C /opt/mailman/core/mailman.cfg --runner=digest:0:1
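Comparing the two listings by eye is error-prone, so here is a minimal sketch of how the check could be automated. It assumes the set of expected runners is exactly the twelve names that appear in the healthy listing after the restart; the function names are made up for illustration and are not part of Mailman.

```python
import re

# Runner names as they appear in the healthy `ps` listing after the restart.
# Assumption: this is the complete set expected on this installation.
EXPECTED_RUNNERS = {
    "archive", "bounces", "command", "in", "lmtp", "nntp",
    "out", "pipeline", "rest", "retry", "virgin", "digest",
}


def running_runners(ps_output):
    """Return the set of runner names found in `ps -fwwu mailman` output."""
    # Every runner process carries an argument such as --runner=lmtp:0:1.
    return set(re.findall(r"--runner=([a-z]+):\d+:\d+", ps_output))


def missing_runners(ps_output):
    """Return the expected runner names that have no process in the listing."""
    return EXPECTED_RUNNERS - running_runners(ps_output)
```

Fed the text of the first listing, `missing_runners` would report every runner except `rest`, which includes `lmtp` and so matches the refused connection on port 8024.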
So now everything is back to normal. Before restarting the service I checked all the logs in /opt/mailman/var/log, but I did not see any errors. The master was still running, and I would expect the master to start or restart a runner if it died?
Fun fact: the first occurrence was two weeks ago, which was two weeks after the migration to mailman3. I'm curious whether it will happen again in two weeks.
Installed packages for core:
mailman (3.3.0)
mailman-hyperkitty (1.1.0)
Any suggestions where I should have a deeper look and maybe find the root cause of this problem?
Thanks in advance
Torge
On 1/18/20 6:13 AM, Torge Riedel wrote:
So now everything is back to normal. Before restarting the service I checked all the logs in /opt/mailman/var/log, but I did not see any errors. The master was still running, and I would expect the master to start or restart a runner if it died?
What's in /opt/mailman/var/log/mailman.log? There should be entries like
Jan 18 13:51:11 2020 (11700) xxxx runner started.
for each runner every time it starts and like
Jan 18 13:47:31 2020 (10096) xxxx runner caught SIGTERM. Stopping.
Jan 18 13:47:31 2020 (10096) xxxx runner exiting.
for each runner every time it stops, but with perhaps a reason other than caught SIGTERM.
That should at least tell you when they stopped. Also, if they stopped for some reason. The master will only restart runners that exit if they exit because of SIGUSR1 or some internal error and even then, only the configured max_restarts number of times.
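To find when each runner last started or stopped, the log can be scanned for these entries. The following is a rough sketch that assumes every relevant line has exactly the shape shown above (timestamp, PID in parentheses, runner name, the word "runner", then the event text); the function names are hypothetical.

```python
import re

# Matches lines such as:
#   Jan 18 13:51:11 2020 (11700) xxxx runner started.
#   Jan 18 13:47:31 2020 (10096) xxxx runner caught SIGTERM. Stopping.
#   Jan 18 13:47:31 2020 (10096) xxxx runner exiting.
LINE_RE = re.compile(
    r"^(?P<when>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) "
    r"\((?P<pid>\d+)\) (?P<runner>\S+) runner (?P<event>.+)$"
)


def last_events(log_lines):
    """Map each runner name to its most recent (timestamp, pid, event)."""
    latest = {}
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if match:
            latest[match["runner"]] = (
                match["when"], int(match["pid"]), match["event"]
            )
    return latest


def stopped_runners(log_lines):
    """Return the runners whose most recent logged event is not a start."""
    return {
        name: event
        for name, event in last_events(log_lines).items()
        if event[2] != "started."
    }
```

Passing it the lines of /opt/mailman/var/log/mailman.log (for example `stopped_runners(open(path))`) would list each runner whose last entry is a stop, along with the time and PID.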
Do you have a logrotate script for the logs in /opt/mailman/var/log, and if so, does it have a postrotate script that signals Mailman to reopen logs other than by a mailman reopen command, perhaps by signaling the master with other than SIGHUP?
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
On 18.01.20 at 23:14, Mark Sapiro wrote:
On 1/18/20 6:13 AM, Torge Riedel wrote:
So now everything is back to normal. Before restarting the service I checked all the logs in /opt/mailman/var/log, but I did not see any errors. The master was still running, and I would expect the master to start or restart a runner if it died?
What's in /opt/mailman/var/log/mailman.log? There should be entries like
Jan 18 13:51:11 2020 (11700) xxxx runner started.
for each runner every time it starts and like
Jan 18 13:47:31 2020 (10096) xxxx runner caught SIGTERM. Stopping.
Jan 18 13:47:31 2020 (10096) xxxx runner exiting.
for each runner every time it stops, but with perhaps a reason other than caught SIGTERM.
That should at least tell you when they stopped. Also, if they stopped for some reason. The master will only restart runners that exit if they exit because of SIGUSR1 or some internal error and even then, only the configured max_restarts number of times.
Do you have a logrotate script for the logs in /opt/mailman/var/log, and if so, does it have a postrotate script that signals Mailman to reopen logs other than by a mailman reopen command, perhaps by signaling the master with other than SIGHUP?
Hi Mark,
I checked the logs for such entries. I only see them at the times when log rotation runs, and from yesterday, when I restarted the service myself.
So yes, I have log rotation configured. It has a postrotate script which executes systemctl reload mailman, and I can see in the logs that SIGUSR1 is handled. This happens daily in the morning and, as far as I can see, runs without any problems.
If I understand you correctly, I should change it to use "reopen" instead of "reload". I think this is something I have to pass to the mailman executable itself rather than to systemctl, right?
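Mark's question about the postrotate script can be checked mechanically. Below is a small sketch, assuming a logrotate stanza like the sample in the code. That stanza is hypothetical: only the log directory, the daily schedule and the systemctl reload mailman command come from this thread. The helpers just report what the postrotate block runs and whether it invokes the mailman command with reopen.

```python
# Hypothetical logrotate stanza; the real file on the system may differ.
SAMPLE_CONF = """
/opt/mailman/var/log/*.log {
    daily
    postrotate
        systemctl reload mailman
    endscript
}
"""


def postrotate_commands(logrotate_conf):
    """Return the command lines inside every postrotate ... endscript block."""
    commands = []
    inside = False
    for raw in logrotate_conf.splitlines():
        line = raw.strip()
        if line == "postrotate":
            inside = True
        elif line == "endscript":
            inside = False
        elif inside and line and not line.startswith("#"):
            commands.append(line)
    return commands


def uses_mailman_reopen(logrotate_conf):
    """True if a postrotate command calls the mailman CLI with 'reopen'."""
    for command in postrotate_commands(logrotate_conf):
        words = command.split()
        if words and words[0].endswith("mailman") and "reopen" in words[1:]:
            return True
    return False
```

For the sample stanza, `postrotate_commands` returns the single systemctl line and `uses_mailman_reopen` is false, since the command is run through systemctl rather than the mailman executable.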
Best regards
Torge