Mailman sends Welcome emails, but does not send [List] postings emails
Having recently installed this same version of mailman (GNU Mailman 3.3.8 (Tom Sawyer)) a few times successfully, I have a puzzle with a current new installation. Via the command line I can create lists and add members, and the new members receive their Welcome message from mailman. But when a member sends an email to post to a list (e.g. hc@cs.org as shown here), the mailserver delivers it to mailman: May 29 16:36:58 localhost postfix/lmtp[44386]: E762E427CB: to=<hc@cs.org>, relay=127.0.0.1[127.0.0.1]:8024, delay=0.84, delays=0.8/0/0.01/0.03, dsn=2.0.0, status=sent (250 Ok) but no copy is sent to the members / subscribers. After the posts are received by mailman (as recorded in /var/tmp/mailman/logs/smtp.log), nothing more is added in any of the log files in /var/tmp/mailman/logs/... An attempt to imitate a technique seen elsewhere for [logging.smtp] for leaving some breadcrumbs by adding into mailman.cfg:
[logging.subscribe] level: debug path: mailman.log
dump_json('http://localhost:8001/3.1/lists/hc.cs.org' ... '/config/default_member_action', ... dict(default_member_action='accept'), 'PATCH', ... username='...', password='...') date: Mon, 29 May 2023 16:31:31 GMT server: gunicorn status: 204
and restarting mailman led to no detectable change. Expecting no change, I even changed for one UNmoderated list: the default_member_action from 'defer' to 'accept' to rule out some moderation process delay.
How can I enable some kind of trace or gather more information to learn where this configuration is in error?
Cheers, Nelson
Very interestng! Upon restarting the system (down momentarily for taking a snapshot) and starting Mailman again, all of these missing [List] messages emerged from their timewarp and were delivered to the members / subscribers. Just before shutting down the system I discovered that the apache2 webserver had been off the air due to a module not being enabled yet. As of yet, I was unaware of any requirement for a web server to be online (beyond possibly gunicorn) in order for Mailman to run its lists. Does that speculation have a factual basis? Or do I risk starting a vicious rumor amongst the impatient readers who are not still here once we deduce this pair of observations were merely correlated in time, with no cause-and-effect relationship? Before this situation healed itself, the thought had crossed my mind to peek and see if anything was stuck in the queues: https://docs.mailman3.org/projects/mailman/en/latest/src/mailman/rest/docs/q... While that could have provided peace of mind that email messages were merely delayed instead of vanished, even in hindsight I do not see how that would have led to repairing the situation.
In the absence of a hypothetical command: mailman status --verbose about which diagnostic tools or clues should I learn?
On Wed, May 31, 2023 at 4:18 PM <justfixit@marilynstrother.com> wrote:
Very interestng! Upon restarting the system (down momentarily for taking a snapshot) and starting Mailman again, all of these missing [List] messages emerged from their timewarp and were delivered to the members / subscribers. Just before shutting down the system I discovered that the apache2 webserver had been off the air due to a module not being enabled yet. As of yet, I was unaware of any requirement for a web server to be online (beyond possibly gunicorn) in order for Mailman to run its lists. Does that speculation have a factual basis? Or do I risk starting a vicious rumor amongst the impatient readers who are not still here once we deduce this pair of observations were merely correlated in time, with no cause-and-effect relationship? Before this situation healed itself, the thought had crossed my mind to peek and see if anything was stuck in the queues:
https://docs.mailman3.org/projects/mailman/en/latest/src/mailman/rest/docs/q... While that could have provided peace of mind that email messages were merely delayed instead of vanished, even in hindsight I do not see how that would have led to repairing the situation.
Mailman3 is in two parts which are completely INDEPENDENT:
- Mailman Core - Details at -> https://docs.mailman3.org/en/latest/install/virtualenv.html#installing-mailm...
- Maiman Web - Details at https://docs.mailman3.org/en/latest/install/virtualenv.html#installing-web-u...
Only Mailman Web requires a web server. To run a mailing list, you need Mailman Core+Database+MTA. No Webserver is required.
In the absence of a hypothetical command:
mailman status --verbose about which diagnostic tools or clues should I learn?
You should learn how to configure the logging for Mailman in mailman.cfg and how to read and interpret the information written to the logfiles :-)
PS: I know MM3 is a big animal and isn't easy to understand. I have been through that - and still there. With time you'll just understand the bits so don't despair.
-- Best regards, Odhiambo WASHINGTON, Nairobi,KE +254 7 3200 0004/+254 7 2274 3223 "Oh, the cruft.", egrep -v '^$|^.*#' ¯\_(ツ)_/¯ :-) [How to ask smart questions: http://www.catb.org/~esr/faqs/smart-questions.html]
... learn how to configure the logging ... On the chance it may be helpful for someone else, it seems like the best collection of information about logging is in: https://docs.mailman3.org/projects/mailman/en/latest/build/lib/mailman/confi... where one can see that my trial-and-error approach with [logging.subscribe] would be expected to have failed, since it provides "Information about leaves/joins" but not what I had imagined of "information from implementing the current subscriptions".
On 5/31/23 06:17, justfixit@marilynstrother.com wrote:
Very interestng! Upon restarting the system (down momentarily for taking a snapshot) and starting Mailman again, all of these missing [List] messages emerged from their timewarp and were delivered to the members / subscribers. Just before shutting down the system I discovered that the apache2 webserver had been off the air due to a module not being enabled yet. As of yet, I was unaware of any requirement for a web server to be online (beyond possibly gunicorn) in order for Mailman to run its lists. Does that speculation have a factual basis? Or do I risk starting a vicious rumor amongst the impatient readers who are not still here once we deduce this pair of observations were merely correlated in time, with no cause-and-effect relationship? Before this situation healed itself, the thought had crossed my mind to peek and see if anything was stuck in the queues: https://docs.mailman3.org/projects/mailman/en/latest/src/mailman/rest/docs/q... While that could have provided peace of mind that email messages were merely delayed instead of vanished, even in hindsight I do not see how that would have led to repairing the situation.
I suspect if you had looked, you would have found the messages in
Mailman's pipeline
queue, and they were not being processed because
the pipeline
runner wasn't running, although it is possible that the
issue was with the in
runner/queue. It wasn't the out
runner/queue
if welcome messages were being delivered.
The fact that the web server wasn't running is just a coincidence. It has nothing to do with Mailman core's processing of messages with the one exception that Mailman core needs to communicate with HyperKitty's API via mailman_hyperkitty, but this only affects archiving to HyperKitty and the addition of an Archived-At: header to messages.
In the absence of a hypothetical command: mailman status --verbose about which diagnostic tools or clues should I learn?
Logs are possibly the most useful. You might find something in mailman.log about the death of whatever runner had died. Looking at queues is also useful.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
Logs are possibly the most useful. You might find something in mailman.log about the death of whatever runner had died. Looking at queues is also useful. FWIW, looking back at this time period in mailman.log the only instances of "runner exiting" are in groups, at the times when I had manually issued "mailman restart" to try a few different debug logging options. I.e. there was no single / multiple record of "runner exiting" from e.g. an unexpected program fault.
On 5/31/23 22:49, justfixit@marilynstrother.com wrote:
FWIW, looking back at this time period in mailman.log the only instances of "runner exiting" are in groups, at the times when I had manually issued "mailman restart" to try a few different debug logging options. I.e. there was no single / multiple record of "runner exiting" from e.g. an unexpected program fault.
When you issued a restart and the logs show runners restarting, did all
the runners restart or was one or more missing. If you grep the log for
runner started
, you should see groups of messages like
Jun 01 08:00:27 2023 (21930) in runner started. Jun 01 08:00:27 2023 (21938) virgin runner started. Jun 01 08:00:28 2023 (21932) bounces runner started. Jun 01 08:00:28 2023 (21935) command runner started. Jun 01 08:00:28 2023 (21941) retry runner started. Jun 01 08:00:28 2023 (21934) archive runner started. Jun 01 08:00:28 2023 (21931) lmtp runner started. Jun 01 08:00:28 2023 (21942) pipeline runner started. Jun 01 08:00:29 2023 (21939) out runner started. Jun 01 08:00:29 2023 (21936) digest runner started. Jun 01 08:00:29 2023 (21937) nntp runner started. Jun 01 08:00:30 2023 (21940) task runner started.
and if any runners are sliced, there will be multiple entries for that
runner, one per slice. Also, because of the way the rest
runner is
run, it will be in the list on a start but not on a restart.
If a runner had previously died so it was not running when you did the
restart, it won't be started and will be missing from the above list. In
order to restart runners which have died, you need to do mailman stop
followed by mailman start
as opposed to mailman restart
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
As you had suggested, the in
runner/queue was missing in action. Given
your insight about the limitations of mailman restart
, I looked back
at the log from the time of the preceding mailman start
, and found
that there was NOT a line such as:
in runner started.
in that instance! Watching the logs on each subsequent mailman start
,
the good news is that the in
runner/queue has been started each time;
the bad news is that slightly more than half of the time one or two
runners are NOT shown as starting. The other runners/queues sometimes
missing in action per the log file have been retry
, task
, nntp
,
archive
, and pipeline
.
No errors are being recorded in the mailman log files. This is GNU
Mailman 3.3.8 via pip install mailman
on Debian 5.10.179-1
(2023-05-12) running on a shared system where VMware gives this server
enough cycles that mailman start
and mailman stop
each consume from
20 minutes to an hour of wall clock time, so I do not issue those
commands recreationally, attempting to keep the system available for
users. What should I do to help understand the cause for these failures?
Would not it be helpful for this limitation of restart to be included in:
mailman restart --help
with a suggestion to use mailman stop
and mailman start
instead?
On 6/1/23 11:16, Mark Sapiro wrote:
When you issued a restart and the logs show runners restarting, did all the runners restart or was one or more missing.
On 6/6/23 07:58, Nelson Strother wrote:
No errors are being recorded in the mailman log files. This is GNU Mailman 3.3.8 via
pip install mailman
on Debian 5.10.179-1 (2023-05-12) running on a shared system where VMware gives this server enough cycles thatmailman start
andmailman stop
each consume from 20 minutes to an hour of wall clock time, so I do not issue those commands recreationally, attempting to keep the system available for users. What should I do to help understand the cause for these failures?
If a runner has died, its death and a reason should be logged in Mailman's var/logs/mailman.log with a message similar to
Master detected subprocess exit (pid: 8617, why: SIGNAL 15, class: in, slice: 1/1)
This may not be the case if the runner is killed by the OS for an out of memory or similar reason. For this, look in syslog.
mailman stop
can take a long time because it is waiting for a runner
to stop. See https://gitlab.com/mailman/mailman/-/issues/255 but that
issue was fixed long ago. I don't understand why mailman start
would
take more time than mailman restart
. In fact, mailman restart
effectively does stop
and start
, but only for those runners which
are running.
Since you seem to frequently have missing runners, I suspect something
like an OOM condition is causing the OS to kill them. Although, I wonder
if you are correctly interpreting the logs. While the absence of the
retry
, task
, nntp
and archive
runners might not be noticed
except for messages not being archived, if either the in
or pipeline
runner is not running, no list posts will be processed.
Would not it be helpful for this limitation of restart to be included in: mailman restart --help with a suggestion to use
mailman stop
andmailman start
instead?
I have just filed https://gitlab.com/mailman/mailman/-/issues/1082 for this.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
On 6/6/23 11:47, Mark Sapiro wrote:
Although, I wonder if you are correctly interpreting the logs. While the absence of the
retry
,task
,nntp
andarchive
runners might not be noticed except for messages not being archived, if either thein
orpipeline
runner is not running, no list posts will be processed.
Ignore that. I was thinking of a different thread with different symptoms when I wrote it.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
Neither .../logs/mailman.log nor /var/log/syslog contain any recording
of "Master detected ..." nor process exits other than at the time of
mailman stop
commands. Watching during these transitions with vmstat
shows a decrease in free pages and an increase in swapping, but no
definitive hint of the OOM reaper running.
I did not intend to imply that mailman start
(predictably) consumes
more time than mailman restart
. Both are lengthy and variable; I think
one cause for mailman stop
consuming more time than usual is when
mailman receives a message as it is trying to stop, which is understandable.
Interestingly if I use systemctl start mailman
thus far the results are:
- the wall clock duration is shorter than when
mailman start
is issued from the command line, and -- either that all of the runners remain present once they are started, -- or I can see in syslog a traceback from each missing runner process starting from .../mailman/.local/bin/runner and ending with "flufl.lock._lockfile.NotLockedError: Already unlocked".
I do not yet understand how to make use of these clues, but at least one can see an epitaph from each deceased process.
On 6/6/23 14:47, Mark Sapiro wrote:
On 6/6/23 07:58, Nelson Strother wrote:
No errors are being recorded in the mailman log files. This is GNU Mailman 3.3.8 via
pip install mailman
on Debian 5.10.179-1 (2023-05-12) running on a shared system where VMware gives this server enough cycles thatmailman start
andmailman stop
each consume from 20 minutes to an hour of wall clock time, so I do not issue those commands recreationally, attempting to keep the system available for users. What should I do to help understand the cause for these failures?If a runner has died, its death and a reason should be logged in Mailman's var/logs/mailman.log with a message similar to
Master detected subprocess exit (pid: 8617, why: SIGNAL 15, class: in, slice: 1/1)
This may not be the case if the runner is killed by the OS for an out of memory or similar reason. For this, look in syslog.
mailman stop
can take a long time because it is waiting for a runner to stop. See https://gitlab.com/mailman/mailman/-/issues/255 but that issue was fixed long ago. I don't understand whymailman start
would take more time thanmailman restart
. In fact,mailman restart
effectively doesstop
andstart
, but only for those runners which are running.Since you seem to frequently have missing runners, I suspect something like an OOM condition is causing the OS to kill them. Although, I wonder if you are correctly interpreting the logs. While the absence of the
retry
,task
,nntp
andarchive
runners might not be noticed except for messages not being archived, if either thein
orpipeline
runner is not running, no list posts will be processed.Would not it be helpful for this limitation of restart to be included in: mailman restart --help with a suggestion to use
mailman stop
andmailman start
instead?I have just filed https://gitlab.com/mailman/mailman/-/issues/1082 for this.
Nelson Strother writes:
I did not intend to imply that
mailman start
(predictably) consumes more time thanmailman restart
. Both are lengthy and variable;
You did say something about "30min" I believe. I don't understand why it would take anywhere near that long. For example, I'm working with a site with ~20,000 lists, and it only takes a few seconds.
Oh ... wait ... are you using Postfix as the MTA? The supplied Postfix MTA interface rebuilds the list aliases database on every (re)start, and that's an O(#lists^2) process the way Mailman does it. Postfix doesn't actually use the file to route mail, it always goes to lmtp:127.0.01 or something like that. So I configured Postfix to probe for existence of the list in Mailman's database.
I think Postfix has plugins for all Mailman-supported databases (PostgreSQL, MySQL/MariaDB, SQLite3). I can provide the changes needed for this for PostgreSQL right away, if you're a MySQL site I can give you the PostgreSQL config to give you the idea, but I may be sslow if you want advice about the MySQL config. It's pretty straightforward, though.
I do not yet understand how to make use of these clues, but at least one can see an epitaph from each deceased process.
I don't understand them either. The only thing I can think of is that a runner may try to access Mailman core, timeout, and crash during the long startup.
At least the MTA reconfig would make the restart cycle a lot less painful. Note: this is NOT a Mailman-supported config (or even contrib quality) yet, but I'm pretty confident of the theory and it is working well in QA testing.
Regarding the "alpha" status of the config, one Mailman 3 instance can support multiple domains, but the Postfix docs say not to hit the database for the "domain exists" check. If you aren't going to add additional Mailman domains, then this doesn't matter (the domain list file produced by the stock Mailman postfix config will work), but if you do add or remove domains, you either need to configure Postfix to hit the database for the "domain exists" check too contrary to Postfix's advice, or you'd need to manage the domain list file by hand.
In any case, you could use this config for debugging, and go back to the tried and true aliases file approach for production. And as I indicated I'm somewhat available on a best-effort basis for trouble-shooting.
Yes, Postfix is also running, but this system currently has a trivially
small configuration (single domain, single digit number of lists). When
the CPU is kept ~100% busy for (maybe once as few as a) dozen minutes to
(usually 20 minutes, but rarely) an hour [1] during mailman start
,
essentially all of the CPU time is being consumed by the dozen or so
mailman runner processes, so I do not think Postfix alias processing has
much if anything to do with these delays.
[1] wall clock time; as an arbitrary indicator of how many cycles VMWare is giving to this server, here is what 7-zip observes:
x@localhost:~$ 7za b
7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21 p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,1 CPU Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz (406F0),ASM,AES-NI)
Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz (406F0) CPU Freq: - - - - - - - - -
RAM size: 425 MB, # CPU hardware threads: 1 RAM usage: 219 MB, # Benchmark threads: 1
Compressing | Decompressing Dict Speed Usage R/U Rating | Speed Usage R/U Rating KiB/s % MIPS MIPS | KiB/s % MIPS MIPS
22: 2909 99 2861 2831 | 29117 99 2502 2486 23: 2370 88 2741 2415 | 28833 99 2512 2496 24: 1937 79 2622 2083 | 27937 98 2501 2453 ---------------------------------- | ------------------------------ Avr: 89 2741 2443 | 99 2505 2478 Tot: 94 2623 2461 x@localhost:~$
On 6/8/23 10:14, Stephen J. Turnbull wrote:
Nelson Strother writes:
I did not intend to imply that
mailman start
(predictably) consumes more time thanmailman restart
. Both are lengthy and variable;You did say something about "30min" I believe. I don't understand why it would take anywhere near that long. For example, I'm working with a site with ~20,000 lists, and it only takes a few seconds.
Oh ... wait ... are you using Postfix as the MTA? The supplied Postfix MTA interface rebuilds the list aliases database on every (re)start, and that's an O(#lists^2) process the way Mailman does it. Postfix doesn't actually use the file to route mail, it always goes to lmtp:127.0.01 or something like that. So I configured Postfix to probe for existence of the list in Mailman's database.
I think Postfix has plugins for all Mailman-supported databases (PostgreSQL, MySQL/MariaDB, SQLite3). I can provide the changes needed for this for PostgreSQL right away, if you're a MySQL site I can give you the PostgreSQL config to give you the idea, but I may be sslow if you want advice about the MySQL config. It's pretty straightforward, though.
I do not yet understand how to make use of these clues, but at least one can see an epitaph from each deceased process.
I don't understand them either. The only thing I can think of is that a runner may try to access Mailman core, timeout, and crash during the long startup.
At least the MTA reconfig would make the restart cycle a lot less painful. Note: this is NOT a Mailman-supported config (or even contrib quality) yet, but I'm pretty confident of the theory and it is working well in QA testing.
Regarding the "alpha" status of the config, one Mailman 3 instance can support multiple domains, but the Postfix docs say not to hit the database for the "domain exists" check. If you aren't going to add additional Mailman domains, then this doesn't matter (the domain list file produced by the stock Mailman postfix config will work), but if you do add or remove domains, you either need to configure Postfix to hit the database for the "domain exists" check too contrary to Postfix's advice, or you'd need to manage the domain list file by hand.
In any case, you could use this config for debugging, and go back to the tried and true aliases file approach for production. And as I indicated I'm somewhat available on a best-effort basis for trouble-shooting.
Mailman-users mailing list --mailman-users@mailman3.org To unsubscribe send an email tomailman-users-leave@mailman3.org https://lists.mailman3.org/mailman3/lists/mailman-users.mailman3.org/ Archived at:https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/message/...
This message sent tojustfixit@marilynstrother.com
On 6/7/23 21:30, Nelson Strother wrote:
Interestingly if I use
systemctl start mailman
thus far the results are:
- the wall clock duration is shorter than when
mailman start
is issued from the command line, and -- either that all of the runners remain present once they are started, -- or I can see in syslog a traceback from each missing runner process starting from .../mailman/.local/bin/runner and ending with "flufl.lock._lockfile.NotLockedError: Already unlocked".I do not yet understand how to make use of these clues, but at least one can see an epitaph from each deceased process.
The NotLockedError is most likely because a lock has been set with a lifetime and its lifetime has passed and some other process has broken the lock before the original process tries to unlock it. This probably has something to do with the long times to do start/restart.
However, the only lock that should be involved is the master lock set by the master watcher, so I don't really understand what might be happening.
What is the full traceback from these errors?
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
Here are the (four identical, but for timestamps and process numbers, at a glance) tracebacks, and some related info:
--- from /var/log/syslog Jun 8 03:35:47 localhost mailman[13107]: Traceback (most recent call last): Jun 8 03:35:48 localhost mailman[13107]: File "/var/tmp/mailman/.local/bin/runner", line 8, in <module> Jun 8 03:35:48 localhost mailman[13107]: sys.exit(main()) Jun 8 03:35:48 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 1130, in __call__ Jun 8 03:35:48 localhost mailman[13107]: return self.main(*args, **kwargs) Jun 8 03:35:48 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 1055, in main Jun 8 03:35:48 localhost mailman[13107]: rv = self.invoke(ctx) Jun 8 03:35:48 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 1404, in invoke Jun 8 03:35:48 localhost mailman[13107]: return ctx.invoke(self.callback, **ctx.params) Jun 8 03:35:48 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 760, in invoke Jun 8 03:35:48 localhost mailman[13107]: return __callback(*args, **kwargs) Jun 8 03:35:48 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func Jun 8 03:35:48 localhost mailman[13107]: return f(get_current_context(), *args, **kwargs) Jun 8 03:35:48 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/bin/runner.py", line 160, in main Jun 8 03:35:49 localhost mailman[13107]: initialize(config_file, verbose) Jun 8 03:35:49 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/core/initialize.py", line 231, in initialize Jun 8 03:35:49 localhost mailman[13107]: initialize_2(propagate_logs=propagate_logs) Jun 8 03:35:49 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/core/initialize.py", line 189, in initialize_2 Jun 8 03:35:49 localhost mailman[13107]: config.db = getUtility(IDatabaseFactory, utility_name).create() Jun 8 03:35:49 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/database/factory.py", line 60, in create Jun 8 03:35:49 localhost mailman[13107]: return database Jun 8 03:35:49 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/flufl/lock/_lockfile.py", line 470, in __exit__ Jun 8 03:35:49 localhost mailman[13107]: self.unlock() Jun 8 03:35:49 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/flufl/lock/_lockfile.py", line 420, in unlock Jun 8 03:35:49 localhost mailman[13107]: raise NotLockedError('Already unlocked') Jun 8 03:35:49 localhost mailman[13107]: flufl.lock._lockfile.NotLockedError: Already unlocked Jun 8 03:35:49 localhost dhclient[633]: RCV: Advertise message on ens192 from fe80::250:56ff:fe95:b457. Jun 8 03:36:46 localhost mailman[13100]: Traceback (most recent call last): Jun 8 03:36:46 localhost mailman[13100]: File "/var/tmp/mailman/.local/bin/runner", line 8, in <module> Jun 8 03:36:46 localhost mailman[13100]: sys.exit(main()) Jun 8 03:36:46 localhost mailman[13100]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 1130, in __call__ Jun 8 03:36:46 localhost mailman[13100]: return self.main(*args, **kwargs) Jun 8 03:36:46 localhost mailman[13100]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 1055, in main Jun 8 03:36:46 localhost mailman[13100]: rv = self.invoke(ctx) Jun 8 03:36:46 localhost mailman[13100]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 1404, in invoke Jun 8 03:36:46 localhost mailman[13100]: return ctx.invoke(self.callback, **ctx.params) Jun 8 03:36:46 localhost mailman[13100]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 760, in invoke Jun 8 03:36:46 localhost mailman[13100]: return __callback(*args, **kwargs) Jun 8 03:36:46 localhost mailman[13100]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func Jun 8 03:36:46 localhost mailman[13100]: return f(get_current_context(), *args, **kwargs) Jun 8 03:36:46 localhost mailman[13100]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/bin/runner.py", line 160, in main Jun 8 03:36:46 localhost mailman[13100]: initialize(config_file, verbose) Jun 8 03:36:46 localhost mailman[13100]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/core/initialize.py", line 231, in initialize Jun 8 03:36:46 localhost mailman[13100]: initialize_2(propagate_logs=propagate_logs) Jun 8 03:36:46 localhost mailman[13100]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/core/initialize.py", line 189, in initialize_2 Jun 8 03:36:46 localhost mailman[13100]: config.db = getUtility(IDatabaseFactory, utility_name).create() Jun 8 03:36:46 localhost mailman[13100]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/database/factory.py", line 60, in create Jun 8 03:36:46 localhost mailman[13100]: return database Jun 8 03:36:46 localhost mailman[13100]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/flufl/lock/_lockfile.py", line 470, in __exit__ Jun 8 03:36:46 localhost mailman[13100]: self.unlock() Jun 8 03:36:46 localhost mailman[13100]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/flufl/lock/_lockfile.py", line 420, in unlock Jun 8 03:36:46 localhost mailman[13100]: raise NotLockedError('Already unlocked') Jun 8 03:36:46 localhost mailman[13100]: flufl.lock._lockfile.NotLockedError: Already unlocked Jun 8 03:36:56 localhost mailman[13109]: Traceback (most recent call last): Jun 8 03:36:56 localhost mailman[13109]: File "/var/tmp/mailman/.local/bin/runner", line 8, in <module> Jun 8 03:36:56 localhost mailman[13109]: sys.exit(main()) Jun 8 03:36:56 localhost mailman[13109]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 1130, in __call__ Jun 8 03:36:56 localhost mailman[13109]: return self.main(*args, **kwargs) Jun 8 03:36:56 localhost mailman[13109]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 1055, in main Jun 8 03:36:56 localhost mailman[13109]: rv = self.invoke(ctx) Jun 8 03:36:56 localhost mailman[13109]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 1404, in invoke Jun 8 03:36:56 localhost mailman[13109]: return ctx.invoke(self.callback, **ctx.params) Jun 8 03:36:56 localhost mailman[13109]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 760, in invoke Jun 8 03:36:56 localhost mailman[13109]: return __callback(*args, **kwargs) Jun 8 03:36:56 localhost mailman[13109]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func Jun 8 03:36:56 localhost mailman[13109]: return f(get_current_context(), *args, **kwargs) Jun 8 03:36:56 localhost mailman[13109]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/bin/runner.py", line 160, in main Jun 8 03:36:56 localhost postfix/anvil[13147]: statistics: max connection rate 1/60s for (smtp:163.123.143.10) at Jun 8 03:33:27 Jun 8 03:37:00 localhost postfix/anvil[13147]: statistics: max connection count 1 for (smtp:163.123.143.10) at Jun 8 03:33:27 Jun 8 03:37:00 localhost mailman[13109]: initialize(config_file, verbose) Jun 8 03:37:01 localhost mailman[13109]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/core/initialize.py", line 231, in initialize Jun 8 03:37:01 localhost mailman[13109]: initialize_2(propagate_logs=propagate_logs) Jun 8 03:37:01 localhost mailman[13109]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/core/initialize.py", line 189, in initialize_2 Jun 8 03:37:01 localhost mailman[13109]: config.db = getUtility(IDatabaseFactory, utility_name).create() Jun 8 03:37:01 localhost mailman[13109]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/database/factory.py", line 60, in create Jun 8 03:37:01 localhost mailman[13109]: return database Jun 8 03:37:01 localhost mailman[13109]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/flufl/lock/_lockfile.py", line 470, in __exit__ Jun 8 03:37:01 localhost mailman[13109]: self.unlock() Jun 8 03:37:01 localhost mailman[13109]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/flufl/lock/_lockfile.py", line 420, in unlock Jun 8 03:37:01 localhost mailman[13109]: raise NotLockedError('Already unlocked') Jun 8 03:37:01 localhost mailman[13109]: flufl.lock._lockfile.NotLockedError: Already unlocked Jun 8 03:37:01 localhost postfix/anvil[13147]: statistics: max cache size 1 at Jun 8 03:33:27 Jun 8 03:37:01 localhost mailman[13110]: Traceback (most recent call last): Jun 8 03:37:01 localhost mailman[13110]: File "/var/tmp/mailman/.local/bin/runner", line 8, in <module> Jun 8 03:37:01 localhost mailman[13110]: sys.exit(main()) Jun 8 03:37:01 localhost mailman[13110]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 1130, in __call__ Jun 8 03:37:01 localhost mailman[13110]: return self.main(*args, **kwargs) Jun 8 03:37:01 localhost mailman[13110]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 1055, in main Jun 8 03:37:01 localhost mailman[13110]: rv = self.invoke(ctx) Jun 8 03:37:01 localhost mailman[13110]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 1404, in invoke Jun 8 03:37:01 localhost mailman[13110]: return ctx.invoke(self.callback, **ctx.params) Jun 8 03:37:01 localhost mailman[13110]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 760, in invoke Jun 8 03:37:01 localhost mailman[13110]: return __callback(*args, **kwargs) Jun 8 03:37:01 localhost mailman[13110]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func Jun 8 03:37:01 localhost mailman[13110]: return f(get_current_context(), *args, **kwargs) Jun 8 03:37:01 localhost mailman[13110]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/bin/runner.py", line 160, in main Jun 8 03:37:01 localhost mailman[13110]: initialize(config_file, verbose) Jun 8 03:37:01 localhost mailman[13110]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/core/initialize.py", line 231, in initialize Jun 8 03:37:01 localhost mailman[13110]: initialize_2(propagate_logs=propagate_logs) Jun 8 03:37:01 localhost mailman[13110]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/core/initialize.py", line 189, in initialize_2 Jun 8 03:37:01 localhost mailman[13110]: config.db = getUtility(IDatabaseFactory, utility_name).create() Jun 8 03:37:01 localhost mailman[13110]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/database/factory.py", line 60, in create Jun 8 03:37:01 localhost mailman[13110]: return database Jun 8 03:37:01 localhost mailman[13110]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/flufl/lock/_lockfile.py", line 470, in __exit__ Jun 8 03:37:01 localhost mailman[13110]: self.unlock() Jun 8 03:37:01 localhost mailman[13110]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/flufl/lock/_lockfile.py", line 420, in unlock Jun 8 03:37:01 localhost mailman[13110]: raise NotLockedError('Already unlocked') Jun 8 03:37:01 localhost mailman[13110]: flufl.lock._lockfile.NotLockedError: Already unlocked --- note that processes 13100, 13107, 13109, and 13110 are the ones missing from the sequential process numbers shown by systemctl status mailman below, were presumably momentarily the missing command, digests, retry, and virgin runners ...
-- all of the recent lines from /var/tmp/mailman/logs/mailman.log : Jun 08 03:24:17 2023 (4268) Master stopped Jun 08 03:30:18 2023 (13093) Master started Jun 08 03:43:31 2023 (13108) task runner started. Jun 08 03:43:52 2023 (13108) Task runner evicted 0 expired pendings Jun 08 03:43:52 2023 (13108) Task runner deleted 0 orphaned workflows Jun 08 03:43:53 2023 (13108) Task runner deleted 0 orphaned requests Jun 08 03:43:54 2023 (13108) Task runner deleted 0 orphaned messages Jun 08 03:43:54 2023 (13108) Task runner evicted expired cache entries Jun 08 03:44:17 2023 (13101) in runner started. Jun 08 03:44:23 2023 (13105) pipeline runner started. Jun 08 03:44:54 2023 (13103) nntp runner started. Jun 08 03:45:59 2023 (13098) archive runner started. Jun 08 03:46:06 2023 (13104) out runner started. Jun 08 03:46:23 2023 (13099) bounces runner started. Jun 08 03:46:43 2023 (13106) rest runner started. [2023-06-08 03:46:43 +0000] [13106] [INFO] Starting gunicorn 20.1.0 [2023-06-08 03:46:43 +0000] [13106] [INFO] Listening at: http://127.0.0.1:8001 (13106) [2023-06-08 03:46:43 +0000] [13106] [INFO] Using worker: sync [2023-06-08 03:46:44 +0000] [13345] [INFO] Booting worker with pid: 13345 [2023-06-08 03:46:44 +0000] [13346] [INFO] Booting worker with pid: 13346 Jun 08 03:46:46 2023 (13102) lmtp runner started. Jun 08 04:44:04 2023 (13108) Task runner evicted 0 expired pendings Jun 08 04:44:04 2023 (13108) Task runner deleted 0 orphaned workflows Jun 08 04:44:05 2023 (13108) Task runner deleted 0 orphaned requests Jun 08 04:44:05 2023 (13108) Task runner deleted 0 orphaned messages Jun 08 04:44:05 2023 (13108) Task runner evicted expired cache entries {with these last five lines being repeated hourly since ...} --
root@localhost:~# systemctl status mailman ● mailman.service - GNU Mailing List Manager Loaded: loaded (/etc/systemd/system/mailman.service; enabled; vendor preset: enabled) Active: active (running) since Thu 2023-06-08 03:30:18 UTC; 13h ago Process: 13092 ExecStart=/var/tmp/mailman/.local/bin/mailman --config /var/tmp/mailman/var/etc/mailman.cfg start (code=exited, status=0/SUCCESS) Main PID: 13093 (python3) Tasks: 17 (limit: 465) Memory: 73.5M CPU: 15min 23.511s CGroup: /system.slice/mailman.service ├─13093 /usr/bin/python3 /var/tmp/mailman/.local/bin/master -C /var/tmp/mailman/var/etc/mailman.cfg ├─13098 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=archive:0:1 ├─13099 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=bounces:0:1 ├─13101 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=in:0:1 ├─13102 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=lmtp:0:1 ├─13103 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=nntp:0:1 ├─13104 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=out:0:1 ├─13105 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=pipeline:0:1 ├─13106 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=rest:0:1 ├─13108 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=task:0:1 ├─13345 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=rest:0:1 └─13346 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=rest:0:1
Jun 08 03:37:01 localhost mailman[13110]: initialize_2(propagate_logs=propagate_logs) Jun 08 03:37:01 localhost mailman[13110]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/core/initialize.py", line 189, in initialize_2 Jun 08 03:37:01 localhost mailman[13110]: config.db = getUtility(IDatabaseFactory, utility_name).create() Jun 08 03:37:01 localhost mailman[13110]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/database/factory.py", line 60, in create Jun 08 03:37:01 localhost mailman[13110]: return database Jun 08 03:37:01 localhost mailman[13110]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/flufl/lock/_lockfile.py", line 470, in __exit__ Jun 08 03:37:01 localhost mailman[13110]: self.unlock() Jun 08 03:37:01 localhost mailman[13110]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/flufl/lock/_lockfile.py", line 420, in unlock Jun 08 03:37:01 localhost mailman[13110]: raise NotLockedError('Already unlocked') Jun 08 03:37:01 localhost mailman[13110]: flufl.lock._lockfile.NotLockedError: Already unlocked root@localhost:~#
On 6/8/23 13:02, Mark Sapiro wrote:
On 6/7/23 21:30, Nelson Strother wrote:
Interestingly if I use
systemctl start mailman
thus far the results are: process starting from .../mailman/.local/bin/runner and ending with
- the wall clock duration is shorter than when
mailman start
is issued from the command line, and -- either that all of the runners remain present once they are started, -- or I can see in syslog a traceback from each missing runner"flufl.lock._lockfile.NotLockedError: Already unlocked".
I do not yet understand how to make use of these clues, but at least one can see an epitaph from each deceased process.
The NotLockedError is most likely because a lock has been set with a lifetime and its lifetime has passed and some other process has broken the lock before the original process tries to unlock it. This probably has something to do with the long times to do start/restart.
However, the only lock that should be involved is the master lock set by the master watcher, so I don't really understand what might be happening.
What is the full traceback from these errors?
On 6/8/23 11:16, Nelson Strother wrote:
Here are the (four identical, but for timestamps and process numbers, at a glance) tracebacks, and some related info:
--- from /var/log/syslog Jun 8 03:35:47 localhost mailman[13107]: Traceback (most recent call last): Jun 8 03:35:48 localhost mailman[13107]: File "/var/tmp/mailman/.local/bin/runner", line 8, in <module> Jun 8 03:35:48 localhost mailman[13107]: sys.exit(main()) Jun 8 03:35:48 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 1130, in __call__ Jun 8 03:35:48 localhost mailman[13107]: return self.main(*args, **kwargs) Jun 8 03:35:48 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 1055, in main Jun 8 03:35:48 localhost mailman[13107]: rv = self.invoke(ctx) Jun 8 03:35:48 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 1404, in invoke Jun 8 03:35:48 localhost mailman[13107]: return ctx.invoke(self.callback, **ctx.params) Jun 8 03:35:48 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 760, in invoke Jun 8 03:35:48 localhost mailman[13107]: return __callback(*args, **kwargs) Jun 8 03:35:48 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func Jun 8 03:35:48 localhost mailman[13107]: return f(get_current_context(), *args, **kwargs) Jun 8 03:35:48 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/bin/runner.py", line 160, in main Jun 8 03:35:49 localhost mailman[13107]: initialize(config_file, verbose) Jun 8 03:35:49 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/core/initialize.py", line 231, in initialize Jun 8 03:35:49 localhost mailman[13107]: initialize_2(propagate_logs=propagate_logs) Jun 8 03:35:49 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/core/initialize.py", line 189, in initialize_2 Jun 8 03:35:49 localhost mailman[13107]: config.db = getUtility(IDatabaseFactory, utility_name).create() Jun 8 03:35:49 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/database/factory.py", line 60, in create Jun 8 03:35:49 localhost mailman[13107]: return database Jun 8 03:35:49 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/flufl/lock/_lockfile.py", line 470, in __exit__ Jun 8 03:35:49 localhost mailman[13107]: self.unlock() Jun 8 03:35:49 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/flufl/lock/_lockfile.py", line 420, in unlock Jun 8 03:35:49 localhost mailman[13107]: raise NotLockedError('Already unlocked') Jun 8 03:35:49 localhost mailman[13107]: flufl.lock._lockfile.NotLockedError: Already unlocked
The runners are all trying to access the Mailman database. During the access <https://gitlab.com/mailman/mailman/-/blob/master/src/mailman/database/factor...> the database is locked. This code is executed by each runner and the lock keeps them from stepping on each other. It should only take milliseconds to execute, not minutes. The lock is created with the DEFAULT_LOCK_LIFETIME which is 15 seconds. The runner which has the lock is taking longer than 15 seconds in the create code and another waiting runner breaks the lock causing this error
I would look to your database manager for the issue.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
Well, this is using the default libsqlite 3.34.1 . One improvement is to use write-ahead logging: mailman@localhost:~$ sqlite3 /var/tmp/mailman/data/mailman.db SQLite version 3.34.1 2021-01-20 14:10:07 Enter ".help" for usage hints. sqlite> PRAGMA journal_mode = WAL; wal sqlite> This setting is persistent, and based on a small sample this seems to dramatically speed up the completion of root@localhost:~# systemctl stop mailman ## consumes <2 minutes with journal_mode = WAL; root@localhost:~# ## instead of up to 20 minutes with the default journal_mode = delete But this does not make a big change the completion time for: root@localhost:~# systemctl start mailman
Since there is no point in busy waiting when running on a processor with
only one core (as this virtual server), the next setting that I would
like to try is change the sqlite busy_timeout from the default of none
to 1 or 2 seconds, e.g.:
sqlite> PRAGMA busy_timeout = 2000;
2000
sqlite>
However, this setting is not persistent, and thus needs to be applied to
each connection (i.e. tens of database connections are made during
systemctl start mailman
). I have not yet found where to insert this
directive in mailman's database wrapper for SQLite. Might someone who
has read more of the source be able to offer me a clue?
On 6/8/23 15:47, Mark Sapiro wrote:
I would look to your database manager for the issue.
Nelson Strother writes:
I have not yet found where to insert this directive in mailman's database wrapper for SQLite. Might someone who has read more of the source be able to offer me a clue?
The only place I can think of offhand is
mailman/src/mailman/database/sqlite.py
which I think contains initialization code for the sqlite database. You may also want to look at the base.py and factory.py files to see if they invoke any hooks that you might use. sqlite.py is pretty short, there may not be a model statement for you there, but there might be models in some of the other database's files. I'm not sure if the alembic subdirectory would be helpful, I would guess not. AFAIK it works exclusively on the Mailman model to SQL side, rather than the SQL to implementation side.
Finally there may be some hints in
mailman/src/mailman/config/schema.cfg
if there's something like [database.sqlite] section, or other [database*] sections.
participants (5)
-
justfixit@marilynstrother.com
-
Mark Sapiro
-
Nelson Strother
-
Odhiambo Washington
-
Stephen J. Turnbull