Mailman sends Welcome emails, but does not send [List] postings emails
Having recently installed this same version of Mailman (GNU Mailman 3.3.8 (Tom Sawyer)) a few times successfully, I have a puzzle with a current new installation. Via the command line I can create lists and add members, and the new members receive their Welcome message from Mailman. But when a member sends an email to post to a list (e.g. hc@cs.org as shown here), the mail server delivers it to Mailman:

May 29 16:36:58 localhost postfix/lmtp[44386]: E762E427CB: to=<hc@cs.org>, relay=127.0.0.1[127.0.0.1]:8024, delay=0.84, delays=0.8/0/0.01/0.03, dsn=2.0.0, status=sent (250 Ok)

but no copy is sent to the members / subscribers. After the posts are received by Mailman (as recorded in /var/tmp/mailman/logs/smtp.log), nothing more is added to any of the log files in /var/tmp/mailman/logs/... An attempt to imitate a technique seen elsewhere for [logging.smtp], leaving some breadcrumbs by adding into mailman.cfg:
[logging.subscribe]
level: debug
path: mailman.log

and restarting Mailman led to no detectable change. Expecting no change, I even changed the default_member_action for one UNmoderated list from 'defer' to 'accept', to rule out some moderation-process delay:

>>> dump_json('http://localhost:8001/3.1/lists/hc.cs.org'
...           '/config/default_member_action',
...           dict(default_member_action='accept'), 'PATCH',
...           username='...', password='...')
date: Mon, 29 May 2023 16:31:31 GMT
server: gunicorn
status: 204
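For reference, the same PATCH can be made with plain requests; this is just a sketch, with placeholder credentials standing in for the [webservice] values in mailman.cfg:

import requests

# Placeholder credentials: use the admin user and password from the
# [webservice] section of your mailman.cfg.
resp = requests.patch(
    'http://localhost:8001/3.1/lists/hc.cs.org/config/default_member_action',
    data={'default_member_action': 'accept'},
    auth=('restadmin', 'restpass'),
)
print(resp.status_code)   # 204 on success, as in the transcript above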
How can I enable some kind of trace or gather more information to learn where this configuration is in error?
Cheers, Nelson
Very interesting! Upon restarting the system (down momentarily for taking a snapshot) and starting Mailman again, all of these missing [List] messages emerged from their timewarp and were delivered to the members / subscribers.

Just before shutting down the system I discovered that the apache2 webserver had been off the air due to a module not being enabled yet. As of yet, I was unaware of any requirement for a web server to be online (beyond possibly gunicorn) in order for Mailman to run its lists. Does that speculation have a factual basis? Or do I risk starting a vicious rumor amongst the impatient readers who are not still here once we deduce this pair of observations were merely correlated in time, with no cause-and-effect relationship?

Before this situation healed itself, the thought had crossed my mind to peek and see if anything was stuck in the queues: https://docs.mailman3.org/projects/mailman/en/latest/src/mailman/rest/docs/q... While that could have provided peace of mind that email messages were merely delayed instead of vanished, even in hindsight I do not see how that would have led to repairing the situation.
In the absence of a hypothetical command:

mailman status --verbose

about which diagnostic tools or clues should I learn?
On Wed, May 31, 2023 at 4:18 PM <justfixit@marilynstrother.com> wrote:
Very interesting! Upon restarting the system ... all of these missing [List] messages emerged from their timewarp and were delivered to the members / subscribers. ... I was unaware of any requirement for a web server to be online (beyond possibly gunicorn) in order for Mailman to run its lists. Does that speculation have a factual basis? ...
Mailman3 is in two parts which are completely INDEPENDENT:

- Mailman Core - Details at https://docs.mailman3.org/en/latest/install/virtualenv.html#installing-mailm...
- Mailman Web - Details at https://docs.mailman3.org/en/latest/install/virtualenv.html#installing-web-u...

Only Mailman Web requires a web server. To run a mailing list, you need Mailman Core + Database + MTA. No webserver is required.
In the absence of a hypothetical command:

mailman status --verbose

about which diagnostic tools or clues should I learn?
You should learn how to configure the logging for Mailman in mailman.cfg and how to read and interpret the information written to the logfiles :-)
PS: I know MM3 is a big animal and isn't easy to understand. I have been through that - and am still there. With time you'll just understand the bits, so don't despair.
--
Best regards,
Odhiambo WASHINGTON, Nairobi, KE
+254 7 3200 0004 / +254 7 2274 3223
"Oh, the cruft.", egrep -v '^$|^.*#' ¯\_(ツ)_/¯ :-)
[How to ask smart questions: http://www.catb.org/~esr/faqs/smart-questions.html]
... learn how to configure the logging ...

On the chance it may be helpful for someone else, it seems like the best collection of information about logging is in:

https://docs.mailman3.org/projects/mailman/en/latest/build/lib/mailman/confi...

where one can see that my trial-and-error approach with [logging.subscribe] would be expected to have failed, since it provides "Information about leaves/joins" but not what I had imagined of "information from implementing the current subscriptions".
On 5/31/23 06:17, justfixit@marilynstrother.com wrote:
Very interesting! Upon restarting the system ... and starting Mailman again, all of these missing [List] messages emerged from their timewarp and were delivered to the members / subscribers. ... the apache2 webserver had been off the air due to a module not being enabled yet. ... Does that speculation have a factual basis? ... Before this situation healed itself, the thought had crossed my mind to peek and see if anything was stuck in the queues: https://docs.mailman3.org/projects/mailman/en/latest/src/mailman/rest/docs/q... While that could have provided peace of mind that email messages were merely delayed instead of vanished, even in hindsight I do not see how that would have led to repairing the situation.
I suspect if you had looked, you would have found the messages in Mailman's pipeline queue, and they were not being processed because the pipeline runner wasn't running, although it is possible that the issue was with the in runner/queue. It wasn't the out runner/queue if welcome messages were being delivered.
The fact that the web server wasn't running is just a coincidence. It has nothing to do with Mailman core's processing of messages with the one exception that Mailman core needs to communicate with HyperKitty's API via mailman_hyperkitty, but this only affects archiving to HyperKitty and the addition of an Archived-At: header to messages.
In the absence of a hypothetical command:

mailman status --verbose

about which diagnostic tools or clues should I learn?
Logs are possibly the most useful. You might find something in mailman.log about the death of whatever runner had died. Looking at queues is also useful.
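For instance, queue backlogs show up as .pck pickle files under Mailman's var/queue directory; a minimal Python sketch, assuming the /var/tmp/mailman layout used in this thread, to count them:

from pathlib import Path

# Assumption: var directory as used elsewhere in this thread; adjust to
# your installation ('mailman info' prints the configuration in use).
QUEUE_DIR = Path('/var/tmp/mailman/queue')

for qdir in sorted(QUEUE_DIR.iterdir()):
    if qdir.is_dir():
        # Each queued message is a .pck file written by the switchboard.
        backlog = list(qdir.glob('*.pck'))
        print(f'{qdir.name:10} {len(backlog)} queued')

A growing count in one queue while its runner's log stays silent points at that runner being dead or stuck.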
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
Logs are possibly the most useful. You might find something in mailman.log about the death of whatever runner had died. Looking at queues is also useful.

FWIW, looking back at this time period in mailman.log, the only instances of "runner exiting" are in groups, at the times when I had manually issued "mailman restart" to try a few different debug logging options. I.e. there was no record, single or multiple, of "runner exiting" from e.g. an unexpected program fault.
On 5/31/23 22:49, justfixit@marilynstrother.com wrote:
FWIW, looking back at this time period in mailman.log the only instances of "runner exiting" are in groups, at the times when I had manually issued "mailman restart" to try a few different debug logging options. I.e. there was no single / multiple record of "runner exiting" from e.g. an unexpected program fault.
When you issued a restart and the logs show runners restarting, did all the runners restart, or was one or more missing? If you grep the log for "runner started", you should see groups of messages like
Jun 01 08:00:27 2023 (21930) in runner started.
Jun 01 08:00:27 2023 (21938) virgin runner started.
Jun 01 08:00:28 2023 (21932) bounces runner started.
Jun 01 08:00:28 2023 (21935) command runner started.
Jun 01 08:00:28 2023 (21941) retry runner started.
Jun 01 08:00:28 2023 (21934) archive runner started.
Jun 01 08:00:28 2023 (21931) lmtp runner started.
Jun 01 08:00:28 2023 (21942) pipeline runner started.
Jun 01 08:00:29 2023 (21939) out runner started.
Jun 01 08:00:29 2023 (21936) digest runner started.
Jun 01 08:00:29 2023 (21937) nntp runner started.
Jun 01 08:00:30 2023 (21940) task runner started.
and if any runners are sliced, there will be multiple entries for that runner, one per slice. Also, because of the way the rest runner is run, it will be in the list on a start but not on a restart.

If a runner had previously died so it was not running when you did the restart, it won't be started and will be missing from the above list. In order to restart runners which have died, you need to do "mailman stop" followed by "mailman start", as opposed to "mailman restart".
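As a small Python sketch of that check against mailman.log (path as used in this thread), comparing the most recent group of "runner started" entries with the expected set from the excerpt above:

import re

LOG = '/var/tmp/mailman/logs/mailman.log'   # path used earlier in this thread
EXPECTED = {'in', 'virgin', 'bounces', 'command', 'retry', 'archive',
            'lmtp', 'pipeline', 'out', 'digest', 'nntp', 'task'}

started = []
for line in open(LOG):
    m = re.search(r'\(\d+\) (\S+) runner started', line)
    if m:
        started.append(m.group(1))

# Runners start in groups; look at the most recent group of entries.
# (rest is absent on a restart, so it is not in EXPECTED above.)
recent = set(started[-len(EXPECTED):])
missing = EXPECTED - recent
print('missing runners:', ', '.join(sorted(missing)) or 'none')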
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
As you had suggested, the in runner/queue was missing in action. Given your insight about the limitations of "mailman restart", I looked back at the log from the time of the preceding "mailman start", and found that there was NOT a line such as:

in runner started.

in that instance! Watching the logs on each subsequent "mailman start", the good news is that the in runner/queue has been started each time; the bad news is that slightly more than half of the time, one or two runners are NOT shown as starting. The other runners/queues sometimes missing in action per the log file have been retry, task, nntp, archive, and pipeline.

No errors are being recorded in the mailman log files. This is GNU Mailman 3.3.8 via "pip install mailman" on Debian 5.10.179-1 (2023-05-12), running on a shared system where VMware gives this server enough cycles that "mailman start" and "mailman stop" each consume from 20 minutes to an hour of wall clock time, so I do not issue those commands recreationally, attempting to keep the system available for users. What should I do to help understand the cause of these failures?

Would it not be helpful for this limitation of restart to be included in "mailman restart --help", with a suggestion to use "mailman stop" and "mailman start" instead?
On 6/1/23 11:16, Mark Sapiro wrote:
When you issued a restart and the logs show runners restarting, did all the runners restart, or was one or more missing?
On 6/6/23 07:58, Nelson Strother wrote:

No errors are being recorded in the mailman log files. This is GNU Mailman 3.3.8 via "pip install mailman" on Debian 5.10.179-1 (2023-05-12), running on a shared system where VMware gives this server enough cycles that "mailman start" and "mailman stop" each consume from 20 minutes to an hour of wall clock time, so I do not issue those commands recreationally, attempting to keep the system available for users. What should I do to help understand the cause of these failures?
If a runner has died, its death and a reason should be logged in Mailman's var/logs/mailman.log with a message similar to
Master detected subprocess exit (pid: 8617, why: SIGNAL 15, class: in, slice: 1/1)
This may not be the case if the runner is killed by the OS for an out of memory or similar reason. For this, look in syslog.
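For example, a quick scan for OOM-killer activity, assuming Debian's default /var/log/syslog location:

# Print any kernel OOM lines; on Debian these land in /var/log/syslog.
for line in open('/var/log/syslog', errors='replace'):
    if 'Out of memory' in line or 'oom-killer' in line:
        print(line.rstrip())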
"mailman stop" can take a long time because it is waiting for a runner to stop. See https://gitlab.com/mailman/mailman/-/issues/255 but that issue was fixed long ago. I don't understand why "mailman start" would take more time than "mailman restart". In fact, "mailman restart" effectively does stop and start, but only for those runners which are running.
Since you seem to frequently have missing runners, I suspect something like an OOM condition is causing the OS to kill them. Although, I wonder if you are correctly interpreting the logs. While the absence of the retry, task, nntp and archive runners might not be noticed except for messages not being archived, if either the in or pipeline runner is not running, no list posts will be processed.
Would it not be helpful for this limitation of restart to be included in "mailman restart --help", with a suggestion to use "mailman stop" and "mailman start" instead?
I have just filed https://gitlab.com/mailman/mailman/-/issues/1082 for this.
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
On 6/6/23 11:47, Mark Sapiro wrote:
Although, I wonder if you are correctly interpreting the logs. While the absence of the retry, task, nntp and archive runners might not be noticed except for messages not being archived, if either the in or pipeline runner is not running, no list posts will be processed.
Ignore that. I was thinking of a different thread with different symptoms when I wrote it.
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
Neither .../logs/mailman.log nor /var/log/syslog contain any recording of "Master detected ..." nor process exits other than at the time of "mailman stop" commands. Watching during these transitions with vmstat shows a decrease in free pages and an increase in swapping, but no definitive hint of the OOM reaper running.
I did not intend to imply that "mailman start" (predictably) consumes more time than "mailman restart". Both are lengthy and variable; I think one cause for "mailman stop" consuming more time than usual is when Mailman receives a message as it is trying to stop, which is understandable.
Interestingly, if I use "systemctl start mailman", thus far the results are:

- the wall clock duration is shorter than when "mailman start" is issued from the command line, and
-- either all of the runners remain present once they are started,
-- or I can see in syslog a traceback from each missing runner process, starting from .../mailman/.local/bin/runner and ending with "flufl.lock._lockfile.NotLockedError: Already unlocked".

I do not yet understand how to make use of these clues, but at least one can see an epitaph from each deceased process.
On 6/6/23 14:47, Mark Sapiro wrote:

If a runner has died, its death and a reason should be logged in Mailman's var/logs/mailman.log ... This may not be the case if the runner is killed by the OS for an out of memory or similar reason. For this, look in syslog. ... Since you seem to frequently have missing runners, I suspect something like an OOM condition is causing the OS to kill them. ... I have just filed https://gitlab.com/mailman/mailman/-/issues/1082 for this.
Nelson Strother writes:
I did not intend to imply that "mailman start" (predictably) consumes more time than "mailman restart". Both are lengthy and variable;
You did say something about "30min" I believe. I don't understand why it would take anywhere near that long. For example, I'm working with a site with ~20,000 lists, and it only takes a few seconds.
Oh ... wait ... are you using Postfix as the MTA? The supplied Postfix MTA interface rebuilds the list aliases database on every (re)start, and that's an O(#lists^2) process the way Mailman does it. Postfix doesn't actually use the file to route mail; it always goes to lmtp:127.0.0.1 or something like that. So I configured Postfix to probe for the existence of the list in Mailman's database.
I think Postfix has plugins for all Mailman-supported databases (PostgreSQL, MySQL/MariaDB, SQLite3). I can provide the changes needed for this for PostgreSQL right away; if you're a MySQL site I can give you the PostgreSQL config to give you the idea, but I may be slow if you want advice about the MySQL config. It's pretty straightforward, though.
I do not yet understand how to make use of these clues, but at least one can see an epitaph from each deceased process.
I don't understand them either. The only thing I can think of is that a runner may try to access Mailman core, timeout, and crash during the long startup.
At least the MTA reconfig would make the restart cycle a lot less painful. Note: this is NOT a Mailman-supported config (or even contrib quality) yet, but I'm pretty confident of the theory and it is working well in QA testing.
Regarding the "alpha" status of the config, one Mailman 3 instance can support multiple domains, but the Postfix docs say not to hit the database for the "domain exists" check. If you aren't going to add additional Mailman domains, then this doesn't matter (the domain list file produced by the stock Mailman postfix config will work), but if you do add or remove domains, you either need to configure Postfix to hit the database for the "domain exists" check too contrary to Postfix's advice, or you'd need to manage the domain list file by hand.
In any case, you could use this config for debugging, and go back to the tried and true aliases file approach for production. And as I indicated I'm somewhat available on a best-effort basis for trouble-shooting.
Yes, Postfix is also running, but this system currently has a trivially small configuration (single domain, single-digit number of lists). When the CPU is kept ~100% busy for (maybe once as few as a) dozen minutes to (usually 20 minutes, but rarely) an hour [1] during "mailman start", essentially all of the CPU time is being consumed by the dozen or so Mailman runner processes, so I do not think Postfix alias processing has much if anything to do with these delays.
[1] wall clock time; as an arbitrary indicator of how many cycles VMWare is giving to this server, here is what 7-zip observes:
x@localhost:~$ 7za b

7-Zip (a) [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,1 CPU Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz (406F0),ASM,AES-NI)

Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz (406F0)
CPU Freq: - - - - - - - - -

RAM size:     425 MB,  # CPU hardware threads:   1
RAM usage:    219 MB,  # Benchmark threads:      1

                 Compressing           |        Decompressing
Dict   Speed  Usage   R/U  Rating      |  Speed  Usage   R/U  Rating
       KiB/s      %  MIPS    MIPS      |  KiB/s      %  MIPS    MIPS

22:     2909     99  2861    2831      |  29117     99  2502    2486
23:     2370     88  2741    2415      |  28833     99  2512    2496
24:     1937     79  2622    2083      |  27937     98  2501    2453
----------------------------------     | ------------------------------
Avr:            89   2741    2443      |           99   2505    2478
Tot:            94   2623    2461
x@localhost:~$
On 6/8/23 10:14, Stephen J. Turnbull wrote:

Oh ... wait ... are you using Postfix as the MTA? The supplied Postfix MTA interface rebuilds the list aliases database on every (re)start, and that's an O(#lists^2) process the way Mailman does it. ... At least the MTA reconfig would make the restart cycle a lot less painful. ...
On 6/7/23 21:30, Nelson Strother wrote:
Interestingly, if I use "systemctl start mailman", thus far the results are:

- the wall clock duration is shorter than when "mailman start" is issued from the command line, and
-- either all of the runners remain present once they are started,
-- or I can see in syslog a traceback from each missing runner process, starting from .../mailman/.local/bin/runner and ending with "flufl.lock._lockfile.NotLockedError: Already unlocked".

I do not yet understand how to make use of these clues, but at least one can see an epitaph from each deceased process.
The NotLockedError is most likely because a lock has been set with a lifetime and its lifetime has passed and some other process has broken the lock before the original process tries to unlock it. This probably has something to do with the long times to do start/restart.
However, the only lock that should be involved is the master lock set by the master watcher, so I don't really understand what might be happening.
What is the full traceback from these errors?
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
Here are the (four identical, but for timestamps and process numbers, at a glance) tracebacks, and some related info:
--- from /var/log/syslog

Jun 8 03:35:47 localhost mailman[13107]: Traceback (most recent call last):
Jun 8 03:35:48 localhost mailman[13107]: File "/var/tmp/mailman/.local/bin/runner", line 8, in <module>
Jun 8 03:35:48 localhost mailman[13107]: sys.exit(main())
Jun 8 03:35:48 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
Jun 8 03:35:48 localhost mailman[13107]: return self.main(*args, **kwargs)
Jun 8 03:35:48 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 1055, in main
Jun 8 03:35:48 localhost mailman[13107]: rv = self.invoke(ctx)
Jun 8 03:35:48 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
Jun 8 03:35:48 localhost mailman[13107]: return ctx.invoke(self.callback, **ctx.params)
Jun 8 03:35:48 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/core.py", line 760, in invoke
Jun 8 03:35:48 localhost mailman[13107]: return __callback(*args, **kwargs)
Jun 8 03:35:48 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
Jun 8 03:35:48 localhost mailman[13107]: return f(get_current_context(), *args, **kwargs)
Jun 8 03:35:48 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/bin/runner.py", line 160, in main
Jun 8 03:35:49 localhost mailman[13107]: initialize(config_file, verbose)
Jun 8 03:35:49 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/core/initialize.py", line 231, in initialize
Jun 8 03:35:49 localhost mailman[13107]: initialize_2(propagate_logs=propagate_logs)
Jun 8 03:35:49 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/core/initialize.py", line 189, in initialize_2
Jun 8 03:35:49 localhost mailman[13107]: config.db = getUtility(IDatabaseFactory, utility_name).create()
Jun 8 03:35:49 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/database/factory.py", line 60, in create
Jun 8 03:35:49 localhost mailman[13107]: return database
Jun 8 03:35:49 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/flufl/lock/_lockfile.py", line 470, in __exit__
Jun 8 03:35:49 localhost mailman[13107]: self.unlock()
Jun 8 03:35:49 localhost mailman[13107]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/flufl/lock/_lockfile.py", line 420, in unlock
Jun 8 03:35:49 localhost mailman[13107]: raise NotLockedError('Already unlocked')
Jun 8 03:35:49 localhost mailman[13107]: flufl.lock._lockfile.NotLockedError: Already unlocked
Jun 8 03:35:49 localhost dhclient[633]: RCV: Advertise message on ens192 from fe80::250:56ff:fe95:b457.
{three further tracebacks from PIDs 13100, 13109, and 13110, identical but for timestamps and some interleaved postfix/anvil and dhclient lines, elided}

--- note that processes 13100, 13107, 13109, and 13110 are the ones missing from the sequential process numbers shown by "systemctl status mailman" below, and were presumably the momentarily missing command, digest, retry, and virgin runners ...
-- all of the recent lines from /var/tmp/mailman/logs/mailman.log :

Jun 08 03:24:17 2023 (4268) Master stopped
Jun 08 03:30:18 2023 (13093) Master started
Jun 08 03:43:31 2023 (13108) task runner started.
Jun 08 03:43:52 2023 (13108) Task runner evicted 0 expired pendings
Jun 08 03:43:52 2023 (13108) Task runner deleted 0 orphaned workflows
Jun 08 03:43:53 2023 (13108) Task runner deleted 0 orphaned requests
Jun 08 03:43:54 2023 (13108) Task runner deleted 0 orphaned messages
Jun 08 03:43:54 2023 (13108) Task runner evicted expired cache entries
Jun 08 03:44:17 2023 (13101) in runner started.
Jun 08 03:44:23 2023 (13105) pipeline runner started.
Jun 08 03:44:54 2023 (13103) nntp runner started.
Jun 08 03:45:59 2023 (13098) archive runner started.
Jun 08 03:46:06 2023 (13104) out runner started.
Jun 08 03:46:23 2023 (13099) bounces runner started.
Jun 08 03:46:43 2023 (13106) rest runner started.
[2023-06-08 03:46:43 +0000] [13106] [INFO] Starting gunicorn 20.1.0
[2023-06-08 03:46:43 +0000] [13106] [INFO] Listening at: http://127.0.0.1:8001 (13106)
[2023-06-08 03:46:43 +0000] [13106] [INFO] Using worker: sync
[2023-06-08 03:46:44 +0000] [13345] [INFO] Booting worker with pid: 13345
[2023-06-08 03:46:44 +0000] [13346] [INFO] Booting worker with pid: 13346
Jun 08 03:46:46 2023 (13102) lmtp runner started.
Jun 08 04:44:04 2023 (13108) Task runner evicted 0 expired pendings
Jun 08 04:44:04 2023 (13108) Task runner deleted 0 orphaned workflows
Jun 08 04:44:05 2023 (13108) Task runner deleted 0 orphaned requests
Jun 08 04:44:05 2023 (13108) Task runner deleted 0 orphaned messages
Jun 08 04:44:05 2023 (13108) Task runner evicted expired cache entries
{with these last five lines being repeated hourly since ...}
--
root@localhost:~# systemctl status mailman
● mailman.service - GNU Mailing List Manager
     Loaded: loaded (/etc/systemd/system/mailman.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2023-06-08 03:30:18 UTC; 13h ago
    Process: 13092 ExecStart=/var/tmp/mailman/.local/bin/mailman --config /var/tmp/mailman/var/etc/mailman.cfg start (code=exited, status=0/SUCCESS)
   Main PID: 13093 (python3)
      Tasks: 17 (limit: 465)
     Memory: 73.5M
        CPU: 15min 23.511s
     CGroup: /system.slice/mailman.service
             ├─13093 /usr/bin/python3 /var/tmp/mailman/.local/bin/master -C /var/tmp/mailman/var/etc/mailman.cfg
             ├─13098 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=archive:0:1
             ├─13099 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=bounces:0:1
             ├─13101 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=in:0:1
             ├─13102 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=lmtp:0:1
             ├─13103 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=nntp:0:1
             ├─13104 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=out:0:1
             ├─13105 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=pipeline:0:1
             ├─13106 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=rest:0:1
             ├─13108 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=task:0:1
             ├─13345 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=rest:0:1
             └─13346 /usr/bin/python3 /var/tmp/mailman/.local/bin/runner -C /var/tmp/mailman/var/etc/mailman.cfg --runner=rest:0:1

Jun 08 03:37:01 localhost mailman[13110]: initialize_2(propagate_logs=propagate_logs)
Jun 08 03:37:01 localhost mailman[13110]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/core/initialize.py", line 189, in initialize_2
Jun 08 03:37:01 localhost mailman[13110]: config.db = getUtility(IDatabaseFactory, utility_name).create()
Jun 08 03:37:01 localhost mailman[13110]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/mailman/database/factory.py", line 60, in create
Jun 08 03:37:01 localhost mailman[13110]: return database
Jun 08 03:37:01 localhost mailman[13110]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/flufl/lock/_lockfile.py", line 470, in __exit__
Jun 08 03:37:01 localhost mailman[13110]: self.unlock()
Jun 08 03:37:01 localhost mailman[13110]: File "/var/tmp/mailman/.local/lib/python3.9/site-packages/flufl/lock/_lockfile.py", line 420, in unlock
Jun 08 03:37:01 localhost mailman[13110]: raise NotLockedError('Already unlocked')
Jun 08 03:37:01 localhost mailman[13110]: flufl.lock._lockfile.NotLockedError: Already unlocked
root@localhost:~#
On 6/8/23 13:02, Mark Sapiro wrote:

The NotLockedError is most likely because a lock has been set with a lifetime and its lifetime has passed and some other process has broken the lock before the original process tries to unlock it. This probably has something to do with the long times to do start/restart. ... What is the full traceback from these errors?
On 6/8/23 11:16, Nelson Strother wrote:
Here are the (four identical, but for timestamps and process numbers, at a glance) tracebacks, and some related info:
--- from /var/log/syslog
Jun 8 03:35:47 localhost mailman[13107]: Traceback (most recent call last):
{... full traceback as shown above ...}
Jun 8 03:35:49 localhost mailman[13107]: flufl.lock._lockfile.NotLockedError: Already unlocked
The runners are all trying to access the Mailman database. During the access <https://gitlab.com/mailman/mailman/-/blob/master/src/mailman/database/factory.py#L51> the database is locked. This code is executed by each runner, and the lock keeps them from stepping on each other. It should only take milliseconds to execute, not minutes. The lock is created with the DEFAULT_LOCK_LIFETIME, which is 15 seconds. The runner which has the lock is taking longer than 15 seconds in the create code, and another waiting runner breaks the lock, causing this error.
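A minimal sketch of that failure mode using flufl.lock directly (a single-process simulation of two competing runners; the one-second lifetime and the sleep stand in for the 15-second default and the slow database create):

import time
from datetime import timedelta
from flufl.lock import Lock, NotLockedError

# First "runner" takes the lock with a short lifetime.
first = Lock('/tmp/demo.lck', lifetime=timedelta(seconds=1))
first.lock()

time.sleep(2)            # the work takes longer than the lock lifetime

# A second waiting "runner" breaks the now-expired lock and takes it.
second = Lock('/tmp/demo.lck')
second.lock()

try:
    first.unlock()       # no longer the owner of the lock file
except NotLockedError as e:
    print('first runner dies with:', e)   # Already unlocked
second.unlock()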
I would look to your database manager for the issue.
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
Well, this is using the default libsqlite 3.34.1. One improvement is to use write-ahead logging:

mailman@localhost:~$ sqlite3 /var/tmp/mailman/data/mailman.db
SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite> PRAGMA journal_mode = WAL;
wal
sqlite>

This setting is persistent, and based on a small sample it seems to dramatically speed up the completion of

root@localhost:~# systemctl stop mailman   ## consumes <2 minutes with journal_mode = WAL
root@localhost:~#                          ## instead of up to 20 minutes with the default journal_mode = delete

But this does not make a big change to the completion time for:

root@localhost:~# systemctl start mailman
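The persisted journal mode can also be confirmed from Python's standard-library sqlite3 module:

import sqlite3

db = sqlite3.connect('/var/tmp/mailman/data/mailman.db')
print(db.execute('PRAGMA journal_mode').fetchone())   # ('wal',)
db.close()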
Since there is no point in busy waiting when running on a processor with only one core (as on this virtual server), the next setting that I would like to try is changing the sqlite busy_timeout from the default of none to 1 or 2 seconds, e.g.:

sqlite> PRAGMA busy_timeout = 2000;
2000
sqlite>

However, this setting is not persistent, and thus needs to be applied to each connection (i.e. tens of database connections are made during "systemctl start mailman"). I have not yet found where to insert this directive in Mailman's database wrapper for SQLite. Might someone who has read more of the source be able to offer me a clue?
On 6/8/23 15:47, Mark Sapiro wrote:
I would look to your database manager for the issue.
Nelson Strother writes:
I have not yet found where to insert this directive in mailman's database wrapper for SQLite. Might someone who has read more of the source be able to offer me a clue?
The only place I can think of offhand is mailman/src/mailman/database/sqlite.py, which I think contains initialization code for the sqlite database. You may also want to look at the base.py and factory.py files to see if they invoke any hooks that you might use. sqlite.py is pretty short, so there may not be a model statement for you there, but there might be models in some of the other databases' files. I'm not sure if the alembic subdirectory would be helpful; I would guess not. AFAIK it works exclusively on the Mailman model to SQL side, rather than the SQL to implementation side.
Finally, there may be some hints in mailman/src/mailman/config/schema.cfg if there's something like a [database.sqlite] section, or other [database*] sections.
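Since Mailman's database layer is built on SQLAlchemy, one hypothetical way to apply the non-persistent pragma to every connection is a 'connect' event hook on the engine; treat the exact insertion point (e.g. near where the engine is created in mailman/src/mailman/database/sqlite.py) as an assumption:

from sqlalchemy import create_engine, event

engine = create_engine('sqlite:////var/tmp/mailman/data/mailman.db')

@event.listens_for(engine, 'connect')
def set_busy_timeout(dbapi_connection, connection_record):
    # Runs once per new DB-API connection, so the non-persistent pragma
    # is applied everywhere without touching the callers.
    cursor = dbapi_connection.cursor()
    cursor.execute('PRAGMA busy_timeout = 2000')   # wait up to 2s on a locked DB
    cursor.close()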