Re: Importing mbox files into archive defect with lines with From
Hi Stephen
On 19/09/2023 19:59, Stephen J. Turnbull wrote:
Did you intend to send this to me only?
Nope, apologies. I'm happy to add the list back in - I thought I hit reply-to-list.
Alex Schuilenburg writes:
Yes, I retested from the git head as well. Same result. It sees the line
From 1st April 2019, our turnover has exceeded the threshold for
as the start of a new message and reports no anomalies, although there is clearly one.
If there is an empty line preceding that one, and the "F" is in the first column (no marginal whitespace), that is considered correct behavior for processing an mbox file.
As an email separator, agreed, provided the body has all "From " lines (preceded by a blank line and with nothing preceding on the same line) suitably escaped.
If you don't like it, don't use the mbox format. The maildir message-per-file format is widely supported, and if you want a folder-per-file rather than message-per-file format, MMDF is fairly widely supported, and does not suffer from the ambiguities of mbox. All three (with variations) are supported by recent Python 3.
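For readers who want to try the maildir route, here is a minimal sketch that uses the standard-library `mailbox` module to copy an mbox file into a Maildir. The function name `mbox_to_maildir` and the file paths are illustrative assumptions, not part of Mailman or HyperKitty, and the mbox is still split using the usual "From " rule, so it must already be correctly escaped.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a Maildir; return the count.

    Hypothetical helper for illustration only.
    """
    src = mailbox.mbox(str(mbox_path), create=False)
    dest = mailbox.Maildir(str(maildir_path), create=True)
    count = 0
    try:
        for message in src:
            # Each message becomes its own file, so body lines starting
            # with "From " are no longer ambiguous once stored here.
            dest.add(message)
            count += 1
        dest.flush()
    finally:
        src.close()
        dest.close()
    return count


if __name__ == "__main__":
    # Assumed file names; adjust to your own export.
    print(mbox_to_maildir("list-export.mbox", "list-export-maildir"))
```

`mailbox.MMDF` can be used in place of `mailbox.Maildir` in the same way if a single-file folder is preferred.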
Understood. My issue is the lists were imported from MM2.1 in 2020 into MM3.2.1 onto hyperkitty+mariadb/mysql.
So the current installation has no mbox files - archives are stored in a mysql database.
The old MM3 installation I am migrating away from is obviously broken in that its mbox archive when downloaded is not escaped,
That's interesting. I presume that's because HyperKitty saves messages in a database rather than as mbox files, and the old mbox exporter just flattened and concatenated the messages. Or perhaps that also uses mailbox.mbox and old enough versions of that didn't From-stuff.
Spot on.
I have to move the lists onto a new Debian 12 server using the native mailman 3.3.8 & mailman-web 0+20200530-2 packages. I tried dropping in the old mailman3 database under the new software but that did not work. Instead I manually imported the old mailman3 data directly into the new mailman3 database as, after inspection, there were no new tables and only a couple of additional fields, which have suitable defaults. So I dumped the mailman3 data with
mysqldump --no-create-info --no-create-db --disable-keys --complete-insert --ignore-table=mailman3.alembic_version mailman3
and simply imported the dump into the new schema. OK so far. The lists showed up in postorius, obviously without archives in hyperkitty.
The normal way to upgrade a HyperKitty archive is to do nothing, just upgrade the software. I guess you moved to a new host and deleted the database? The preferred way is to dump the database to SQL, and then load it in to the new database directly rather than downloading the mbox files and importing. No ambiguity and much faster.
That's what I thought initially, but that failed as per https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/message/....
My old installation appears to have a django_migrations table that is inconsistent with the state of the database, and Debian have been unresponsive so far.
Unfortunately the same mailman3-style manual import of mailman3web from the old mailman3web db into the new db was not possible. There were additional tables and fields that needed values rather than defaults. Instead I opted to download the mbox exports from the old installation via the web interface and import them into the new installation for expediency. This is where I was when I posted the message.
However, the old installation's mbox archive export (from the database) was problematic (e.g. 5min timeout) but after getting around those issues I ended up with some broken mbox files (e.g. the "From 1st April ..." line within a message, resulting in hyperkitty_import failures).
although I tested downloading this thread archive and it is clearly escaped. Unfortunately I cannot tell if hyperkitty_import works using the git head since I am restricted to using the version provided with Debian 12.
In July it worked for me on a site with 5300 archived lists. I assume some of them had From-stuffed lines. Can't be sure, most of the posts are machine-generated, but lines beginning with "From " are pretty common in natural English.
I'd be surprised if there were not any. I have 2003 lists and several occurrences.
There are indeed two versions though of mailbox.py on the VM:
/usr/lib/python3/dist-packages/mailman/utilities/mailbox.py from mailman3 (3.3.8-2~deb12u1)
/usr/lib/python3.11/mailbox.py from libpython3.11-stdlib (3.11.2-6)
The first applies one small change to the second (a so-called "monkey-patch") for use by Mailman core. It should only be visible to Python by the name "mailman.utilities.mailbox". The HyperKitty utilities import the name "mailbox". If overwriting the first with the second changes HyperKitty's behavior, something is wrong with sys.path.
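One way to check which file each name resolves to is to ask Python's import machinery directly. This is only a sketch: `module_origin` is a hypothetical helper, and it needs to be run with the same interpreter and environment that hyperkitty_import uses for the answer to be meaningful.

```python
import importlib.util


def module_origin(name):
    """Return the file a module name resolves to, or None if it is not found.

    Hypothetical helper for illustration only.
    """
    try:
        # Note: for a dotted name, find_spec imports the parent package.
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. "mailman") is not importable.
        return None
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    for name in ("mailbox", "mailman.utilities.mailbox"):
        print(f"{name}: {module_origin(name)}")
```

If the plain name `mailbox` resolves to a path under `mailman/utilities/`, that would point to the sys.path problem described above.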
Then I guess that is the case in Debian 12.
the latter of which appears far more substantial, so I dropped that module over the former and reran hyperkitty_import over the '>' escaped mbox, and while it did import without error, it did leave the escape in place (i.e. the 'From ...' was quoted when viewed).
This is considered correct behavior. It is not possible to determine whether the escape was in the original, or added by a receiving MTA. Better to leave it.
I thought that ">From " would be escaped to ">>From ", and so on, so the escape could easily be reversed when imported. I tested exports from the mailman-users lists and lines beginning "From " (preceded by a blank line) are escaped to ">From ", so incorrectly figured this would be unescaped by hyperkitty_import. After all, I would expect that an export of the archive to mbox, followed by a delete of the archive, followed by a hyperkitty_import of the archive, should leave you at the same place. Not with ">From " escapes in the new archives. In fact I also had a number of messages with "Message-ID: <>" and worse: all messages with attachments had the text/plain content empty.
So mbox exports from MM 3.2.1 on Debian 10 (using hyperkitty+mysql) are broken.
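The reversible scheme described above (">From " becomes ">>From ", and so on) is commonly known as the "mboxrd" convention. The sketch below illustrates that convention only; it is not what HyperKitty or hyperkitty_import does, and the function names are made up for this example. Unquoting is only safe on files that are known to have been written with the same scheme.

```python
import re

# "From " at the start of a line, preceded by zero or more ">" characters.
_FROM_LINE = re.compile(rb"^(>*From )", re.MULTILINE)
# The same, but with at least one ">" that can be stripped again.
_QUOTED_FROM_LINE = re.compile(rb"^>(>*From )", re.MULTILINE)


def quote_from_lines(body: bytes) -> bytes:
    """Prefix one ">" to every line matching ^>*From (mboxrd-style)."""
    return _FROM_LINE.sub(rb">\1", body)


def unquote_from_lines(body: bytes) -> bytes:
    """Strip exactly one ">" from every line matching ^>+From ."""
    return _QUOTED_FROM_LINE.sub(rb"\1", body)


if __name__ == "__main__":
    original = b"Hello,\n\nFrom 1st April 2019, our turnover ...\n>From a quote\n"
    stuffed = quote_from_lines(original)
    print(stuffed.decode())
    print(unquote_from_lines(stuffed) == original)
```

Because already-quoted lines gain an extra ">", the round trip is lossless. The simpler scheme that only turns "From " into ">From " cannot be reversed, since an original ">From " line is indistinguishable from a stuffed one.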
> The unescaped mbox import died in the same way.
As expected.
Anyhow, thanks for your suggestion. For now I can stick with a manual repair and spaced escape of From_.
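A manual repair can be made less tedious by listing the candidate lines first. The sketch below flags lines that start with "From " after a blank line but do not look like a separator. The separator pattern (sender followed by an asctime-style date) is a heuristic assumption, real mbox files vary, and `suspicious_from_lines` is a hypothetical helper, so the output should be reviewed by hand.

```python
import re

# Heuristic: "From <sender> <weekday> <month> <day> <hh:mm:ss> <year>".
_SEPARATOR = re.compile(
    r"^From \S+ +(Mon|Tue|Wed|Thu|Fri|Sat|Sun) "
    r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +\d{1,2} "
    r"\d{2}:\d{2}:\d{2} \d{4}"
)


def suspicious_from_lines(lines):
    """Yield (line_number, text) for "From " lines that follow a blank line
    (or start the file) but do not match the separator heuristic."""
    previous_blank = True  # the start of the file counts as a boundary
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\r\n")
        if previous_blank and text.startswith("From ") and not _SEPARATOR.match(text):
            yield number, text
        previous_blank = text == ""


if __name__ == "__main__":
    # Assumed file name; latin-1 avoids decode errors on mixed encodings.
    with open("list-export.mbox", encoding="latin-1") as handle:
        for number, text in suspicious_from_lines(handle):
            print(f"{number}: {text}")
```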
If the old database is still available, I recommend dumping that and loading it into a fresh version of the DBMS. ...
Thanks for the pointer. As I had already done the same with mailman3, I repeated the exercise. The following dump and import worked.
oldhost> mysqldump --no-create-info --no-create-db --disable-keys --complete-insert mailman3web > mailman3web.sql
newhost> mysql
MariaDB [(none)]> use mailman3web
MariaDB [mailman3web]> source mailman3web.sql
Thanks again
-- Alex
Alex Schuilenburg via Mailman-users writes:
Nope, apologies. I'm happy to add the list back in - I thought I hit reply-to-list.
No apologies necessary.
As an email separator, agreed, provided the body has all "From " lines (preceded by a blank line and with nothing preceding on the same line) suitably escaped.
That's not the way this works. You don't get to choose, only to defend your system. See https://www.jwz.org/doc/content-length.html
I have to move the lists onto a new Debian 12 server using the native mailman 3.3.8 & mailman-web 0+20200530-2 packages.
I think the relevant package version is HyperKitty's. Mailman-Web should just be a wrapper around HyperKitty and Postorius.
The preferred way is to dump the database to SQL, and then load it in to the new database directly rather than downloading the mbox files and importing.
That's what I thought initially, but that failed as per https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/message/....
My old installation appears to have a django_migrations table that is inconsistent with the state of the database, and Debian have been unresponsive so far.
Hm. I think it's more likely that the load overwrote the django_migrations table with the old migrations table, but the Debian-supplied database already had migrations applied on the assumption that you're either creating a new Mailman instance or upgrading in place. When you load the dumped database, that is probably smart enough to delete tables before creating them (or perhaps you just get lucky that the load doesn't try to delete or rename columns) BUT new tables do NOT get that treatment. They just sit around waiting for you to apply the migration that creates them and "what migration doing?" KA-BOOM.
I suspect that "DROP DATABASE mailmanweb;" and then loading the dumped database, followed by
mailman-web migrate
will work. (Usual caveat of you should have a backup onsite, a backup in a bank vault, and a backup store on the dark side of the moon before trying this!)
Thank you for going to Debian first (at about the same time is fine), by the way. We really appreciate that. We try hard to make sure Mailman works in all *our* common use cases, but distros have a different set. It's very common that a distro will do something that makes sense for their usual use cases but fails badly for cases they didn't anticipate.
Then I guess that is the case in Debian 12.
Maybe; it's often hard to tell where distros go wrong. If my guessalysis above is correct, it's simply the assumption that the user is either doing a greenfield installation or an upgrade in place. Surely those are the great majority of cases.
I thought that ">From " would be escaped to ">>From ", and so on, so the escape could easily be reversed when imported.
Ah, you are an honest person. Do not commit crimes, my friend, you will get caught. More devious thinkers quote messages by prepending only ">" to the line. Or, knowing about From-stuffing, senders of signed mail might pre-stuff From lines so that signature validation succeeds by default. Either way, if you unstuff you will break the message. Maybe ChatGPT-10 will get it right. ;-)
The only way to win is to not play the mbox game.
After all, I would expect that an export of the archive to mbox, followed by a delete of the archive, followed by a hyperkitty_import of the archive, should leave you at the same place.
You would expect, for sure. You would be wrong, because mbox is a lossy format by design. (Or by lack of design, if you prefer.)
Not with ">From " escapes in the new archives. In fact I also had a number of messages with "Message-ID: <>" and worse: all messages with attachments had the text/plain content empty.
I don't EVEN want to think why that might be.
The following dump and import worked.
oldhost> mysqldump --no-create-info --no-create-db --disable-keys --complete-insert mailman3web > mailman3web.sql
newhost> mysql
MariaDB [(none)]> use mailman3web
MariaDB [mailman3web]> source mailman3web.sql
Yeah!!
Except I forgot how to update the FAQ. Now I have to learn again! :-)
Steve
Hi again.
On 20/09/2023 17:59, Stephen J. Turnbull wrote:
[...] I suspect that "DROP DATABASE mailmanweb;" and then loading the dumped database, followed by
mailman-web migrate
will work. (Usual caveat of you should have a backup onsite, a backup in a bank vault, and a backup store on the dark side of the moon before trying this!)
That's exactly what I did in my initial attempt. It was a straight dump and restore, which does "DROP DATABASE mailmanweb;" as its first command.
As per https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/message/..., the migrate error was the table django_content_type already exists. Indeed it exists in the old database, but other tables do not (list owner & moderator, and other social media tables).
So if a mailman-web dpkg update also updated the database, then Debian's update clearly did not properly log the changes. Or maybe they were logged elsewhere like in /var/lib/mailman3 in the old machine which also needed to be carried over to the new - so updating on a new machine involved copying over not just the database, but some other files that provide db state?
[...]
After all, I would expect that an export of the archive to mbox, followed by a delete of the archive, followed by a hyperkitty_import of the archive, should leave you at the same place.
You would expect, for sure. You would be wrong, because mbox is a lossy format by design. (Or by lack of design, if you prefer.)
I'll stick with "whatever mechanism you provide to produce an export should include a mechanism that can import it".
It's like: Here is a way you can back up your data (through an mbox export), but by the way, the format we back it up in is broken so restores (mbox import) are not guaranteed.
That's just broken logic IMHO, especially since the "fix" in this case is so simple. Yes, mbox is broken in a multitude of ways, but if hyperkitty is going to produce an mbox export of a list that is stored in a database (i.e. it has to construct the mbox), hyperkitty ought to do so in a format that it can at least re-import. My 2 pence worth...
Of course this may all work for you, if you are using other backend combinations...
I should mention that I had to re-apply and rework some of my old patches back into Debian's libs. For example, update_index_one_list fails when using xapian with "Key too long" errors, so I had to re-apply the patch I contributed in https://gitlab.com/mailman/hyperkitty/-/issues/322.
Sad that I am now of the age where I search for an issue only to discover I provided a patch for it 2 years ago ;-(
-- Alex
Alex Schuilenburg via Mailman-users writes:
If a mailman-web dpkg update also updated the database, then Debian's update clearly did not properly log the changes.
If you can reproduce any of these problems in a Mailman install from source (preferably contemporary release tags) or PyPI, we have something to talk about. I can't reproduce them, though. The evidence is very strong that the problems are in Debian packaging, or perhaps even vendored patches. Please talk to Debian about them.
Updating on a new machine involved copying over not just the database, but some other files that provide db state?
It shouldn't. That's the point of using a database, have the dynamic data in one place.
I should mention that I had to re-apply and rework some of my old patches back into Debian's libs.
https://gitlab.com/mailman/hyperkitty/-/issues/322.
Sad that I am now of the age where I search for an issue only to discover I provided a patch for it 2 years ago ;-(
In that GitLab issue, IMO Mark went above and beyond, identifying an upstream issue previously reported *and* two forks that address it. That's as far as we can usefully go.
I understand that you're frustrated because Mailman wasn't working as you expected it to work, but at some point you need to take the issues to projects that can solve them.
participants (2)
- Alex Schuilenburg
- Stephen J. Turnbull