Importing mbox files into archive defect with lines with From
A curious problem, but seems to be well defined. I am importing mbox format files with 25 years' worth of archives in 250 files from a Listserv list into hyperkitty. All is going swimmingly, except I get lots of partial emails with sender of None put into the current month, with text that has clearly been truncated. As I look at each one, I see that the line in the imported text is immediately after a line starting with 'From'. Clearly, 'From:' is an RFC 822 header, but 'From ' is not, and yet there is much evidence here that this is messing up the import.
Is this a known bug? If not, how do I report it?
My goodness - have just tested with a small Python program. This is a bug in Python. Parsing a single message mbox with a body line somewhere starting with 'From' breaks it into two fragmentary messages. Wow.
Off then to the Python developers world, although I find it hard to believe I could have found such a thing, yet the evidence is there.
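For anyone who wants to reproduce the symptom, here is a minimal sketch using only the standard library mailbox module. The addresses and message text are made up for illustration; the point is only that an unescaped body line starting with 'From ' is counted as the start of a new message, while the same line prefixed with '>' is not.

```python
import mailbox
import os
import tempfile

# One message whose body contains an unescaped line starting with "From ".
# All addresses and text here are invented for the demonstration.
RAW = (
    b"From alice@example.org Fri Jun 21 09:00:00 2019\n"
    b"From: alice@example.org\n"
    b"Subject: turnover\n"
    b"Message-ID: <1@example.org>\n"
    b"\n"
    b"Some text.\n"
    b"From 1st April 2019, our turnover has exceeded the threshold.\n"
    b"More text.\n"
)


def count_messages(raw: bytes) -> int:
    """Write raw bytes to a temporary mbox file and count what mailbox.mbox sees."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test.mbox")
        with open(path, "wb") as fh:
            fh.write(raw)
        box = mailbox.mbox(path, create=False)
        try:
            return len(box)
        finally:
            box.close()


unescaped = count_messages(RAW)
escaped = count_messages(RAW.replace(b"\nFrom 1st", b"\n>From 1st"))
print("unescaped:", unescaped, "escaped:", escaped)
```

The first count should come out one higher than the second, which matches the "partial emails" seen in the import.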
On Fri, 21 Jun 2019 09:22:46 +0100, <andrew.bernard@gmail.com> wrote:
My goodness - have just tested with a small Python program. This is a
bug in Python. Parsing a single message mbox with a body line somewhere
starting with 'From' breaks it into two fragmentary messages. Wow.
Off then to the Python developers world, although I find it hard to
believe I could have found such a thing, yet the evidence is there.
Won't they just tell you that a message body line starting "From " should
be escaped as ">From " in the mbox file ..?
Malcolm.
-- Malcolm Austen <malcolm.austen@weald.org.uk>
andrew.bernard@gmail.com writes:
My goodness - have just tested with a small Python program. This is a bug in Python. Parsing a single message mbox with a body line somewhere starting with 'From' breaks it into two fragmentary messages. Wow.
Almost certainly not a bug in Python. That's the mbox format. If you're into scatological software analysis, here's one of the best flamers in the business doing his thing on mbox:
https://www.jwz.org/doc/content-length.html
If you've been shocked by the date on that post, no, *nothing* has changed since then. Jamie explains why.
You may be dealing with MTA/MDA software using Content-Length, in which case I'm pretty sure the email package has an option for dealing with it. In that case, you should report an RFE to us and we'll try to figure out how to recognize that file format, and automatically deal with it. If not, the software is too unusual for us to worry about, and we'll help you work around the problem.
Regards,
- Stephen J. Turnbull:
Almost certainly not a bug in Python.
Agreed.
If you've been shocked by the date on that post, no, *nothing* has changed since then. Jamie explains why.
Well, there's https://tools.ietf.org/html/rfc4155 , but the amount of information presented is, as expected, very limited. I like the term "anecdotally documented" in that RFC. ;-)
Having had a shot at writing a robust mbox parser/converter myself, I came across numerous different formats/flavours, and frankly "mbox" is best avoided.
-Ralph
Ralph Seichter wrote:
Having had a shot at writing a robust mbox parser/converter myself, I came across numerous different formats/flavours, and frankly "mbox" is best avoided.
What is the recommended file format for mail archives to share?
I ask because I often see mailman mailing list archives available as some format of mbox file. I like this feature. I want to quickly import a bunch of archived messages into my account so that I can read old threads and get more discussion context anytime I need it.
On 6/21/19 8:38 PM, J.B. Nicholson wrote:
What is the recommended file format for mail archives to share?
I ask because I often see mailman mailing list archives available as some format of mbox file. I like this feature. I want to quickly import a bunch of archived messages into my account so that I can read old threads and get more discussion context anytime I need it.
Some flavor of mbox is ubiquitous in the *nix world. The problem with this format is lines in the message body beginning with 'From '. Most modern software that writes mbox format files avoids this problem by somehow mangling such lines. The Python mailbox package will prefix such lines with '>' when writing a message to a mbox.
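A small sketch of that write-side behaviour, assuming a current Python 3 standard library. The message content is invented; the code adds one message to a fresh mbox and then looks at the raw bytes on disk.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage

# Build a message whose body has a line starting with "From ".
msg = EmailMessage()
msg["From"] = "alice@example.org"
msg["Subject"] = "turnover"
msg.set_content("Some text.\nFrom 1st April 2019, our turnover rose.\n")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.mbox")

    # Writing through mailbox.mbox escapes the body line.
    box = mailbox.mbox(path)
    box.add(msg)
    box.flush()
    box.close()

    with open(path, "rb") as fh:
        raw = fh.read()

    # Reading the file back should still give exactly one message.
    box = mailbox.mbox(path, create=False)
    stored = len(box)
    box.close()

print("escaped on disk:", b">From 1st April" in raw)
print("messages read back:", stored)
```

Note that this only protects mboxes written this way; files produced by other software, or long ago, may still contain unescaped lines.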
Before Mailman 3, Mailman kept a cumulative archive in mbox format. Even after pipermail was included, the cumulative mbox was normally written and could be used to rebuild the pipermail archive or move the archive elsewhere. The problem is that the Python mailbox package didn't always mangle 'From ' lines in the body, so archive mboxes that go back many years are likely to have problems.
A more robust mailbox format which is still a single file is MMDF format. This is similar to mbox with the addition that each message is preceded and followed by a line consisting of exactly four control-A bytes. I.e., there is one such line at the beginning and at the end of the file and two such lines between each message. Mailman 3 uses this format for the mailboxes that accumulate messages for digests.
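To illustrate why the control-A lines help, here is a rough sketch that writes a one-message MMDF file by hand and reads it with the standard library's mailbox.MMDF class. The file content is invented, and the layout (a 'From ' line directly after the opening delimiter) is what Python's reader expects; other MMDF producers may differ.

```python
import mailbox
import os
import tempfile

SEP = b"\x01\x01\x01\x01"          # four control-A bytes
NL = os.linesep.encode("ascii")    # the mailbox module compares against os.linesep

# One message bracketed by delimiter lines, with an unescaped
# "From " line in the body.
lines = [
    SEP,
    b"From alice@example.org Fri Jun 21 09:00:00 2019",
    b"From: alice@example.org",
    b"Subject: turnover",
    b"",
    b"Some text.",
    b"From 1st April 2019, our turnover rose.",
    SEP,
]
raw_mmdf = NL.join(lines) + NL

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.mmdf")
    with open(path, "wb") as fh:
        fh.write(raw_mmdf)

    box = mailbox.MMDF(path, create=False)
    try:
        mmdf_count = len(box)
        bodies = [m.get_payload() for m in box]
    finally:
        box.close()

print("messages:", mmdf_count)
print("body kept intact:", "From 1st April" in bodies[0])
```

Because message boundaries are found from the delimiter lines rather than from 'From ', the body line does not split the message.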
However, lots of things including HyperKitty's importer and Mailman 2's bin/arch tool require input in mbox format.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
That's funny that this should come up, as I have just been asking questions in the list about importing mbox format files into hyperkitty, just yesterday.
The archives I needed to import went back 30 years. I thought I found a bug in Python when I discovered that posts that contain '^From' cause the Python mbox library code to choke and throw bogus cut messages. It turns out this is a well known issue with the very loose mbox format, of which there are dozens of species known to science. The problem is solved by mangling the files with a pre-process script. I changed all body text with '^From' to '^ From' and hyperkitty and Python will behave correctly. Not ideal but better than dropping messages from the old archive altogether.
This issue could very well go in an FAQ for mailman. Can I add a topic somehow? Many others will hit the same problem.
Andrew
On 6/21/19 10:05 PM, Andrew Bernard wrote:
This issue could very well go in an FAQ for mailman. Can I add a topic somehow? Many others will hit the same problem.
There is a FAQ article at <https://wiki.list.org/x/4030689>. If you wish to edit that or create a new article, see the first paragraph at <https://wiki.list.org/> about obtaining write permission.
There is also some documentation of hyperkitty_import at <https://hyperkitty.readthedocs.io/en/latest/install.html#importing-the-current-archives>, the source for which is at <https://gitlab.com/mailman/hyperkitty/blob/master/doc/database.rst>. You could create a merge request to update that.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
- J. B. Nicholson:
What is the recommended file format for mail archives to share?
Maildir [1] has been around for decades and is widely supported. Unlike mbox, it stores one email message per file, so there is no problem identifying individual messages.
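If you already have a clean mbox, converting it can be sketched with the standard library alone. The function name mbox_to_maildir and the sample messages are made up for this example, and the target directory is assumed not to exist yet.

```python
import mailbox
import os
import tempfile


def mbox_to_maildir(mbox_path: str, maildir_path: str) -> int:
    """Copy every message from an mbox file into a Maildir; return the count."""
    src = mailbox.mbox(mbox_path, create=False)
    dst = mailbox.Maildir(maildir_path, create=True)
    copied = 0
    try:
        for message in src:
            dst.add(message)      # each message becomes its own file under new/
            copied += 1
        dst.flush()
    finally:
        src.close()
        dst.close()
    return copied


# Small demonstration with two invented messages.
SAMPLE = (
    b"From a@example.org Fri Jun 21 09:00:00 2019\n"
    b"From: a@example.org\n"
    b"Subject: one\n"
    b"\n"
    b"first\n"
    b"\n"
    b"From b@example.org Fri Jun 21 10:00:00 2019\n"
    b"From: b@example.org\n"
    b"Subject: two\n"
    b"\n"
    b"second\n"
)

with tempfile.TemporaryDirectory() as tmp:
    mbox_path = os.path.join(tmp, "in.mbox")
    with open(mbox_path, "wb") as fh:
        fh.write(SAMPLE)
    maildir_path = os.path.join(tmp, "Maildir")
    copied = mbox_to_maildir(mbox_path, maildir_path)
    files_in_new = len(os.listdir(os.path.join(maildir_path, "new")))

print("copied:", copied, "files in new/:", files_in_new)
```

The conversion is only as good as the mbox parse, so any unescaped 'From ' lines in the source need fixing first.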
On my servers, incoming mail is delivered to Dovecot (configured to use maildir storage). Users can access via IMAP and Notmuch [2], the latter directly accessing maildir. Notmuch is what I am using right now. In Emacs. Because I can. ;-)
BTW, Notmuch does not care much about the directory structure, but attempts to parse any file not explicitly added to an ignore list as email. You can thus add files/directories not strictly adhering to the official structure and Notmuch will happily index them.
-Ralph
[1] https://en.wikipedia.org/wiki/Maildir [2] https://notmuchmail.org
J.B. Nicholson writes:
Ralph Seichter wrote:
Having had a shot at writing a robust mbox parser/converter myself, I came across numerous different formats/flavours, and frankly "mbox" is best avoided.
What is the recommended file format for mail archives to share?
This depends on requirements you haven't stated. If you have a "From munging" mail system that creates mboxes, your messages are already From-munged, and there's little point in going beyond the traditional ^From_-delimited mbox, which is quite safe and universally understood. (Unless for the sake of prettification you're willing to clean up the munged Froms pretty much by hand, as they sometimes can't be distinguished from quoting, and the fact that they're often quoted makes finding them non-trivial, too. Then go ahead, and when done use a reliable format.)
If you have a message-at-a-time system (as Mailman does, since it receives messages over a pipe -- Mailman 2, or LMTP -- Mailman 3), you have at least two good choices for single-file distribution:
- MMDF
- put it in a maildir and zip it up (for maximum portability; obviously if you know the target host(s), tar + compressor du jour is good too)
and one mediocre (but in some sense portable) one:
- From-munged mbox
Obviously MMDF and mbox can be compressed if you like. Note that both MMDF and maildir can be considered message-at-a-time, and each converted to the other. How you sort the stream is up to you, again depending on unstated requirements.
All of the above are widely understood by MUAs, and easily interconverted. If you have some other format, such as Content-Length, it's best to convert to one of the reliable (ie, well-defined and usually correctly implemented) ones, maildir or MMDF.
There are other reliable formats, such as Babyl as used by Emacs's RMail MUA, but simple is best.
Steve
On 6/21/19 12:57 AM, andrew.bernard@gmail.com wrote:
A curious problem, but seems to be well defined. I am importing mbox format files with 25 years' worth of archives in 250 files from a Listserv list into hyperkitty. All is going swimmingly, except I get lots of partial emails with sender of None put into the current month, with text that has clearly been truncated. As I look at each one, I see that the line in the imported text is immediately after a line starting with 'From'. Clearly, 'From:' is an RFC 822 header, but 'From ' is not, and yet there is much evidence here that this is messing up the import.
Is this a known bug? If not, how do I report it?
As other responses in this thread have pointed out, it is a well known issue with mbox format in that lines beginning with 'From ' are message separators.
There is a (Python 2) script at <https://bazaar.launchpad.net/~mailman-coders/mailman/2.1/view/head:/bin/cleanarch> that will process a mbox and prefix with '>' all lines beginning with 'From ' that don't look like real 'From ' lines or which aren't immediately followed by a line that looks like a valid header line.
It isn't perfect because it won't handle a message that contains in its body a copy of another message containing a Unix From_ line, but it can help with most unescaped From_ lines.
If you want to use that script, replace lines 55 and 56
import paths
from Mailman.i18n import C_
with
def C_(s): return s
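For Python 3 users, the heuristic described above can be sketched roughly as follows. This is not the cleanarch code; the names SEPARATOR, HEADER and escape_from_lines are invented here, and the regular expression for a "real" From_ line is an assumption (address, weekday, month, day, time, year) that may need adjusting for your archive's flavor of mbox.

```python
import re

# Assumed shape of a genuine separator line, for example:
#   From alice@example.org Fri Jun 21 09:00:00 2019
SEPARATOR = re.compile(
    rb"^From \S+ +(Mon|Tue|Wed|Thu|Fri|Sat|Sun) +"
    rb"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +"
    rb"\d{1,2} +\d{2}:\d{2}(:\d{2})? +\d{4}"
)
# Something that looks like a header: printable field name, then a colon.
HEADER = re.compile(rb"^[!-9;-~]+:")


def escape_from_lines(lines, prefix=b">"):
    """Prefix 'From ' lines that do not look like real message separators.

    A line is left alone only if it matches SEPARATOR and the next
    line looks like a header; every other 'From ' line gets the prefix.
    """
    out = []
    for i, line in enumerate(lines):
        if line.startswith(b"From "):
            following = lines[i + 1] if i + 1 < len(lines) else b""
            if not (SEPARATOR.match(line) and HEADER.match(following)):
                line = prefix + line
        out.append(line)
    return out


sample = [
    b"From alice@example.org Fri Jun 21 09:00:00 2019\n",
    b"From: alice@example.org\n",
    b"\n",
    b"From 1st April 2019, our turnover rose.\n",
]
fixed = escape_from_lines(sample)
print(fixed[3])
```

To process a file, read it in binary mode with readlines(), pass the list through escape_from_lines and write the result to a new file. Like the original script, this sketch is fooled by a quoted message that carries its own valid-looking From_ line and headers.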
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
Thanks all for pointing out the limitations of the mbox format, of which I was unaware.
It was easy for me to add a line to my preprocessing Python to put a space in front of the From lines in question. I prefer this instead of '>' as that tends to indicate quotation, which would be confusing. Having to mangle the input messages, using a space seems to be the least intrusive. I have imported 25 years of messages, and there were hundreds of occurrences of this. Ironically, the preprocessing has to be done on mbox format archives from Listserv which depart from the rough mbox standard. I was aware of the lack of uniformity of what people think mbox is, but not the From gotcha.
Andrew
Prefixing lines within an email body that begin "From " with '>' does not work with python3-mailman-hyperkitty 1.2.1-1 (debian) - you may as well delete the remainder of the message.
Simple example: I have a message that contains the line: From 1st April 2019, our turnover has exceeded the threshold for
Unfortunately this breaks hyperkitty_import which sees this as the start of a new message (FYI: check_hk_import reports n+1 messages in the mbox which only contains n messages, and hyperkitty_import fails as there are no subsequent header fields following the "From " such as message-id etc).
Prefixing the line with '>' allows the hyperkitty_import to partially succeed - partially as the line and everything below it is omitted from message in the archive. Presumably because the archiver thinks the remainder is a quote which it omits?
Prefixing with a single whitespace was the best work-around, since the renderer ignores it when viewing the message from the archive.
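To locate the offending lines before importing, something like the following sketch may help. It is not part of HyperKitty; find_fragments is a made-up helper that flags any parsed message lacking a From: or Message-ID: header and reports the text that followed 'From ' on its separator line, which for a false split is the start of the body line that caused it.

```python
import mailbox
import os
import tempfile


def find_fragments(mbox_path):
    """Return (key, separator text) for messages lacking From: or Message-ID:."""
    suspects = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for key, msg in box.iteritems():
            if msg["From"] is None or msg["Message-ID"] is None:
                # get_from() is the separator line minus the leading "From ".
                suspects.append((key, msg.get_from()))
    finally:
        box.close()
    return suspects


# Demonstration on an invented one-message mbox with an unescaped line.
BROKEN = (
    b"From alice@example.org Fri Jun 21 09:00:00 2019\n"
    b"From: alice@example.org\n"
    b"Subject: turnover\n"
    b"Message-ID: <1@example.org>\n"
    b"\n"
    b"Some text.\n"
    b"From 1st April 2019, our turnover has exceeded the threshold.\n"
    b"More text.\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "broken.mbox")
    with open(path, "wb") as fh:
        fh.write(BROKEN)
    suspects = find_fragments(path)

for key, text in suspects:
    print(key, repr(text))
```

Genuine messages that simply lack a Message-ID will also be listed, so treat the output as a list of candidates to inspect rather than a definitive answer.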
Alex Schuilenburg via Mailman-users writes:
Prefixing lines within an email body that begin "From " with '>' does not work with python3-mailman-hyperkitty 1.2.1-1 (debian) -
1.2.1 is very old, although check_hk_import has had only an irrelevant change since then. I believe the same is true of hyperkitty_import. Although changes are somewhat more extensive, the basic strategy of using mailbox.mbox to parse hasn't changed.
you may as well delete the remainder of the message.
Simple example: I have a message that contains the line:
From 1st April 2019, our turnover has exceeded the threshold for
Unfortunately this breaks hyperkitty_import which sees this as the start of a new message (FYI: check_hk_import reports n+1 messages in the mbox which only contains n messages, and hyperkitty_import fails as there are no subsequent header fields following the "From " such as message-id etc).
Are you sure you're reporting this correctly? I ask because I'd like to help but this doesn't look like it can be a Mailman problem to me. check_hk_import is a very simple script which merely reads an mbox file into a stdlib mailbox.mbox object. A quick test with a check_hk_import I have lying around (HyperKitty 1.3.8b1) works fine, no anomalies with From-stuffed lines.
hyperkitty_import is far more complex, but AFAICS in a quick look it also depends on the stdlib mailbox module to parse the mbox file. If the stdlib mailbox module is broken, that's going to break the world (which is what you see).
Perhaps Mark or somebody has a better idea, but this sounds to me like you have another mailbox.py on your pythonpath, perhaps from Python 2. If not, your Python 3 installation may be broken.
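One quick way to check that theory is to ask the interpreter which mailbox module it actually imports. This is only a sketch; mailbox_module_info is a made-up helper, and it needs to be run with the same Python (and the same environment) that Mailman and HyperKitty use.

```python
import mailbox
import sys


def mailbox_module_info():
    """Report which interpreter is running and where 'mailbox' was loaded from."""
    return {
        "python": sys.version.split()[0],
        "executable": sys.executable,
        "mailbox_file": getattr(mailbox, "__file__", None),
        "has_mbox": hasattr(mailbox, "mbox"),
    }


info = mailbox_module_info()
for name, value in info.items():
    print(f"{name}: {value}")
```

If mailbox_file points somewhere other than the standard library directory of that Python 3 installation, a stray mailbox.py is shadowing the stdlib module.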
Steve
On 9/19/23 2:25 AM, Alex Schuilenburg via Mailman-users wrote:
Prefixing the line with '>' allows the hyperkitty_import to partially succeed - partially as the line and everything below it is omitted from message in the archive. Presumably because the archiver thinks the remainder is a quote which it omits?
Prefixing with a single whitespace was the best work-around, since the renderer ignores it when viewing the message from the archive.
As discussed earlier in this thread, issues with unescaped 'From ' lines in message bodies exist because of the format of *nix mbox files where lines beginning with 'From ' are message separators by definition.
Escaping such lines by prefixing them with '>' is a de facto standard.
As far as this causing HyperKitty to ignore the rest of the message is concerned, I do not see that with current HyperKitty.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
participants (8): alexs@ecoscentric.com, Andrew Bernard, andrew.bernard@gmail.com, J.B. Nicholson, Malcolm Austen, Mark Sapiro, Ralph Seichter, Stephen J. Turnbull