tlhackque writes:
Not sure why you believe this. RFC2822 3.6.4 defines References for e-mail.
Ouch! I went to RFC 5322, searched for "References" and was taken to "Section 7. References". Evidently it was already positioned at the end of the appendices. :-(
As I wrote in a later post, the message-ID syntactically requires the <> (just one set).
msg-id = [CFWS] "<" id-left "@" id-right ">" [CFWS]
Look at that for the full syntax and reference.
Sure. It's messy because the id-left could be a "local-part", which could be almost anything if quoted, and the id-right could be a domain literal, which is a little more restricted.
2822 does say:
Therefore, trying to form a "References:" field for a reply that has multiple parents is discouraged and how to do so is not defined in this document.
But programmers are rarely discouraged - with GUIs, it's pretty easy to intuit that one might like to check the boxes on several branches of a thread and respond "Of course, you're all right - see my cat photo". :-)
Oh, I've done this by hand! ;-)
5537 tries to make life easier - until you get to:
5322 has adopted the 5537 language for the abstract construction of the field. It doesn't say anything at all about the logical length issue and trimming.
The best one can do is to come up with a "good enough" approximation in the available time, and tinker with/improve it until bored.
Which is what Jamie Zawinski did. His algorithm (adopted by IMAP as the standard for threading IMAP servers) has three features (1) it's an algorithm (guaranteed to terminate on finite input :-), (2) it allows for various tie-breaking methods, and (3) if you have enough messages from the thread and all In-Reply-To and References conform to the 5537 language, it will be consistent with all References data.
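For concreteness, here is a rough Python sketch of the References-linking step of that kind of threader. It is not the full algorithm: In-Reply-To handling, pruning of empty placeholders, and subject gathering are all left out, and the message representation (a dict with hypothetical "id" and "references" keys, IDs already extracted) is my assumption for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # compare containers by identity, not by field values
class Container:
    msg_id: str
    message: Optional[dict] = None   # None means a placeholder (message not seen)
    parent: Optional["Container"] = None
    children: list = field(default_factory=list)


def is_ancestor(maybe_ancestor, node):
    """True if maybe_ancestor is node itself or appears above it."""
    while node is not None:
        if node is maybe_ancestor:
            return True
        node = node.parent
    return False


def link(parent, child):
    """Make child a child of parent unless that would create a loop."""
    if is_ancestor(child, parent):
        return
    if child.parent is not None:
        child.parent.children.remove(child)
    child.parent = parent
    parent.children.append(child)


def thread(messages):
    """Return the root containers for an iterable of message dicts."""
    table = {}

    def get(msg_id):
        if msg_id not in table:
            table[msg_id] = Container(msg_id)
        return table[msg_id]

    for msg in messages:
        this = get(msg["id"])
        this.message = msg
        previous = None
        for ref in msg.get("references", []):
            current = get(ref)
            # Another message's References only hint at the chain, so an
            # existing parent link is never overridden here.
            if previous is not None and current.parent is None:
                link(previous, current)
            previous = current
        # A message's own last reference is taken as its parent.
        if previous is not None:
            link(previous, this)
    return [c for c in table.values() if c.parent is None]
```

A message that is referenced but never seen shows up as a Container whose message is None, which is the "placeholder" case discussed further down.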
See my other post. The left and right halves can, besides being atoms, be non-folding quotes or literals. So you have to handle that.
Of course, but that's just a SMOP. The harder problem is figuring out what to do about non-conforming input.
Although it has a defined syntax, the semantics are that message-id is an opaque globally-unique identifier. Attempting to parse it as anything but '<[^>]+>' is likely to be a mistake.
That's true in a world where people actually follow the rules.
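To make the "opaque" approach concrete, a minimal sketch in Python. The function name is made up for illustration; the pattern is the one quoted above.

```python
import re

# The pattern from the discussion: anything between "<" and the next ">".
MSG_ID = re.compile(r"<[^>]+>")


def extract_msg_ids(field_value):
    """Return the msg-id tokens found in a References or In-Reply-To value."""
    return MSG_ID.findall(field_value)


# Well-formed input: separators (space, tab, folding) simply fall away.
print(extract_msg_ids("<a@x.example>\t<b@y.example>\r\n <c@z.example>"))

# Non-conforming input: the doubled delimiter yields a token that will not
# match the real Message-ID, and a bare ID yields nothing at all.
print(extract_msg_ids("<<a@x.example>>"))
print(extract_msg_ids("a@x.example"))
```

With conforming input this is all you need. The last two calls show where it stops being enough.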
I doubt 'stripping' is worth doing - when a message-id is present, I've almost always seen it include the required <>.
Since the delimiters are constants and required, consistently stripping them doesn't hurt with a well-formed msg-id (the delimiters aren't allowed in a msg-id). And you do have to strip the whitespace, because it's likely that different MUAs will do different things with it since they edit the old value (space-separated, tab-separated). The point is to get to the unique content.
It's also possible to keep the literal content (after stripping whitespace) as well as the "cleaned" version, and use whichever one corresponds to a real message or to a different value of References in the same thread. It would be amusing if *both* were associated with real messages, but that seems unlikely.
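A sketch of what I mean, in Python. The names are hypothetical, and known_ids stands for whatever index of real messages the archiver keeps; how that index is keyed is an assumption here.

```python
def clean_msg_id(raw):
    """Return (literal, cleaned) forms of one msg-id token.

    literal: the token with surrounding whitespace removed.
    cleaned: literal with every leading and trailing "<" or ">" removed,
             so "<id>" and "<<id>>" both reduce to "id".
    """
    literal = raw.strip()
    cleaned = literal.strip("<>")
    return literal, cleaned


def resolve(raw, known_ids):
    """Return whichever form names a known message, preferring the literal."""
    for candidate in clean_msg_id(raw):
        if candidate in known_ids:
            return candidate
    return None
```

For example, resolve("<<a@b.example>>", {"a@b.example"}) falls through the literal form and returns "a@b.example", while a token that matches nothing under either form returns None.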
Until yesterday's hyperkitty bug, I've never seen <<>>.
At least one of my correspondents has a client that frequently creates "addresses" of the form "<a@b.net <a@b.net>>". I don't recall seeing "<<...>>" before, but I rarely look at Message-IDs.
I've rarely seen the <> missing - but I'd consider that a warning that the message format is suspect. As in this case.
Agreed.
My suggestion is that if you can't find a message-ID,
A Message-ID field, or a valid msg-id in the References or In-Reply-To fields?
keep the message in an "unthreaded" bucket. If you sort that by subject (omitting "[listtag]" and "Re:*", case-insensitively and in multiple languages), then by date, you probably have a useful presentation. And a list of MUAs that need bug reports :-)
The thing is that a message without a Message-ID (which I believe should not happen in Mailman; IIRC Mailman will assign one before distributing) is going to end up being a singleton thread. If there are replies to it, I don't see why they would be likely to be temporally adjacent. If they have proper replies, they will end up as proper thread roots, not in the unthreaded bucket. Am I missing something?
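For reference, the proposed ordering of the "unthreaded" bucket could look something like this in Python. The reply-prefix list is an assumption and nowhere near complete for "multiple languages", and the message dicts with "subject" and "date" keys are hypothetical.

```python
import re
from datetime import datetime

# One leading "[listtag]" or reply prefix. "re", "aw", "sv", "antw" are just
# a few examples; a real list would need many more.
PREFIX = re.compile(
    r"^\s*(?:\[[^\]]*\]|(?:re|aw|sv|antw)(?:\[\d+\])?\s*:)\s*",
    re.IGNORECASE,
)


def base_subject(subject):
    """Strip leading list tags and reply prefixes until nothing changes."""
    previous = None
    while previous != subject:
        previous = subject
        subject = PREFIX.sub("", subject, count=1)
    return subject.strip().casefold()


def sort_unthreaded(messages):
    """Order messages by base subject, then by date."""
    return sorted(messages, key=lambda m: (base_subject(m["subject"]), m["date"]))
```

Note that this also strips a leading bracketed tag that is not a list tag (e.g. "[PATCH]"), which may or may not be what you want.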
If it's an unusable munged reference in References, the munged reference may be visible as a placeholder (no real message) in a separate subthread, or pruned (and invisible) because no real message is indicated by it. The indicated message will be in a separate subthread if it has a valid References (including no References, in which case it will be a thread root). And, of course, if it's not the immediate parent, other messages' References fields will likely allow the message to be threaded correctly. AFAICS "stripping and cleaning" an invalid msg-id is highly unlikely to duplicate a valid msg-id associated with a different message, although there's a good chance it won't allow identification of any message at all -- which is where we started.
--
Associate Professor              Division of Policy and Planning Science
http://turnbull.sk.tsukuba.ac.jp/     Faculty of Systems and Information
Email: turnbull@sk.tsukuba.ac.jp                   University of Tsukuba
Tel: 029-853-5175                 Tennodai 1-1-1, Tsukuba 305-8573 JAPAN