On 13-Feb-19 21:42, Stephen J. Turnbull wrote:
As I wrote in a later post, the message-ID syntactically requires the <> (just one set).
msg-id = [CFWS] "<" id-left "@" id-right ">" [CFWS]
Look at that for the full syntax and reference.
Sure. It's messy because the id-left could be a "local-part", which could be almost anything if quoted, and the id-right could be a domain literal, which is a little more restricted.
No. It's not an address. "Local part" has to do with an address. In a Message-ID,
it's "id-left", which is supposed to be something generated by the host that makes the message unique within the namespace defined by "id-right". And while the RFC recommends that id-right be a domain name, it needn't be. In fact, as one of the references I used pointed out, not all hosts have one.
One could generate a UUID and use that for both id-left and id-right & be conforming (though given some client bugs, I'd change the '-' in the standard UUID presentation to '.', or just delete it.) Or one could generate a UUID for the right part, use it for all messages sent by a client, and use something else for the left - like a hash of the message+headers. I tend to use a domain name for the right part, and UUID for the left. Then I prefix the left with a number when I need a Content-ID for a body part. The UUID pretty much makes the right part unnecessary, but since it's syntactically required, when I have a domain name it can be useful for forensics. When I don't, the UUID is fine.
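A minimal sketch of that scheme, assuming Python's standard uuid module. The function name new_ids, the "N." prefix format for Content-IDs, and the example domain are illustrative choices, not anything the RFCs prescribe:

```python
import uuid

def new_ids(domain=None, n_parts=0):
    """Return (message_id, content_ids) built around one fresh UUID.

    domain  -- id-right; if None, a second UUID is used instead (assumption:
               the caller has no usable domain name).
    n_parts -- number of body parts needing a Content-ID.
    """
    # .hex drops the '-' of the standard UUID presentation, which sidesteps
    # the client bugs mentioned above.
    left = uuid.uuid4().hex
    right = domain if domain else uuid.uuid4().hex
    message_id = f"<{left}@{right}>"
    # Prefix the left part with a number for each body part (hypothetical format).
    content_ids = [f"<{i}.{left}@{right}>" for i in range(1, n_parts + 1)]
    return message_id, content_ids

if __name__ == "__main__":
    mid, cids = new_ids("mail.example.org", n_parts=2)
    print(mid)
    print(cids)
```

Either way the result is just an opaque string to whoever receives it; the domain only helps a human doing forensics.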
The advantage of using a domain name for id-right is that there is a global registration system (DNS) that makes it unlikely for conflicts to occur. And that was true in RFC 733's time, when hosts were king. Modulo all the "example.com". But once PCs came around, it stopped working as well. DHCP-assigned names and unnamed clients just don't provide the same uniqueness as bbn-tenex.com used to.
I think the '@' was just syntactic sugar - if you accept a domain name for id-right, then, as in an e-mail address, it indicates the scope of the unique part (id-left). But just because it can look like an e-mail address doesn't mean it is one.
Treat the whole thing as an opaque string. Nothing else is safe.
2822 does say:
Therefore, trying to form a "References:" field for a reply that has multiple parents is discouraged and how to do so is not defined in this document.
But programmers are rarely discouraged - with GUIs, it's pretty easy to intuit that one might like to check the boxes on several branches of a thread and respond "Of course, you're all right - see my cat photo". :-)
Oh, I've done this by hand! ;-) Yes, but did you fix the References header to match?
5537 tries to make life easier - until you get to:
5322 has adopted the 5537 language for the abstract construction of the field. It doesn't say anything at all about the logical length issue and trimming.
Yes, but you brought up the news RFCs, IIRC in saying that they defined threading.
My quote was verbatim. 5537 does say you must trim. The point is that the news rfcs may have introduced References, but they have different obstacles to reconstructing threading.
The best one can do is to come up with a "good enough" approximation in the available time, and tinker with/improve it until bored.
Which is what Jamie Zawinski did. His algorithm (adopted by IMAP as the standard for threading IMAP servers) has three features: (1) it's an algorithm (guaranteed to terminate on finite input :-), (2) it allows for various tie-breaking methods, and (3) if you have enough messages from the thread and all In-Reply-To and References conform to the 5537 language, it will be consistent with all References data.
I haven't looked at that. Given the trimming in 3.4.4 of 5537, I don't see how this can produce the whole thread, unless you are guaranteed to have the complete thread to fill the gaps. And you're not. (3) is the issue - you need "enough" and in the worst case, you get 3 references (first and last two). Plus, you can lose messages for strange reasons: The moderator deleted one for inappropriate content. A message was copied to the list and the poster. The response to the list is lost. But a reply from the poster happens later. So the response in question is never seen by the server, though the reference is. Life is hard :-)
As a practical matter, I don't expect a sane coder to trim unless necessary. So there's a good chance that you can reconstruct a complete thread in real life. But "necessary" when your main disk was an 8" floppy (or paper tape) meant something different from what it means now that my notebook PC has a couple of TB of SSD. That doesn't mean that the code has changed.
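A minimal sketch of the trimming as described above: drop identifiers only when the field is too long, and never drop the first or the last two. The function name, the default limit of 998 characters, and the choice to drop the oldest removable entries first are assumptions for illustration; check 5537 before relying on any of them.

```python
def trim_references(msg_ids, max_len=998, field_name="References: "):
    """Trim a list of msg-id strings so the unfolded field fits in max_len.

    The first identifier and the last two are always kept, so the worst
    case leaves three references. max_len and the removal order are
    assumptions of this sketch.
    """
    ids = list(msg_ids)

    def field_length(seq):
        # Length of the unfolded field: name plus space-separated ids.
        return len(field_name) + len(" ".join(seq))

    # Remove from just after the thread root until it fits or only 3 remain.
    while field_length(ids) > max_len and len(ids) > 3:
        del ids[1]
    return ids

if __name__ == "__main__":
    refs = [f"<{i}@example.net>" for i in range(10)]
    print(trim_references(refs, max_len=80))
```

Anything removed this way is a gap that a threader can only fill if it actually holds the intermediate messages.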
Heuristics are fine. Guaranteeing that the resulting algorithm terminates is important. But real life doesn't get you to "enough messages" 100% of the time.
See my other post. The left and right halves can, besides being atoms, be non-folding quotes or literals. So you have to handle that.
Of course, but that's just a SMOP. The harder problem is figuring out what to do about non-conforming input. And, as I've noted: lost input.
Although it has a defined syntax, the semantics are that message-id is an opaque globally-unique identifier. Attempting to parse it as anything but '<[^>]+>' is likely to be a mistake.
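A minimal sketch of that opaque approach, assuming the field has already been unfolded and decoded to a string. The name extract_ids is hypothetical; the pattern is the one quoted above.

```python
import re

# The whole "<...>" token is the key; nothing inside it is interpreted.
MSG_ID = re.compile(r"<[^>]+>")

def extract_ids(field_value):
    """Return the msg-id tokens found in a References/In-Reply-To value.

    Whitespace between tokens (spaces, tabs, folding) is ignored simply
    because it is not part of any match. Limitation of this sketch: a
    quoted id-left containing '>' would be cut short.
    """
    return MSG_ID.findall(field_value)

if __name__ == "__main__":
    value = '<a@b.net>\t<c@d.net>\r\n <"two<three"@example.net>'
    for token in extract_ids(value):
        print(token)
```

The returned tokens can be used directly as dictionary keys when matching References against Message-ID fields.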
That's true in a world where people actually follow the rules. When they don't, trying to guess at the semantics of an opaque ID will seem to work for a while. But it amounts to the halting problem. There are lots of variants of "generate a globally unique ID with an '@' in the middle". I may have created a new one today :-)
I doubt 'stripping' is worth doing - when a message-id is present, I've almost always seen it include the required <>.
Since the delimiters are constants and required, consistently stripping them doesn't hurt with a well-formed msg-id (the delimiters aren't allowed in a msg-id). And you do have to strip the whitespace, because it's likely that different MUAs will do different things with it since they edit the old value (space-separated, tab-separated). The point is to get to the unique content.
Yes. But if they're well-formed, the <> are there, so stripping them only saves a couple of bytes.
If they're not, all bets are off. <two<three@example.net> - stripping the outer <>s doesn't help.
<"two<three"@example.net> is a valid message-ID, and distinct from <"twothree"@example.net>. And is
<twothree@example.net> distinct from <"twothree"@example.net>? (Unnecessary quoting, or a distinct message-ID?) If you treat it as opaque, you don't care. Take the whole thing, <@> included, as given and look for it in the other fields. The most you might do is remove quotes (and escapes) and use the left and right parts as your key.
The whitespace is the [CFWS] on either side of the "<" id-left "@" id-right ">" production in 2822.
That's where you must strip it to get the unique ID.
It's also possible to keep the literal content (after stripping whitespace) as well as the "cleaned" version, and use whichever one corresponds to a real message or to a different value of References in the same thread. It would be amusing if *both* were associated with real messages, but that seems unlikely.
You could also fingerprint the user agent (e.g. by the order of headers, format of message-ID), and correct for its bugs. But I'm inclined to report client bugs and get them fixed. Meantime, their messages are unthreaded (but not lost). It's just not worth working around other people's bugs - there are better uses for your time. Just ask the DNS people. They just had a Flag Day to get rid of years of built-up workarounds in nameservers...
Until yesterday's hyperkitty bug, I've never seen <<>>.
At least one of my correspondents has a client that frequently creates "addresses" of the form "<a@b.net <a@b.net>>". I don't recall seeing "<<...>>" before, but I rarely look at Message-IDs.
> I've rarely seen the <> missing - but I'd consider that a warning that the message format is suspect. As in this case.
Agreed.
My suggestion is that if you can't find a message-ID,
A Message-ID field, or a valid msg-id in the References or In-Reply-To fields?
I meant the latter. It's the references that you need to reconstruct the thread - to the extent possible.
keep the message in an "unthreaded" bucket. If you sort that by subject (omitting "[listtag]" and "Re:*", case insensitive and in multiple languages), then by date, you probably have a useful presentation. And a list of MUAs that need bug reports :-)
The thing is that a message without a Message-ID (which I believe should not happen in Mailman, Mailman will assign one before distributing IIRC) is going to end up being a singleton thread. If there are replies to it, I don't see why they would be likely to be temporally adjacent. If they have proper replies, they will end up as proper thread roots, not in the unthreaded bucket. Am I missing something?
Yes. A message ID should exist. These days I think all MTAs will assign one if the client doesn't.
What I tried to say was that there will be times when you can't figure out where to put a message into the thread graph (or forest). Because it references a message that you can't find, or it doesn't have a References or In-Reply-To field, or those fields are corrupt.
In those cases, put them in a bucket. Call them orphans.
Most conversations come in bursts. And if I reply to you, my reply will (modulo time errors) be later than your post.
Further, there's a pretty good chance that my subject will be "Re: yours".
So, if that bucket is sorted by subject (primary key) and date (secondary), it's a fair approximation that, within the set of orphans, related messages will be near each other. Which is better than nothing.
You're right that if my client generates good headers and yours doesn't, yours will be orphaned. But at least all your replies will share a subject, so in that bucket will be your side of the conversation, roughly in order. And if you quote my text when replying, both sides.
If several people have clients that orphan their replies, they all will cluster because of the subject. And be in rough time order.
You do want to skip any [list] tag and [Re:]* because they don't change the subject.
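A minimal sketch of that ordering, assuming each orphan is a dict with a "subject" string and a "date" datetime (a hypothetical representation). The reply prefixes other than "Re" are just a few examples of other-language variants; a real list would be longer.

```python
import re
from datetime import datetime

# Leading "[listtag]" and reply prefixes, repeated in any order.
# "aw", "sv", "antw" are illustrative non-English variants (assumption).
_PREFIX = re.compile(
    r"^\s*(?:\[[^\]]*\]\s*|(?:re|aw|sv|antw)\s*:\s*)+",
    re.IGNORECASE,
)

def subject_key(subject):
    """Subject with list tags and reply prefixes removed, case-folded."""
    return _PREFIX.sub("", subject).strip().casefold()

def sort_orphans(orphans):
    """Order orphans by normalized subject (primary) and date (secondary)."""
    return sorted(orphans, key=lambda m: (subject_key(m["subject"]), m["date"]))

if __name__ == "__main__":
    bucket = [
        {"subject": "Re: cats rule", "date": datetime(2019, 2, 13, 21, 0)},
        {"subject": "[list] cats rule", "date": datetime(2019, 2, 13, 20, 0)},
        {"subject": "AW: Dogs", "date": datetime(2019, 2, 13, 19, 0)},
    ]
    for msg in sort_orphans(bucket):
        print(msg["date"], msg["subject"])
```

Nothing here tries to repair a bad header; it only decides where in the orphans bucket a message is displayed.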
These are heuristics - you might come up with better ones. The idea is that they don't rely on or try to fix any particular fault in the client's posts. Either a message conforms, or it doesn't. If it does, it gets threaded as everyone expects. If it doesn't, it's an orphan, and there's one place to find it. And with a modest effort, this should keep related ones close to each other. It's better than just keeping the orphans sorted by date - the subject tends to be stable. But it's not perfect. E.g. the not-uncommon 'Re: dogs are better (was cats rule)' won't keep the thread together.
But it seems better than nothing, without endless improvement. Once the afflicted users learn that messages they post are put into the Orphans thread, they can complain to their clients' developers. Meantime, life should be tolerable - but annoying enough for them to keep the pressure on for a fix :-)
If it's an unusable munged reference in References, the munged reference may be visible as a placeholder (no real message) in a separate subthread, or pruned (and invisible) because no real message is indicated by it. The indicated message will be in a separate subthread if it has a valid References (including no References, in which case it will be a thread root). And, of course, if it's not the immediate parent, other messages' References fields will likely allow the message to be threaded correctly. AFAICS "stripping and cleaning" an invalid msg-id is highly unlikely to duplicate a valid msg-id associated with a different message, although there's a good chance it won't allow identification of any message at all -- which is where we started.
Yes, and that's why I think it's wasted effort. Opaque means "Opaque". If you don't have one that conforms, use something else, and use the imperfections to get the bug(s) fixed.
At least with an orphans bucket, you don't end up with "invisible" messages. They're just not where you expected. Like the SPAM folder :-) You know where to look, and you know that anything in there means there's a bug to fix.