No. It's not an address. "local part" has to do with an address. In a Message-ID,
It's syntactically a local-part or other stuff. RFC 5322:
4.5.4. Obsolete Identification Fields
The obsolete "In-Reply-To:" and "References:" fields differ from the current syntax in that they allow phrase (words or quoted strings) to appear. The obsolete forms of the left and right sides of msg-id allow interspersed CFWS, making them syntactically identical to local-part and domain, respectively.
It's in "id-left", which is supposed to be something generated by the host that makes the message unique within the namespace defined by "id-right".
Right, those are the *semantics recommended to implementers*, for the reasons which you summarize accurately. But a validating parser doesn't care about that.
Treat the whole thing as an opaque string. Nothing else is safe.
s/else// and I'll agree with you. But safety is not an issue in threading (except for DoS if the procedure might be non-terminating; I suppose you could argue DoS if a user misses out-of-order mail from their boss and gets fired).
Yes, but did you fix the References header to match?
Of course I did. That was long before I was a Mailman developer and started reading the RFCs, of course.
My quote was verbatim. 5537 does say you must trim. The point is that the news rfcs may have introduced References, but they have different obstacles to reconstructing threading.
I don't think that's right, because mail suffers from the same line length restriction, with no exception for References. Mail has strictly more specification bugs. :-(
I haven't looked at that. Given the trimming in 3.4.4 of 5537, I don't see how this can produce the whole thread, unless you are guaranteed to have the complete thread to fill the gaps.
It can't because you aren't. But it seems unlikely that anybody is producing msg-ids longer than 74 characters, so 998/(74+1) = 13 is a long enough gap that you probably don't care that they're the same thread. At least not in any forum I participate in!
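The arithmetic above can be checked with a couple of lines. This is only a sketch of the estimate: the 998-character line limit and the 74-character msg-id bound are the figures quoted in the paragraph above, the single separating space per msg-id is an assumption, and the length of the "References:" field name is ignored.

```python
# Figures from the paragraph above; the one-space separator is an assumption.
MAX_LINE = 998   # maximum line length, in characters
MAX_MSGID = 74   # assumed upper bound on the length of one msg-id

# Each msg-id costs its own length plus one separating space.
ids_per_line = MAX_LINE // (MAX_MSGID + 1)
print(ids_per_line)
```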
Plus, you can lose messages for strange reasons: The moderator deleted one for inappropriate content. A message was copied to the list and the poster. The response to the list is lost. But a reply from the poster happens later. So the response in question is never seen by the server, though the reference is. Life is hard :-)
Sure, but Jamie's algorithm *will get that right* as long as all messages have a semantically correct In-Reply-To or References, and at least one descendant of the missing message mentions both it and its parent. What remains is a UI question: missing message placeholder, to display or not to display?
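To make the placeholder idea concrete, here is a minimal sketch of linking messages by their References chains. It is not Jamie's full algorithm, only a simplified illustration of the linking step. The names `Node`, `is_ancestor` and `link_by_references` are hypothetical, and the input is assumed to be already-parsed `(msg_id, references)` pairs with references listed oldest first.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(eq=False)
class Node:
    msg_id: str
    present: bool = False            # False: placeholder for a message never seen
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def is_ancestor(candidate: Node, node: Optional[Node]) -> bool:
    """True if candidate is node itself or one of node's ancestors."""
    while node is not None:
        if node is candidate:
            return True
        node = node.parent
    return False


def link_by_references(messages: Iterable[Tuple[str, List[str]]]) -> Dict[str, Node]:
    """Build parent/child links from (msg_id, references oldest-first) pairs."""
    table: Dict[str, Node] = {}

    def get(mid: str) -> Node:
        if mid not in table:
            table[mid] = Node(mid)
        return table[mid]

    for msg_id, refs in messages:
        node = get(msg_id)
        node.present = True
        # Each msg-id in References is taken to be the parent of the next one,
        # and the last one is the parent of the message itself.
        chain = [get(r) for r in refs] + [node]
        for parent, child in zip(chain, chain[1:]):
            # Keep an existing link and refuse any link that would form a loop,
            # so the walk over the result always terminates.
            if child.parent is None and not is_ancestor(child, parent):
                child.parent = parent
                parent.children.append(child)
    return table
```

Given `[("<a@x>", []), ("<c@x>", ["<a@x>", "<b@x>"])]`, the lost `<b@x>` ends up as a node with `present` left at `False`, sitting between `<a@x>` and `<c@x>`. Whether to display that node is the UI question.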
Heuristics are fine. Guaranteeing that the resulting algorithm terminates is important. But real life doesn't get you to "enough messages" 100% of the time.
100% isn't necessary. If the thread is sufficiently broken that Jamie's msg-id-based procedure doesn't work, his full algorithm falls back to your idea of collecting the singletons by subject and sorting. (He doesn't specify the ordering criteria, rather what objects (= subthreads of the same parent) are to be compared. Your suggestion of date seems most appropriate for the leftovers.)
See my other post. The left and right halves can, besides being atoms, be non-folding quotes or literals. So you have to handle that.
Of course, but that's just a SMOP. The harder problem is figuring out what to do about non-conforming input.
And, as I've noted: lost input.
Lost input does not prevent thread reconstruction as long as (1) all messages have a Message-ID, (2) "enough" messages have a non-empty References containing "enough" msg-ids, (3) those References do not misorder the msg-ids or introduce msg-ids corresponding to messages not in the thread. It should be obvious why I don't want to define "enough" (except in terms like "enough means Jamie's algorithm can reconstruct the thread" ;-), but in particular losing *one* message will certainly not prevent reconstruction.
When they don't, trying to guess at the semantics of an opaque ID will seem to work for a while. But it amounts to the halting problem. There are lots of variants of "generate a globally unique ID with an '@' in the middle". I may have created a new one today :-)
The problem I'm trying to address with the "stripping" is incorrect copying of msg-ids by MUAs that try to parse them, as we saw here. Apparently the bug was in Mailman itself, so it's not required (and a bad idea, I guess).
Yes. But if they're well-formed, the <> are there, so stripping them only saves a couple of bytes.
If they're not, all bets are off. <email@example.com - stripping the outer <>s doesn't help.
<"two<three"@example.net> is a valid message-Id, and distinct from <"twothree"@example.net.
Sure. Is it likely?
firstname.lastname@example.org distinct from <"twothree"@example.net>? (unnecessary quoting, or a distinct message-ID) If you treat it as opaque, you don't care. Take the whole thing, <@> included, as given and look for it in the other fields. The most you might do is remove quotes (and escapes) and use the left and right parts as your key.
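As an illustration of "remove quotes (and escapes) and use the left and right parts as your key", a sketch might look like the following. The function name `msgid_key` is hypothetical, and it assumes a single well-formed `<left@right>` string. Anything else yields `None`, in which case the caller would fall back to comparing the raw string as given.

```python
import re
from typing import Optional, Tuple


def msgid_key(raw: str) -> Optional[Tuple[str, str]]:
    """Return (id_left, id_right) with quoting removed, or None if the
    string does not look like <left@right>."""
    s = raw.strip()
    if not (s.startswith("<") and s.endswith(">")):
        return None
    body = s[1:-1]

    # Find the first '@' that is not inside a quoted string.
    in_quotes = False
    split = -1
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and in_quotes:
            i += 2                  # skip the escaped character
            continue
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == "@" and not in_quotes:
            split = i
            break
        i += 1
    if split < 0:
        return None

    left, right = body[:split], body[split + 1:]
    # Drop unnecessary quoting: "twothree" and twothree give the same key.
    if len(left) >= 2 and left[0] == '"' and left[-1] == '"':
        left = re.sub(r"\\(.)", r"\1", left[1:-1])
    return (left, right)
```

With this key, `<"twothree"@example.net>` and `<twothree@example.net>` compare equal, while `<"two<three"@example.net>` stays distinct from both.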
Which is basically what my stripping and cleaning procedure would do. The question is "does it help or hurt, on net?" If there was a widely distributed MUA out there that doubled the delimiters, it would help. Since it was Mailman doing that and it will be fixed, it's a bad idea.
You could also fingerprint the user agent (e.g. by the order of headers, format of message-ID), and correct for its bugs. But I'm inclined to report client bugs and get them fixed. Meantime, their messages are unthreaded (but not lost). It's just not worth working around other people's bugs - there are better uses for your time.
That's not my experience. See Reply-To munging. Depends on the bug, of course, but all too often people think it's our job to help them deal with bad MUA design.
Yes. A message ID should exist. These days I think all MTAs will assign one if the client doesn't.
What I tried to say was that there will be times where you can't figure out where to put a message into the thread graph (or forest), because it references a message that you can't find, or it doesn't have a References or In-Reply-To, or those fields are corrupt.
None of those prevent you from threading that message. The only reasons you won't be able to thread a message at all are when no other available message references it (e.g., if several MUAs in a row supply In-Reply-To but not References, and the middle messages are missing or their identification fields are corrupt), and when there are References that have conflicting opinions on where that message belongs. Of course if a message lacks References and In-Reply-To it will be identified as a thread root (unless some descendant manually corrects References ;-). Even then it could be grafted into the tree more or less correctly using your sort on subject and date procedure.
These are heuristics - you might come up with better ones.
Yours are already implemented in Pipermail, I believe; I'm not sure about HyperKitty. I don't think Jamie's algorithm messed with [name] and [serial number] because they weren't common at that time, but he stripped Re: and its nonconforming variants, as well as "Fwd:".
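For concreteness, the subject-and-date fallback for leftovers could be sketched roughly as below. This is not the code in Pipermail, HyperKitty, or Jamie's write-up. The prefix pattern, the names `base_subject` and `group_orphans`, and the `(subject, date, msg_id)` input shape are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

# One leading "Re:", "Re[2]:", "Fw:", "Fwd:" or "[tag]" prefix, case-insensitive.
_PREFIX = re.compile(
    r"^\s*(?:(?:re|fwd?)(?:\[\d+\])?\s*:|\[[^\]]*\])\s*", re.IGNORECASE
)


def base_subject(subject):
    """Strip reply/forward markers and bracketed tags until none remain."""
    prev = None
    while prev != subject:
        prev = subject
        subject = _PREFIX.sub("", subject, count=1)
    # Collapse whitespace and ignore case when comparing subjects.
    return " ".join(subject.split()).lower()


def group_orphans(orphans):
    """Map base subject -> msg-ids sorted by date.

    orphans: iterable of (subject, date, msg_id) for messages that could not
    be placed by their identification fields.
    """
    buckets = defaultdict(list)
    for subject, date, msg_id in orphans:
        buckets[base_subject(subject)].append((date, msg_id))
    return {s: [m for _, m in sorted(items)] for s, items in buckets.items()}
```

Stripping bracketed tags merges threads across lists that share a subject, which may or may not be what you want. It is a heuristic like the others.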
At least with an orphans bucket, you don't end up with "invisible" messages.
Nothing I suggested produces invisible messages, just a different set of orphans or occasionally a misthreaded message that seems to be a duplicate if different msg-ids are munged to the same string. At least not in Jamie's algorithm. I don't know exactly what algorithms are used in Pipermail and HyperKitty.
Interesting discussion, but I've reached the point of diminishing returns. I will be checking to see if any of your suggestions are improvements over what HyperKitty currently does, for sure!