[MM3-users] Re: Implementing threading [was: hyperkitty failed to create a thread]

Feb. 14, 2019


tlhackque writes:

> Not sure why you believe this.  RFC2822 3.6.4 defines References
> for e-mail.

Ouch!  I went to RFC 5322, searched for "References" and was taken to
"Section 7. References".  Evidently it was already positioned at the
end of the appendices. :-(

> As I wrote in later post, the message-ID syntactically requires the <>.
> (just one set).
>
>     msg-id          =       [CFWS] "<" id-left "@" id-right ">" [CFWS]
>
> Look at that for the full syntax and reference.

Sure.  It's messy because the id-left could be a "local-part", which
could be almost anything if quoted, and the id-right could be a domain
literal, which is a little more restricted.

> 2822 does say:
>
>    Therefore, trying to form a
>    "References:" field for a reply that has multiple parents is
>    discouraged and how to do so is not defined in this document.
>
> But programmers are rarely discouraged - with GUIs, it's pretty easy to
> intuit that one might like to check the boxes on several branches of a
> thread and respond "Of course, you're all right - see my cat photo". :-)

Oh, I've done this by hand! ;-)

> 5537 tries to make life easier - until you get to:

5322 has adopted the 5537 language for the abstract construction of
the field.  It doesn't say anything at all about the logical length
issue and trimming.

> The best one can do is to come up with a "good enough"
> approximation in the available time, and tinker with/improve it
> until bored.

Which is what Jamie Zawinski did.  His algorithm (adopted by IMAP as
the standard for threading IMAP servers) has three features: (1) it's
an algorithm (guaranteed to terminate on finite input :-), (2) it
allows for various tie-breaking methods, and (3) if you have enough
messages from the thread and all In-Reply-To and References conform to
the 5537 language, it will be consistent with all References data.
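
A rough Python sketch of the References-linking idea may help here.
It is a much-reduced illustration, not jwz's full algorithm: there is
no subject grouping, no pruning of empty containers, and no
tie-breaking.  The name build_threads and the (msg_id, references)
input shape are assumptions made for the example.

```python
def build_threads(messages):
    """Link messages into threads from their References chains.

    messages: iterable of (msg_id, references) pairs, where references
    is a list of msg-ids, oldest ancestor first (hypothetical shape).
    Returns (parent, roots, placeholders).
    """
    parent = {}   # child id -> parent id
    nodes = {}    # id -> True for a real message, False for a placeholder

    def is_ancestor(a, b):
        # True if a is b itself or one of b's ancestors.
        while b is not None:
            if b == a:
                return True
            b = parent.get(b)
        return False

    for msg_id, refs in messages:
        nodes[msg_id] = True
        chain = [r for r in refs if r != msg_id]
        for ref in chain:
            nodes.setdefault(ref, False)
        # Link consecutive references, but never overwrite an existing
        # parent and never create a loop.
        for older, newer in zip(chain, chain[1:]):
            if newer not in parent and not is_ancestor(newer, older):
                parent[newer] = older
        # The message's own last reference wins over earlier guesses.
        if chain and not is_ancestor(msg_id, chain[-1]):
            parent[msg_id] = chain[-1]

    roots = [i for i in nodes if i not in parent]
    placeholders = [i for i, is_real in nodes.items() if not is_real]
    return parent, roots, placeholders
```

A referenced id with no real message behind it ends up in
placeholders, and contradictory References (two messages each naming
the other as parent) are resolved in favor of whichever link was made
first rather than looping.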

> See my other post.  The left and right halves can, besides being atoms,
> be non-folding quotes or literals.  So you have to handle that.

Of course, but that's just a SMOP.  The harder problem is figuring out
what to do about non-conforming input.

> Although it has a defined syntax, the semantics are that message-id is
> an opaque globally-unique identifier.  Attempting to parse it as
> anything but '<[^>]+>'  is likely to be a mistake.

That's true in a world where people actually follow the rules.
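
For illustration, a minimal Python sketch of the opaque-token
approach from the quoted text: pull every '<...>' token out of a
References or In-Reply-To value without looking inside it.  The
function name extract_msg_ids is hypothetical, and this is the quoted
pattern only, not a full RFC 5322 parser.

```python
import re

# The pattern from the quoted text: "<", then anything up to the next ">".
MSG_ID_RE = re.compile(r"<[^>]+>")

def extract_msg_ids(header_value):
    """Return the msg-id tokens found in a header value, in order."""
    return MSG_ID_RE.findall(header_value)
```

On conforming input, the separators (spaces, tabs, folding) simply
fall away.  On something like "<<x@y>>" the same pattern returns
"<<x@y>", which is exactly the point where following the rules stops
helping.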

> I doubt 'stripping' is worth doing - when a message-id is present,
> I've almost always seen it include the required <>.

Since the delimiters are constants and required, consistently
stripping them doesn't hurt with a well-formed msg-id (the delimiters
aren't allowed in a msg-id).  And you do have to strip the whitespace,
because it's likely that different MUAs will do different things with
it since they edit the old value (space-separated, tab-separated).
The point is to get to the unique content.
It's also possible to keep the literal content (after stripping
whitespace) as well as the "cleaned" version, and use whichever one
corresponds to a real message or to a different value of References in
the same thread.  It would be amusing if *both* were associated with
real messages, but that seems unlikely.
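
A small Python sketch of that "keep both" idea, under assumptions:
the names candidate_ids and resolve are hypothetical, and known
stands for whatever collection of stored Message-ID strings the
archiver has.

```python
def candidate_ids(raw):
    """Return (literal, cleaned) forms of one reference token."""
    literal = raw.strip()                   # whitespace removed, nothing else
    cleaned = literal.strip("<>").strip()   # delimiters removed as well
    return literal, cleaned

def resolve(raw, known):
    """Return whichever form of raw is present in known, else None.

    The literal form is tried first, then the cleaned one.
    """
    for candidate in candidate_ids(raw):
        if candidate in known:
            return candidate
    return None
```

For example, resolve(" <<a@b.net>> ", {"a@b.net"}) finds the cleaned
form, while resolve("<a@b.net>\t", {"<a@b.net>"}) matches on the
literal one.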

> Until yesterday's hyperkitty bug, I've never seen <<>>.

At least one of my correspondents has a client that frequently
creates "addresses" of the form "<a@b.net <a@b.net>>".  I don't recall
seeing "<<...>>" before, but I rarely look at Message-IDs.

> I've rarely seen the <> missing - but I'd consider that a warning
> that the message format is suspect.  As in this case.

Agreed.

> My suggestion is that if you can't find a message-ID,

A Message-ID field, or a valid msg-id in the References or In-Reply-To
fields?

> keep the message in an "unthreaded" bucket.  If you sort that by
> subject (omitting "[listtag]" and "Re:*" (case insensitive, and in
> multiple languages)), then by date, you probably have a useful
> presentation.  And a list of MUAs that need bug reports :-)

The thing is that a message without a Message-ID (which I believe
should not happen in Mailman; IIRC Mailman will assign one before
distributing) is going to end up being a singleton thread.  If
there are replies to it, I don't see why they would be likely to be
temporally adjacent.  If they have proper replies, they will end up as
proper thread roots, not in the unthreaded bucket.  Am I missing
something?
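
For reference, the quoted sort (by subject with "[listtag]" and reply
prefixes removed, then by date) might look roughly like this in
Python.  The prefix list is an assumption and covers only a few
languages; subject_key and sort_unthreaded are hypothetical names.

```python
import re

# Leading "[listtag]" blocks and reply prefixes, repeated in any order.
# Only a handful of prefixes are listed here; a real list would be longer.
PREFIX_RE = re.compile(
    r"^\s*(?:\[[^\]]*\]\s*|(?:re|aw|sv|antw)\s*:\s*)+",
    re.IGNORECASE,
)

def subject_key(subject):
    """Subject with list tags and reply prefixes removed, lowercased."""
    return PREFIX_RE.sub("", subject).strip().lower()

def sort_unthreaded(messages):
    """Sort (subject, date) pairs by cleaned subject, then by date."""
    return sorted(messages, key=lambda m: (subject_key(m[0]), m[1]))
```

With this key, "Re: [MM3-users] Re: Implementing threading" and
"Implementing threading" sort together.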

If it's an unusable munged reference in References, the munged
reference may be visible as a placeholder (no real message) in a
separate subthread, or pruned (and invisible) because no real message
is indicated by it.  The indicated message will be in a separate
subthread if it has a valid References (including no References, in
which case it will be a thread root).  And, of course, if it's not the
immediate parent, other messages' References fields will likely allow
the message to be threaded correctly.  AFAICS "stripping and cleaning"
an invalid msg-id is highly unlikely to duplicate a valid msg-id
associated with a different message, although there's a good chance it
won't allow identification of any message at all -- which is where we
started.
--
Associate Professor              Division of Policy and Planning Science
http://turnbull.sk.tsukuba.ac.jp/     Faculty of Systems and Information
Email: turnbull@sk.tsukuba.ac.jp                   University of Tsukuba
Tel: 029-853-5175                 Tennodai 1-1-1, Tsukuba 305-8573 JAPAN

Stephen J. Turnbull