[MM3-users] Re: Implementing threading [was: hyperkitty failed to create a thread]

Feb. 12, 2019 · *don't*

      Notes inline.
On 12-Feb-19 11:09, Stephen J. Turnbull wrote:
...
tlhackque writes:
...
I believe that for threading to work more or less reliably, the
thing to do is to look at the 'References' header.
That should give you the thread in order, and allow any message
received out of sequence to be put in the proper location in your
display.
This is more difficult than it seems, since References is not defined
for mail as far as I know (it's a netnews concept originally, and RFC
5537 is a netnews-specific RFC).
Not sure why you believe this.  RFC2822 3.6.4 defines References for e-mail.
See https://tools.ietf.org/html/rfc2822#page-25
As I wrote in a later post, the message-ID syntactically requires the <>
(just one set).
msg-id          =       [CFWS] "<" id-left "@" id-right ">" [CFWS]
Look at that for the full syntax and reference.
...
Although it is adopted by most mail
clients, there's no guarantee of strict conformance.
I agree that conformance has traditionally been hit-or-miss with almost
everything about e-mail.  Sigh.
I also agree that life is hard.  There have been clients that use the
same message-id for all messages, and it's a long-standing problem. E.g.
See https://cr.yp.to/immhf/thread.html
Not only do you not have a guarantee of ever getting all links, you
don't have a guarantee of the order in which they arrive.  So one can
have gaps (parent, reply2-delayed, reply 3 references reply 2 and
parent, reply 2 might arrive - or not.)  You also can have a reply that
references 2 or more branches.  E.g.
    /--R0
p/--r1 

\ --r2 /
And a reply (r3) comes that references r1 AND r2 (and the parent) -
which you probably want to show as a descendant of both r1 AND r2.  R4
replies to r2 - but that's also an implicit response to r1.
Once you've sorted that out, r5 comes along that references R0, r4, and P.
2822 does say:
Therefore, trying to form a
   "References:" field for a reply that has multiple parents is
   discouraged and how to do so is not defined in this document.
But programmers are rarely discouraged - with GUIs, it's pretty easy to
intuit that one might like to check the boxes on several branches of a
thread and respond "Of course, you're all right - see my cat photo". :-)
5537 tries to make life easier - until you get to:
If the resulting References header field would, after unfolding,
   exceed 998 characters in length (including its field name but not the
   final CRLF), it MUST be trimmed (and otherwise MAY be trimmed).
   Trimming means removing any number of message identifiers from its
   content, except that the first message identifier and the last two
   MUST NOT be removed.
So, even if you get all the messages, you don't necessarily get all the
references that you need.
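
A rough sketch of the trimming rule quoted above, in Python. The function name and the choice of *which* identifiers to drop (oldest unprotected first) are assumptions for illustration; the quoted text only says the first and the last two must survive.

```python
def trim_references(msg_ids, limit=998):
    """Drop middle Message-IDs until the unfolded field fits in `limit`."""
    ids = list(msg_ids)

    def field_length(items):
        # Field name, colon and space, then the IDs joined by single
        # spaces; the final CRLF is not counted.
        return len("References: " + " ".join(items))

    # The first ID and the last two are protected, so only index 1 up to
    # len(ids) - 3 may be removed. Dropping the oldest unprotected ID
    # first is an assumption of this sketch.
    while field_length(ids) > limit and len(ids) > 3:
        del ids[1]
    return ids
```

Anything deleted here is exactly the information a threading archiver on the receiving end no longer has.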
I'm sure glad that I'm not working on Hyperkitty!  But I'm not holding
my breath for when MM3 is complete enough to adopt in my environment.
Yes, "In-Reply-To" is more often present (likely because it's modestly simpler for RFC-readers to
understand - it's one thing, not a list.)  It's less expressive (hence, informative.)  One can
try to use it as a fallback when no References is present (and that's what 2822 implies).
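
A minimal sketch of that fallback, using Python's standard `email` package. The helper name `ancestor_ids` is made up for illustration, and this is not a claim about what HyperKitty does.

```python
import re
from email import message_from_string

# Treat each <...> token as one opaque identifier.
MSG_ID = re.compile(r"<[^>]+>")

def ancestor_ids(msg):
    """Return the ancestor Message-IDs named by a parsed message."""
    # References is a list, oldest first; prefer it when it is present.
    refs = MSG_ID.findall(str(msg.get("References", "")))
    if refs:
        return refs
    # Otherwise fall back to In-Reply-To.
    return MSG_ID.findall(str(msg.get("In-Reply-To", "")))
```

For example, `ancestor_ids(message_from_string(raw_text))` on a reply that carries only In-Reply-To returns just the direct parent.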
The whole undertaking in the real world is an example of the halting problem.  Or Parkinson's Law.
The best one can do is to come up with a "good enough" approximation in the available time, and
tinker with/improve it until bored.
It's likely to be a bunch of heuristics.  And if you're not careful, throw in the Peter principle :-(
...
In mail,
In-Reply-To is more reliably present, but of course asynchronicity
means you have no guarantee of a complete set of all links.  Even if
all clients conform to the netnews RFC, you still need to create the
full tree (full conformance to RFC 5537 means it will be a tree), and
break ties between branches in some arbitrary way to create a total
order.  It's worse if you *don't* have conformance from all clients:
you can get a DAG or even something that isn't even a DAG.  So, even
today, you can't assume a well-behaved ancestry graph.
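
One way to check which of those cases you are in, sketched in Python. The input shape (a dict mapping each Message-ID to the IDs it names as parents) and both function names are assumptions for illustration.

```python
def has_cycle(parents):
    """True if following parent links from some message loops back."""
    state = {}  # Message-ID -> "visiting" or "done"

    def visit(node):
        if state.get(node) == "done":
            return False
        if state.get(node) == "visiting":
            return True  # reached a node that is still on the current path
        state[node] = "visiting"
        for parent in parents.get(node, ()):
            if visit(parent):
                return True
        state[node] = "done"
        return False

    return any(visit(node) for node in list(parents))

def is_tree_shaped(parents):
    """True if there is no cycle and no message has more than one parent."""
    single_parent = all(len(p) <= 1 for p in parents.values())
    return single_parent and not has_cycle(parents)
```

A reply with two parents fails `is_tree_shaped` but passes the cycle check (a DAG); two messages that name each other fail both.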
...
'References' should be a superset of In-Reply-To (which is at most 1
Message-ID), so you only need In-Reply-To if there is no References - or to
handle a client that doesn't obey the RFCs.
I suppose HyperKitty uses References (it works for messages that have
proper Message-IDs ;-), but I don't know what algorithm it uses.
Might be worth looking into, as well as considering a more Postelian
parsing of Message-IDs.  Specifically, take the field body, unfold it,
strip leading and trailing whitespace and leading "<" and trailing
">", and whatever's left is the message ID.
Alternatively, strip everything that's not atext or "@" (including
inside the purported Message-ID).
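
The first variant (unfold, strip whitespace, strip the angle brackets) might look like this in Python. The function name is made up, and stripping *every* leading "<" and trailing ">" rather than just one is an assumption of this sketch.

```python
import re

def lenient_message_id(field_body):
    """Unfold, strip whitespace, then strip leading '<' and trailing '>'."""
    # Unfold: drop any line break that is followed by a space or tab.
    unfolded = re.sub(r"\r?\n(?=[ \t])", "", field_body)
    core = unfolded.strip()
    # Remove all leading "<" and all trailing ">" (assumed behaviour).
    return core.lstrip("<").rstrip(">")
```

With this, `<abc@example.org>`, `<<abc@example.org>>` and a bare `abc@example.org` all map to the same key.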
See my other post.  The left and right halves can, besides being atoms,
be non-folding quotes or literals.  So you have to handle that. 
Although I have recently seen e-mail clients that are confused when you
do it. (Ran into
that generating content-id for multipart/related.  Yes, MIME makes plain
e-mail look straightforward.)
Although it has a defined syntax, the semantics are that message-id is
an opaque globally-unique identifier.  Attempting to parse it as
anything but '<[^>]+>'  is likely to be a mistake.  (On receipt;
generating is the other side
of Postel-correctness...  I commend RFC 2468 to anyone who hasn't read it.)
I doubt 'stripping' is worth doing - when a message-id is present, I've
almost always seen it include the required <>.
Until yesterday's hyperkitty bug, I've never seen <<>>.  I've rarely
seen the <> missing - but I'd consider that a warning that the message
format is suspect.  As in this case.
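
A sketch of that "opaque on receipt, but flag anything odd" approach in Python. The function name and the (id, suspect) return shape are assumptions. Note that it also flags a syntactically legal field carrying a trailing comment, which may be stricter than you want.

```python
import re

# On receipt, an identifier is whatever sits between "<" and the next ">".
OPAQUE_ID = re.compile(r"<[^>]+>")

def receive_message_id(field_body):
    """Return (msg_id, suspect); msg_id is None when no <...> is found."""
    match = OPAQUE_ID.search(field_body)
    if match is None:
        # No angle brackets at all: no usable ID, and a suspect message.
        return None, True
    # Anything besides the single <...> token (e.g. "<<...>>") is a warning.
    suspect = field_body.strip() != match.group(0)
    return match.group(0), suspect
```

The returned string is only ever compared for equality or used as a dictionary key; nothing looks inside it.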
My suggestion is that if you can't find a message-ID, keep the message
in an "unthreaded" bucket.  If you sort that by subject (omitting
"[listtag]" and "Re:*" (case insensitive, and in multiple languages)
then by date, you probably have a useful presentation.  And a list of
MUAs that need bug reports :-)
(Yes, although 2822 says that "Re:" is the proper introducer in the
subject of a reply, RE: is common, and I've seen it (incorrectly)
translated in the headers.)
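
The "unthreaded bucket" ordering could be sketched like this in Python. The prefix list ("aw", "sv", "antw" alongside "re") is an assumed, incomplete sample of translated reply prefixes, and the pattern strips *any* leading bracketed tag, not just the list's own.

```python
import re

# "re" is from the text above; the other prefixes are assumed examples.
REPLY_PREFIX = re.compile(r"^\s*(?:re|aw|sv|antw)\s*:\s*", re.IGNORECASE)
LIST_TAG = re.compile(r"^\s*\[[^\]]*\]\s*")

def subject_key(subject):
    """Strip leading list tags and reply prefixes, then lowercase."""
    previous = None
    while subject != previous:  # repeat until nothing more comes off
        previous = subject
        subject = LIST_TAG.sub("", subject)
        subject = REPLY_PREFIX.sub("", subject)
    return subject.strip().lower()

def sort_unthreaded(messages):
    """Sort (subject, date) pairs by normalized subject, then by date."""
    return sorted(messages, key=lambda m: (subject_key(m[0]), m[1]))
```

Dates can be anything mutually comparable, e.g. `datetime` objects from `email.utils.parsedate_to_datetime`.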
As I said, "life is hard".
...
This won't break any RFC 5537-valid
Message-IDs, but might identify two different, nonconforming
Message-IDs as the same (too bad if they can't take a joke!), or
identify a nonconforming message with a conforming one (sad emoji).
Thoughts?
Steve
