[MM3-users] Re: Implementing threading [was: hyperkitty failed to create a thread]

Feb. 15, 2019


      tlhackque writes:
...
No.  It's not an address.  "local part" has to do with an address.  in a
Message-ID,
It's syntactically a local-part or other stuff.  RFC 5322:
4.5.4.  Obsolete Identification Fields
The obsolete "In-Reply-To:" and "References:" fields differ from the
current syntax in that they allow phrase (words or quoted strings) to
appear.  The obsolete forms of the left and right sides of msg-id
allow interspersed CFWS, making them syntactically identical to
local-part and domain, respectively.
...
It's in "id-left", which is supposed to be something generated by the
host that makes the message unique within the namespace defined by
"id-right".
Right, those are the *semantics recommended to implementers*, for the
reasons which you summarize accurately.  But a validating parser
doesn't care about that.
...
Treat the whole thing as an opaque string.  Nothing else is safe.
s/else// and I'll agree with you.  But safety is not an issue in
threading (except for DoS if the procedure might be non-terminating; I
suppose you could argue DoS if a user misses out-of-order mail from
their boss and gets fired).
...
Yes, but did you fix the References header to match?
Of course I did.  That was long before I was a Mailman developer and
started reading the RFCs, of course.
...
My quote was verbatim.  5537 does say you must trim.  The point is
that the news RFCs may have introduced References, but they have
different obstacles to reconstructing threading.
I don't think that's right, because mail suffers from the same line
length restriction, with no exception for References.  Mail has
strictly more specification bugs. :-(
...
I haven't looked at that.  Given the trimming in 3.4.4 of 5537, I
don't see how this can produce the whole thread, unless you are
guaranteed to have the complete thread to fill the gaps.
It can't because you aren't.  But it seems unlikely that anybody is
producing msg-ids longer than 74 characters, so 998/(74+1) = 13 is a
long enough gap that you probably don't care that they're the same
thread.  At least not in any forum I participate in!
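The arithmetic above can be checked with a few lines of Python.  This
is only a sketch: the 998-character limit is the RFC 5322 maximum line
length, the 74-character msg-id length is the guess made above, and the
function name is made up for illustration.

```python
# Assumptions (not from any Mailman or HyperKitty code):
#   MAX_LINE  - RFC 5322 hard limit on line length, excluding CRLF
#   MAX_MSGID - the guessed upper bound on msg-id length used above
MAX_LINE = 998
MAX_MSGID = 74


def ids_per_line(max_line=MAX_LINE, max_msgid=MAX_MSGID):
    """How many msg-ids, each followed by one separator, fit on one line."""
    return max_line // (max_msgid + 1)


print(ids_per_line())  # 13
```

Subtracting the 12 characters of "References: " from the budget still
gives 13, so the estimate does not depend on that detail.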
...
Plus, you can lose messages for strange reasons: The moderator
deleted one for inappropriate content.  A message was copied to the
list and the poster.  The response to the list is lost.  But a
reply from the poster happens later.  So the response in question
is never seen by the server, though the reference is.  Life is hard
:-)
Sure, but Jamie's algorithm *will get that right* as long as all
messages have a semantically correct In-Reply-To or References, and at
least one descendent of the missing message mentions both it and its
parent.  What remains is a UI question: missing message placeholder,
to display or not to display?
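To make the "missing message" case concrete, here is a minimal sketch
of the References-linking step, loosely modelled on Jamie's published
description.  It is not his full algorithm and not what Pipermail or
HyperKitty do; the names Container, link and build_threads are
hypothetical, and the input is assumed to be already parsed into
(msgid, references) pairs with references ordered oldest first.

```python
class Container:
    """One node in the thread tree; seen=False marks a placeholder."""

    def __init__(self, msgid):
        self.msgid = msgid
        self.seen = False      # True once the actual message has arrived
        self.parent = None
        self.children = []


def is_ancestor(a, b):
    """True if a is b or one of b's ancestors (guards against loops)."""
    while b is not None:
        if b is a:
            return True
        b = b.parent
    return False


def link(parent, child):
    """Attach child under parent unless it is already placed or would loop."""
    if child.parent is not None or is_ancestor(child, parent):
        return
    child.parent = parent
    parent.children.append(child)


def build_threads(messages):
    """messages: iterable of (msgid, [reference msgids, oldest first])."""
    table = {}

    def get(msgid):
        if msgid not in table:
            table[msgid] = Container(msgid)
        return table[msgid]

    for msgid, refs in messages:
        node = get(msgid)
        node.seen = True
        chain = [get(r) for r in refs]
        # Link each reference under the one before it.
        for parent, child in zip(chain, chain[1:]):
            link(parent, child)
        if chain:
            # The last reference is taken as the parent of this message,
            # overriding any earlier guess made from someone else's header.
            if node.parent is not None:
                node.parent.children.remove(node)
                node.parent = None
            link(chain[-1], node)
    return [c for c in table.values() if c.parent is None]


# "b@x" was never received, but "c@x" mentions both it and its parent.
roots = build_threads([("a@x", []), ("c@x", ["a@x", "b@x"])])
placeholder = roots[0].children[0]
print(placeholder.msgid, placeholder.seen)  # b@x False
```

The placeholder for "b@x" sits between "a@x" and "c@x"; whether to
display it is exactly the UI question raised above.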
...
Heuristics are fine.  Guaranteeing that the resulting algorithm
terminates is important.  But real life doesn't get you to "enough
messages" 100% of the time.
100% isn't necessary.  If the thread is sufficiently broken that
Jamie's msg-id-based procedure doesn't work, his full algorithm falls
back to your idea of collecting the singletons by subject and sorting.
(He doesn't specify the ordering criteria, rather what objects (=
subthreads of the same parent) are to be compared.  Your suggestion of
date seems most appropriate for the leftovers.)
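A rough sketch of that fallback, under the assumption that the
leftovers are available as (subject, date, msgid) tuples.  The function
name and the simple lower-casing of the subject are illustrative only;
a real implementation would normalize subjects more carefully.

```python
from collections import defaultdict
from datetime import datetime


def group_leftovers(leftovers):
    """Group unthreaded messages by subject, oldest first within a group.

    leftovers: iterable of (subject, date, msgid) tuples (assumed shape).
    Returns a dict mapping the subject key to a list of msgids.
    """
    groups = defaultdict(list)
    for subject, date, msgid in leftovers:
        key = subject.strip().lower()      # crude subject key
        groups[key].append((date, msgid))
    return {key: [msgid for _, msgid in sorted(items)]
            for key, items in groups.items()}


leftovers = [
    ("Hello", datetime(2019, 2, 15), "<b@x>"),
    ("hello ", datetime(2019, 2, 14), "<a@x>"),
    ("Other", datetime(2019, 2, 13), "<c@x>"),
]
print(group_leftovers(leftovers))
# {'hello': ['<a@x>', '<b@x>'], 'other': ['<c@x>']}
```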
...
See my other post.  The left and right halves can, besides being atoms,
be non-folding quotes or literals.  So you have to handle that.
Of course, but that's just a SMOP.  The harder problem is figuring out
what to do about non-conforming input.
...
And, as I've noted: lost input.
Lost input does not prevent thread reconstruction as long as (1) all
messages have a Message-ID, (2) "enough" messages have a non-empty
References containing "enough" msg-ids, (3) those References do not
misorder the msg-ids or introduce msg-ids corresponding to messages
not in the thread.  It should be obvious why I don't want to define
"enough" (except in terms like "enough means Jamie's algorithm can
reconstruct the thread" ;-), but in particular losing *one* message
will certainly not prevent reconstruction.
...
When they don't, trying to guess at the semantics of an opaque ID will
seem to work for a while.  But it amounts to the halting problem.  There
are lots of variants of "generate a globally unique ID with an '@' in
the middle".  I may have created a new one today :-)
The problem I'm trying to address with the "stripping" is incorrect
copying of msg-ids by MUAs that try to parse them, as we saw here.
Apparently the bug was in Mailman itself, so it's not required (and a
bad idea, I guess).
...
Yes.  But if they're well-formed, the <> are there, so stripping them
only saves a couple of bytes.
If they're not, all bets are off.  <two<three@example.net> - stripping
the outer <>s doesn't help.
<"two<three"@example.net>  is a valid message-Id,
and distinct from <"twothree"@example.net>.
Sure.  Is it likely?
...
<twothree@example.net> distinct from <"twothree"@example.net>?
(unnecessary quoting, or a distinct message-ID)  If you treat it as
opaque, you don't care.  Take the whole thing, <@> included, as given
and look for it in the other fields.  The most you might do is remove
quotes (and escapes) and use the left and right parts as your key. 
Which is basically what my stripping and cleaning procedure would do.
The question is "does it help or hurt, on net?"  If there was a widely
distributed MUA out there that doubled the delimiters, it would help.
Since it was Mailman doing that and it will be fixed, it's a bad idea.
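For reference, the two choices being weighed look roughly like this.
This is a sketch with hypothetical function names, not the procedure
Mailman or HyperKitty uses; it does not handle CFWS or a domain literal
containing "@", and it assumes the msg-id has already been extracted
from the header.

```python
import re


def opaque_key(msgid):
    """Use the msg-id exactly as given, <...> included."""
    return msgid.strip()


def unquoted_key(msgid):
    """Strip the outer <>, then remove quotes and escapes from the left part.

    Returns a (left, right) tuple; right is "" if there is no "@".
    """
    s = msgid.strip()
    if s.startswith("<") and s.endswith(">"):
        s = s[1:-1]
    left, sep, right = s.rpartition("@")
    if not sep:
        return (s, "")
    if len(left) >= 2 and left[0] == '"' and left[-1] == '"':
        # Drop the surrounding quotes and undo backslash escapes.
        left = re.sub(r"\\(.)", r"\1", left[1:-1])
    return (left, right)


plain = "<twothree@example.net>"
quoted = '<"twothree"@example.net>'
print(opaque_key(plain) == opaque_key(quoted))      # False
print(unquoted_key(plain) == unquoted_key(quoted))  # True
print(unquoted_key('<"two<three"@example.net>'))    # ('two<three', 'example.net')
```

The opaque key keeps the two forms distinct; the unquoted key merges
them, which is where the "help or hurt, on net" question comes from.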
...
You could also fingerprint the user agent (e.g. by the order of headers,
format of message-ID), and correct for its bugs.  But I'm inclined to
report client bugs and get them fixed.  Meantime, their messages are
unthreaded (but not lost).  It's just not worth working around other
people's bugs - there are better uses for your time.
That's not my experience.  See Reply-To munging.  Depends on the bug,
of course, but all too often people think it's our job to help them
deal with bad MUA design.
...
Yes. A message ID should exist.  These days I think all MTAs will assign
one if the client doesn't.
...
What I tried to say was that there will be times where you can't figure
out where to put a message into the thread graph (or forest). Because it
references a message that you can't find, or it doesn't have a references
or in-reply-to, or those fields are corrupt.
None of those prevent you from threading that message.  The only
reasons you won't be able to thread a message at all are when no other
available message references it (eg, if several MUAs in a row supply
In-Reply-To but not References, and the middle messages are missing or
their identification fields are corrupt), and when there are
References that have conflicting opinions on where that message
belongs.  Of course if a message lacks References and In-Reply-To it
will be identified as a thread root (unless some descendent manually
corrects References ;-).  Even then it could be grafted into the tree
more or less correctly using your sort on subject and date procedure.
...
These are heuristics - you might come up with better ones.
Yours are already implemented in Pipermail, I believe; I'm not sure
about HyperKitty.  I don't think Jamie's algorithm messed with [name]
and [serial number] because they weren't common at that time, but he
stripped Re: and its nonconforming variants, as well as "Fwd:".
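As an illustration of that kind of subject cleaning, here is a small
sketch.  The regular expression and the function name are assumptions
made for the example, not what Pipermail, HyperKitty or Jamie's code
actually use; in particular, it strips any leading [bracketed] tag,
which is more aggressive than stripping only the list's own [name].

```python
import re

# One leading prefix: "Re:", "RE[2]:", "Fwd:", "Fw:", or a "[tag]".
_PREFIX = re.compile(
    r"^\s*(?:(?:re|fwd?)\s*(?:\[\d+\])?\s*:|\[[^\]]*\])\s*",
    re.IGNORECASE,
)


def base_subject(subject):
    """Repeatedly strip reply/forward prefixes and bracketed tags."""
    prev = None
    s = subject
    while s != prev:           # stop once a pass changes nothing
        prev = s
        s = _PREFIX.sub("", s, count=1)
    return s.strip()


print(base_subject("Re: [MM3-users] Re: Implementing threading"))
# Implementing threading
```

Each successful pass removes at least one character, so the loop
always terminates.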
...
At least with an orphans bucket, you don't end up with "invisible"
messages.
Nothing I suggested produces invisible messages, just a different set
of orphans or occasionally a misthreaded message that seems to be a
duplicate if different msg-ids are munged to the same string.  At
least not in Jamie's algorithm.  I don't know exactly what algorithms
are used in Pipermail and HyperKitty.
Interesting discussion, but I've reached the point of diminishing
returns.  I will be checking to see if any of your suggestions are
improvements over what HyperKitty currently does, for sure!
Steve

Stephen J. Turnbull