[MM3-users] Re: Implementing threading [was: hyperkitty failed to create a thread]

Feb. 14, 2019


      On 13-Feb-19 21:42, Stephen J. Turnbull wrote:
...
...
As I wrote in later post, the message-ID syntactically requires the <>. 
(just one set).
msg-id          =       [CFWS] "<" id-left "@" id-right ">" [CFWS]
Look at that for the full syntax and reference.
Sure.  It's messy because the id-left could be a "local-part", which
could be almost anything if quoted, and the id-right could be a domain
literal, which is a little more restricted.
No.  It's not an address.  "Local part" has to do with an address.  In a
Message-ID, it's "id-left", which is supposed to be something generated by the
host that makes the message unique within the namespace defined by
"id-right".  And while the rfc recommends that be a domain name, it
needn't be.  In fact, as one of the references I used pointed out, not
all hosts have them.
One could generate a UUID and use that for both id-left and id-right &
be conforming (though given some client bugs, I'd change the '-' in the
standard UUID presentation to '.', or just delete it.)  Or one could
generate a UUID for the right part, use it for all messages sent by a
client, and use something else for the left - like a hash of the
message+headers.  I tend to use a domain name for the right part, and
UUID for the left.  Then I prefix the left with a number when I need a
Content-ID for a body part.  The UUID pretty much makes the right part
unnecessary, but since it's syntactically required, when I have a domain
name it can be useful for forensics.  When I don't, the UUID is fine.
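
A minimal sketch of the generation scheme described above, assuming Python's standard uuid module. The names make_msg_id and make_content_id are hypothetical, and the "number plus dot" prefix for Content-IDs is just one way to read "prefix the left with a number".

```python
import uuid


def make_msg_id(domain=None, use_dots=True):
    """Return a msg-id of the form <id-left@id-right>.

    id-left is a fresh UUID.  id-right is the given domain name, or a
    second UUID when no domain name is available.
    """
    def token():
        # Replace (or delete) the '-' of the standard UUID presentation,
        # since some clients reportedly mishandle it.
        return str(uuid.uuid4()).replace("-", "." if use_dots else "")

    id_left = token()
    id_right = domain if domain else token()
    return f"<{id_left}@{id_right}>"


def make_content_id(msg_id, part_number):
    """Derive a Content-ID for a body part by prefixing id-left with a number."""
    # msg_id[1:] drops the leading '<' so the prefix lands inside the brackets.
    return f"<{part_number}.{msg_id[1:]}"


if __name__ == "__main__":
    mid = make_msg_id("example.net")
    print(mid)
    print(make_content_id(mid, 1))
```

With a domain name the right part helps forensics; without one, make_msg_id() falls back to a UUID on both sides.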
The advantage of using a domain name for id-right is that there is a
global registration system (DNS) that makes it unlikely for conflicts to
occur.  And that was true for RFC733 time, when hosts were king.  Modulo
all the "example.com".  But once PCs came around, it doesn't work as
well.  DHCP-assigned names, un-named clients just don't provide the same
uniqueness as bbn-tenex.com used to.
I think the '@' was just syntactic sugar - if you accept a domain name
for id-right, then, as in an e-mail address, it indicates the scope of
the unique part (id-left).  But just because it can look like an e-mail
address doesn't mean it is one.
Treat the whole thing as an opaque string.  Nothing else is safe.
...
...
2822 does say:
Therefore, trying to form a
   "References:" field for a reply that has multiple parents is
   discouraged and how to do so is not defined in this document.
But programmers are rarely discouraged - with GUIs, it's pretty easy to
intuit that one might like to check the boxes on several branches of a
thread and respond "Of course, you're all right - see my cat photo". :-)
Oh, I've done this by hand! ;-)
Yes, but did you fix the References header to match?
...
5537 tries to make life easier - until you get to:
5322 has adopted the 5537 language for the abstract construction of
the field.  It doesn't say anything at all about the logical length
issue and trimming.
Yes, but you brought up the news RFCs, IIRC in saying that they defined
threading.
My quote was verbatim.  5537 does say you must trim.  The point is that
the news rfcs may have introduced References, but they have different
obstacles to reconstructing threading.
...
...
The best one can do is to come up with a "good enough"
approximation in the available time, and tinker with/improve it
until bored.
Which is what Jamie Zawinski did.  His algorithm (adopted by IMAP as
the standard for threading IMAP servers) has three features (1) it's
an algorithm (guaranteed to terminate on finite input :-), (2) it
allows for various tie-breaking methods, and (3) if you have enough
messages from the thread and all In-Reply-To and References conform to
the 5537 language, it will be consistent with all References data.
I haven't looked at that.  Given the trimming in 3.4.4 of 5537, I don't
see how this can produce the whole thread, unless you are guaranteed
to have the complete thread to fill the gaps.  And you're not.  (3) is
the issue - you need "enough" and in the worst case, you get 3
references (First and last two).   Plus, you can lose messages for
strange reasons: The moderator deleted one for inappropriate content.  A
message was copied to the list and the poster.  The response to the list
is lost.  But a reply from the poster happens later.  So the response in
question is never seen by the server, though the reference is.  Life is
hard :-)
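
To make the worst case concrete, here is a rough sketch of building a reply's References field with trimming that keeps the first and the last two identifiers.  The function name build_references is hypothetical, the 998-character default is my reading of the limit and should be treated as an assumption, and dropping the oldest middle identifiers first is my own choice; the RFC leaves that to the implementation.

```python
import re


def build_references(parent_references, parent_msg_id, max_len=998):
    """Return the References value for a reply to the given parent.

    The parent's own References (if any) are copied, the parent's msg-id
    is appended, and identifiers are dropped from the middle until the
    unfolded field fits in max_len.  The first identifier and the last
    two are never dropped.
    """
    ids = re.findall(r"<[^>]+>", parent_references or "")
    ids.append(parent_msg_id)

    def field_len(items):
        # Length of the unfolded field, including the field name.
        return len("References: " + " ".join(items))

    while field_len(ids) > max_len and len(ids) > 3:
        del ids[1]  # drop the oldest identifier after the thread root
    return " ".join(ids)


if __name__ == "__main__":
    refs = " ".join(f"<m{i}@example.net>" for i in range(10))
    print(build_references(refs, "<new@example.net>", max_len=60))
```

Once a chain has been trimmed this way, the dropped ancestors can only be recovered from other messages in the thread, if the archive has them.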
As a practical matter, I don't expect a sane coder to trim unless
necessary.  So there's a good chance that you can reconstruct a complete
thread in real life.  But "necessary" when your main disk was an 8"
floppy (or paper tape) seemed different from now, when my notebook PC has a couple
of TB SSDs.  That doesn't mean that the code has changed.
Heuristics are fine.  Guaranteeing that the resulting algorithm
terminates is important.  But real life doesn't get you to "enough
messages" 100% of the time.
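
As an illustration of "heuristic, but guaranteed to terminate", here is a much-simplified sketch of linking messages by walking each References chain.  It is not Zawinski's algorithm, only a toy in the same spirit: the name link_threads and the input shape are assumptions, the first link seen for a message wins, and any link that would create a loop is refused.

```python
import re


def link_threads(messages):
    """Map each msg-id to its parent msg-id.

    messages is an iterable of (msg_id, references_string) pairs.
    Identifiers that appear only in References show up as parents with
    no message of their own (placeholders for lost messages).
    """
    parent = {}

    def is_ancestor(candidate, node):
        """True if candidate is node itself or one of node's ancestors."""
        seen = set()
        while node is not None and node not in seen:
            if node == candidate:
                return True
            seen.add(node)
            node = parent.get(node)
        return False

    for msg_id, references in messages:
        chain = re.findall(r"<[^>]+>", references or "") + [msg_id]
        for older, newer in zip(chain, chain[1:]):
            if newer in parent:
                continue  # keep the first link we saw
            if is_ancestor(newer, older):
                continue  # linking would create a loop
            parent[newer] = older
    return parent


if __name__ == "__main__":
    msgs = [("<c@x>", "<a@x> <b@x>"), ("<b@x>", "<a@x>"), ("<a@x>", "")]
    print(link_threads(msgs))
```

Each message contributes a finite chain and every link is checked against the existing tree, so the loop ends on any finite input, even when headers contradict each other.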
...
...
See my other post.  The left and right halves can, besides being atoms,
be non-folding quotes or literals.  So you have to handle that.
Of course, but that's just a SMOP.  The harder problem is figuring out
what to do about non-conforming input.
And, as I've noted: lost input.
...
Although it has a defined syntax, the semantics are that message-id is
an opaque globally-unique identifier.  Attempting to parse it as
anything but '<[^>]+>'  is likely to be a mistake.
That's true in a world where people actually follow the rules.
When they don't, trying to guess at the semantics of an opaque ID will
seem to work for a while.  But it amounts to the halting problem.  There
are lots of variants of "generate a globally unique ID with an '@' in
the middle".  I may have created a new one today :-)
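
A small sketch of the "opaque string" approach, using exactly the '<[^>]+>' pattern mentioned above.  The helper name extract_msg_ids is hypothetical.  Note the known limitation: a quoted id-left containing '>' or a comment in the CFWS would confuse this pattern, so it is an approximation and not a full parser.

```python
import re

# Anything between '<' and the next '>' is taken as one opaque identifier.
MSG_ID_RE = re.compile(r"<[^>]+>")


def extract_msg_ids(field_value):
    """Return the msg-ids found in a Message-ID, In-Reply-To or References value."""
    return MSG_ID_RE.findall(field_value or "")


if __name__ == "__main__":
    print(extract_msg_ids(" <a@x>\t<b@y>\r\n <c@z> "))
    print(extract_msg_ids('<"two<three"@example.net>'))
    print(extract_msg_ids("<<a@x>>"))
```

The separators (spaces, tabs, folding) fall away on their own because only the bracketed tokens are kept.  A doubled "<<...>>" comes out as a token that matches no real message, which is the honest result.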
...
I doubt 'stripping' is worth doing - when a message-id is present,
I've almost always seen it include the required <>.
Since the delimiters are constants and required, consistently
stripping them doesn't hurt with a well-formed msg-id (the delimiters
aren't allowed in a msg-id).  And you do have to strip the whitespace,
because it's likely that different MUAs will do different things with
it since they edit the old value (space-separated, tab-separated).
The point is to get to the unique content.
Yes.  But if they're well-formed, the <> are there, so stripping them
only saves a couple of bytes.
If they're not, all bets are off.  <two<three@example.net> - stripping
the outer <>s doesn't help.
<"two<three"@example.net>  is a valid message-Id, and distinct from
<"twothree"@example.net>.  And is
<twothree@example.net> distinct from <"twothree"@example.net>?
(unnecessary quoting, or a distinct message-ID)  If you treat it as
opaque, you don't care.  Take the whole thing, <@> included, as given
and look for it in the other fields.  The most you might do is remove
quotes (and escapes) and use the left and right parts as your key. 
The whitespace is the [CFWS] on either side of the "<" id-left "@"
id-right ">" production in 2822.
That's where you must strip it to get the unique ID.
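
Here is a sketch of "take the whole thing as given and look for it", assuming the archive exposes the set of Message-IDs it already holds.  The name find_parent and the known_ids argument are hypothetical.  Identifiers are compared byte for byte, brackets included, so the quoted and unquoted forms above are simply different keys.

```python
import re


def find_parent(in_reply_to, references, known_ids):
    """Return the nearest referenced msg-id that is a known message, else None.

    In-Reply-To is tried first, then References from newest to oldest.
    Only surrounding whitespace is discarded; the identifier itself is
    never rewritten or "cleaned".
    """
    candidates = re.findall(r"<[^>]+>", in_reply_to or "")
    candidates += reversed(re.findall(r"<[^>]+>", references or ""))
    for msg_id in candidates:
        if msg_id in known_ids:
            return msg_id
    return None


if __name__ == "__main__":
    known = {"<a@x>", "<b@x>"}
    print(find_parent("<c@x>", "<a@x> <b@x> <c@x>", known))
```

A result of None means the message cannot be placed from its headers alone.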
...
It's also possible to keep the literal content (after stripping
whitespace) as well as the "cleaned" version, and use whichever one
corresponds to a real message or to a different value of References in
the same thread.  It would be amusing if *both* were associated with
real messages, but that seems unlikely.
You could also fingerprint the user agent (e.g. by the order of headers,
format of message-ID), and correct for its bugs.  But I'm inclined to
report client bugs and get them fixed.  Meantime, their messages are
unthreaded (but not lost).  It's just not worth working around other
people's bugs - there are better uses for your time.  Just ask the DNS
people.  They just had a Flag Day to get rid of years of built-up
workarounds in nameservers...
...
...
Until yesterday's hyperkitty bug, I've never seen <<>>.
At least one of my correspondents has a client that frequently
creates "addresses" of the form "<a@b.net <a@b.net>>".  I don't recall
seeing "<<...>>" before, but I rarely look at Message-IDs.
I've rarely seen the <> missing - but I'd consider that a warning
that the message format is suspect.  As in this case.
Agreed.
...
My suggestion is that if you can't find a message-ID,
A Message-ID field, or a valid msg-id in the References or In-Reply-To
fields?
I meant the latter.   It's the references that you need to reconstruct
the thread - to the extent possible.
...
...
keep the message in an "unthreaded" bucket.  If you sort that by
subject (omitting "[listtag]" and "Re:*", case insensitive and in
multiple languages), then by date, you probably have a useful
presentation.  And a list of MUAs that need bug reports :-)
The thing is that a message without a Message-ID (which I believe
should not happen in Mailman, Mailman will assign one before
distributing IIRC) is going to end up being a singleton thread.  If
there are replies to it, I don't see why they would be likely to be
temporally adjacent.  If they have proper replies, they will end up as
proper thread roots, not in the unthreaded bucket.  Am I missing
something?
Yes. A message ID should exist.  These days I think all MTAs will assign
one if the client doesn't.
What I tried to say was that there will be times where you can't figure
out where to put a message into the thread graph (or forest). Because it
references a message that you can't find, or it doesn't have a References
or In-Reply-To, or those fields are corrupt.
In those cases, put them in a bucket.  Call them orphans.
Most conversations come in bursts.  And if I reply to you, my reply will
(modulo time errors) be later than your post.
Further, there's a pretty good chance that my subject will be "Re: yours".
So, if that bucket is sorted by subject (primary key) and date
(secondary), there's a fair approximation that within the set of
orphans, they'll be near each other.  Which is better than nothing.
You're right that if my client generates good headers and yours doesn't,
yours will be orphaned.  But at least all your replies will share a
subject, so in that bucket will be your side of the conversation,
roughly in order.  And if you quote my text when replying, both sides.
If several people have clients that orphan their replies, they all will
cluster because of the subject.  And be in rough time order.
You do want to skip any [list] tag and [Re:]* because they don't change
the subject.
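
A rough sketch of the orphan bucket ordering, assuming each orphan is a (subject, date, msg_id) tuple.  The names subject_key and sort_orphans are hypothetical, and the reply prefixes listed (Re, AW, SV, Antw) are only an illustrative sample, not a complete list.  This version drops any leading bracketed tag, not just the list's own, which is a simplification.

```python
import re
from datetime import datetime

# Leading "[tag]" blocks and reply prefixes, in any order and any case.
PREFIX_RE = re.compile(
    r"^\s*(?:\[[^\]]*\]\s*|(?:re|aw|sv|antw)\s*:\s*)+",
    re.IGNORECASE,
)


def subject_key(subject):
    """Normalize a subject so replies sort next to the original."""
    return PREFIX_RE.sub("", subject or "").strip().casefold()


def sort_orphans(orphans):
    """Sort (subject, date, msg_id) tuples by subject, then by date."""
    return sorted(orphans, key=lambda m: (subject_key(m[0]), m[1]))


if __name__ == "__main__":
    bucket = [
        ("Re: cats rule", datetime(2019, 2, 14, 10), "<3@x>"),
        ("[list] Dogs", datetime(2019, 2, 13), "<2@x>"),
        ("cats rule", datetime(2019, 2, 14, 9), "<1@x>"),
    ]
    for subject, date, msg_id in sort_orphans(bucket):
        print(date, msg_id, subject)
```

The key does nothing clever about renamed subjects, so a "(was ...)" rename still lands elsewhere in the bucket.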
These are heuristics - you might come up with better ones.  The idea is
that they don't rely on or try to fix any particular fault in the
client's posts.  Either a message conforms, or it doesn't.  If it does,
it gets threaded as everyone expects.  If it doesn't, it's an orphan,
and there's one place to find it.  And with a modest effort, this should
keep related ones close to each other.  It's better than just keeping
the orphans sorted by date - the subject tends to be stable.  But it's
not perfect.  E.g. the not-uncommon 'Re: dogs are better (was cats rule)'
won't keep the thread together.
But it seems better than nothing, without endless improvement.  Once the
afflicted users learn that messages they post are put into the Orphans
thread, they can complain to their clients' developers.  Meantime, life
should be tolerable - but annoying enough for them to keep the pressure
on for a fix :-)
...
If it's an unusable munged reference in References, the munged
reference may be visible as a placeholder (no real message) in a
separate subthread, or pruned (and invisible) because no real message
is indicated by it.  The indicated message will be in a separate
subthread if it has a valid References (including no References, in
which case it will be a thread root).  And, of course, if it's not the
immediate parent, other messages' References fields will likely allow
the message to be threaded correctly.  AFAICS "stripping and cleaning"
an invalid msg-id is highly unlikely to duplicate a valid msg-id
associated with a different message, although there's a good chance it
won't allow identification of any message at all -- which is where we
started.
Yes, and that's why I think it's wasted effort.  Opaque means "Opaque". 
If you don't have one that conforms, use something else, and use the
imperfections to get the bug(s) fixed.
At least with an orphans bucket, you don't end up with "invisible"
messages.  They're just not where you expected.  Like the SPAM folder
:-) You know where to look, and you know that anything in there means
there's a bug to fix.


tlhackque