On 5/5/20 2:39 PM, Mark Dadgar wrote:
On May 5, 2020, at 2:27 PM, Mark Sapiro <mark@msapiro.net> wrote:
This is how less displays it
XSUBJECT9<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>5<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>2<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>5<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>1<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>4<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>1<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>3<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
Each of those <U+200B> represents a unicode zero width space.
That’s gnarly.
BTW, the patch did not fix it. I haven’t had a chance to look more deeply into it yet.
Actually, if I take the above as a unicode string and encode it as utf-8, each <U+200B> becomes the three bytes \xe2\x80\x8b and the length of the result is 247 bytes, so I suspect that the 'word' in the error message has been truncated.
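For anyone who wants to check the arithmetic, here is a minimal Python 3 sketch of the character count versus the utf-8 byte count. The sample is only the first fragment of the word shown above, and utf8_length is a helper name made up for this illustration.

```python
# U+200B ZERO WIDTH SPACE; invisible in most displays.
ZWSP = "\u200b"


def utf8_length(text):
    """Return the number of bytes text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


# First fragment of the word above: "XSUBJECT9", seven zero width spaces, "5".
sample = "XSUBJECT9" + ZWSP * 7 + "5"

print(ZWSP.encode("utf-8"))   # b'\xe2\x80\x8b'
print(len(sample))            # 17 characters
print(utf8_length(sample))    # 31 bytes: 10 ASCII + 7 * 3
```

Note that len() on a str counts characters, so if the limit being worked around is in bytes, the check has to look at the encoded length.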
As to the patch not working, there is a different patch at <https://github.com/alexsilva/xapian-haystack/commit/a53523d2d0d13929a0729d48...> which may work.
There is one issue with this patch. Namely it calls force_str() in the section
# https://trac.xapian.org/wiki/FAQ/UniqueIds#Workingroundthetermlengthlimit
# Working round the term length limit
word = force_str(word)
word_length = len(word)
but xapian_backend.py imports force_text <https://github.com/notanumber/xapian-haystack/blob/master/xapian_backend.py#...>. They are actually synonymous <https://docs.djangoproject.com/en/3.0/ref/utils/#module-django.utils.encodin...>, but just applying the patch without changing force_str to force_text won't work.
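If you'd rather not edit the patch by hand, one option is to resolve whichever name the installed Django provides. This is only a sketch and not what either patch does; resolve_force_str is a hypothetical helper, and the plain-Python stand-in is there only so the snippet runs where Django isn't installed.

```python
import importlib


def resolve_force_str():
    """Return Django's force_str, or force_text where that is the only name."""
    try:
        mod = importlib.import_module("django.utils.encoding")
    except ImportError:
        # Assumption for this sketch only: Django is not available, so use a
        # rough stand-in that decodes bytes and str()s everything else.
        def stand_in(value, encoding="utf-8"):
            if isinstance(value, bytes):
                return value.decode(encoding)
            return str(value)
        return stand_in
    func = getattr(mod, "force_str", None)
    if func is None:
        func = getattr(mod, "force_text")
    return func


force_str = resolve_force_str()

# A short byte string containing one utf-8 encoded zero width space.
word = force_str(b"XSUBJECT9\xe2\x80\x8b5")
print(len(word))  # 11 characters
```

With force_str bound this way in xapian_backend.py, the patched lines could stay as written.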
-- 
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan