On 6/9/20 9:29 PM, Mark Dadgar wrote:
On Jun 9, 2020, at 1:57 PM, Mark Sapiro <mark@msapiro.net> wrote:
THANK YOU for all the help.
I'm glad we finally got it resolved.
I may have spoken too soon:
[ERROR/MainProcess] Failed indexing 1 - 1000 (retry 5/5): Term too long (> 245): XSUBJECT95251413 (pid 729081): Term too long (> 245): XSUBJECT95251413
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/haystack/management/commands/update_index.py", line 97, in do_update
    backend.update(index, current_qs, commit=commit)
  File "/usr/local/lib/python3.8/dist-packages/xapian_backend.py", line 492, in update
    database.replace_document(document_id, document)
xapian.InvalidArgumentError: Term too long (> 245): XSUBJECT95251413

"XSUBJECT95251413" has some serious weirdness embedded in it.
Not sure how to search for that in the django shell?
I suspect the issue here is the 'weirdness' is in the Subject: rather than the message body. The script we were using only looked at the body.
The message you sent me had a Subject: which visually looked like

    [PDC] MEGAMillions- Ticket №7383246 9

but the № and the digits were interspersed with strings of Unicode zero width spaces, so the decoded Subject: was actually

    '[PDC] MEGAMillions- Ticket №7\u200b\u200b\u200b\u200b\u200b\u200b\u200b3\u200b\u200b\u200b\u200b\u200b\u200b\u200b8\u200b\u200b\u200b\u200b\u200b\u200b\u200b3\u200b\u200b\u200b\u200b\u200b\u200b\u200b2\u200b\u200b\u200b\u200b\u200b\u200b\u200b4\u200b\u200b\u200b\u200b\u200b\u200b\u200b6\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b9\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b'

where all the \u200b escapes are the zero width spaces.
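To make the effect concrete, here is a small standalone sketch (not part of the script from this thread) that counts the zero width spaces in a subject and shows what remains once they are removed. The sample string and the helper name describe_subject are made up for illustration. Each U+200B takes three bytes in UTF-8, so long runs add up quickly against a limit like the 245 in the error above, which as far as I know Xapian counts in bytes.

```python
ZWSP = "\u200b"  # Unicode ZERO WIDTH SPACE


def describe_subject(subject):
    """Return (zero width space count, cleaned text, UTF-8 byte length)."""
    count = subject.count(ZWSP)
    cleaned = subject.replace(ZWSP, "")
    nbytes = len(subject.encode("utf-8"))
    return count, cleaned, nbytes


# Hypothetical, shortened sample in the same style as the Subject: above.
sample = "[PDC] Ticket \u21167" + ZWSP * 7 + "3" + ZWSP * 7 + "8"

count, cleaned, nbytes = describe_subject(sample)
print(count, repr(cleaned), nbytes)
```

The cleaned text is what a reader sees on screen; the byte length is closer to what an indexer has to deal with.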
I suspect your current issue is with the Subject: header. Actually, looking back at the thread, the prior one probably was too, but we also found messages that had this in the body.
To alter the specific script to look at the Subject:, just change the line

    if email.content.find('\u200b'*5) >= 0:

to

    if email.subject.find('\u200b'*5) >= 0:
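If you would rather catch both cases in one pass, something along these lines might work. This is only a sketch: has_zwsp_run and is_suspect are hypothetical helper names, and the SimpleNamespace objects are stand-ins so it runs outside the Django shell. The only thing it assumes about the real objects is the .subject and .content attributes used in the lines above.

```python
from types import SimpleNamespace

ZWSP = "\u200b"  # Unicode ZERO WIDTH SPACE


def has_zwsp_run(text, run=5):
    """True if text contains at least `run` consecutive zero width spaces."""
    # `text or ""` guards against a missing (None) subject or body.
    return (text or "").find(ZWSP * run) >= 0


def is_suspect(email, run=5):
    # Same test as the one-line check, applied to both Subject: and body.
    return has_zwsp_run(email.subject, run) or has_zwsp_run(email.content, run)


# Stand-in messages (made up for illustration, not real list data).
emails = [
    SimpleNamespace(subject="Ticket 7" + ZWSP * 7 + "3", content="plain body"),
    SimpleNamespace(subject="Re: Ticket", content="quoted " + ZWSP * 6 + " text"),
    SimpleNamespace(subject="Hello", content="nothing odd here"),
]

suspects = [e for e in emails if is_suspect(e)]
print(len(suspects))
```

In the real script you would call is_suspect(email) in place of the single find() test and leave the rest of the loop alone.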
There's probably a reply there somewhere that has the Subject: but didn't quote the bad stuff in the body.
-- 
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan