Weird hourly indexing error
I moved my mail server to a Digital Ocean droplet. I basically installed all the appropriate packages and then rsync’d everything that matters from my metal server up to the cloud instance. Everything more or less works fine.
However, I am now getting this hourly error - see below.
It is ENTIRELY possible that I was getting this error on the previous server as well, but I didn’t have www-data aliased to root so I would never have seen it.
Anyone know what’s going on here?
BTW, Digital Ocean is pretty awesome. If anyone wants to try it, ping me for a referral code good for $100 in credit to play with.
- Mark
mark@pdc-racing.net | 408-348-2878
Begin forwarded message:
From: root@pdc-racing.net (Cron Daemon)
Subject: Cron <www-data@mail> [ -f /usr/bin/django-admin ] && flock -n /var/run/mailman3-web/cron.hourly /usr/share/mailman3-web/manage.py runjobs hourly
Date: May 5, 2020 at 5:01:00 AM PDT
To: www-data@pdc-racing.net
[ERROR/MainProcess] Failed indexing 1 - 1000 (retry 5/5): Term too long (> 245): XSUBJECT95251413 (pid 36301): Term too long (> 245): XSUBJECT95251413
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/haystack/management/commands/update_index.py", line 97, in do_update
    backend.update(index, current_qs, commit=commit)
  File "/usr/local/lib/python3.8/dist-packages/xapian_backend.py", line 488, in update
    database.replace_document(document_id, document)
xapian.InvalidArgumentError: Term too long (> 245): XSUBJECT95251413
On 5/5/20 8:55 AM, Mark Dadgar wrote:
I moved my mail server to a Digital Ocean droplet. I basically installed all the appropriate packages and then rsync’d everything that matters from my metal server up to the cloud instance. Everything more or less works fine.
However, I am now getting this hourly error - see below.
...
[ERROR/MainProcess] Failed indexing 1 - 1000 (retry 5/5): Term too long (> 245): XSUBJECT95251413 (pid 36301): Term too long (> 245): XSUBJECT95251413
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/haystack/management/commands/update_index.py", line 97, in do_update
    backend.update(index, current_qs, commit=commit)
  File "/usr/local/lib/python3.8/dist-packages/xapian_backend.py", line 488, in update
    database.replace_document(document_id, document)
xapian.InvalidArgumentError: Term too long (> 245): XSUBJECT95251413
This error comes from Xapian. It is due to a (new) message in your archive that contains a 'word' longer than 245 characters.
There is a PR at <https://github.com/notanumber/xapian-haystack/pull/181> that addresses this plus another suggested fix in that PR's comment thread.
Note: I got there by googling "xapian term too long" which gave <https://github.com/notanumber/xapian-haystack/issues/153> as the first hit and that led me to the above.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
On May 5, 2020, at 9:21 AM, Mark Sapiro <mark@msapiro.net> wrote:
On 5/5/20 8:55 AM, Mark Dadgar wrote:
I moved my mail server to a Digital Ocean droplet. I basically installed all the appropriate packages and then rsync’d everything that matters from my metal server up to the cloud instance. Everything more or less works fine.
However, I am now getting this hourly error - see below.
...
[ERROR/MainProcess] Failed indexing 1 - 1000 (retry 5/5): Term too long (> 245): XSUBJECT95251413 (pid 36301): Term too long (> 245): XSUBJECT95251413
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/haystack/management/commands/update_index.py", line 97, in do_update
    backend.update(index, current_qs, commit=commit)
  File "/usr/local/lib/python3.8/dist-packages/xapian_backend.py", line 488, in update
    database.replace_document(document_id, document)
xapian.InvalidArgumentError: Term too long (> 245): XSUBJECT95251413
This error comes from Xapian. It is due to a (new) message in your archive that contains a 'word' longer than 245 characters.
There is a PR at <https://github.com/notanumber/xapian-haystack/pull/181> that addresses this plus another suggested fix in that PR's comment thread.
Thanks. I guess I wait for them to roll the fix in and the ubuntu maintainers to pick it up.
Note: I got there by googling "xapian term too long" which gave <https://github.com/notanumber/xapian-haystack/issues/153> as the first hit and that led me to the above.
Yeah, that would have been smarter.
- Mark
mark@pdc-racing.net | 408-348-2878
On 5/5/20 9:33 AM, Mark Dadgar wrote:
Thanks. I guess I wait for them to roll the fix in and the ubuntu maintainers to pick it up.
You don't want to wait because I think currently your search index is not being updated due to the exception. I think you should just apply the patch at <https://patch-diff.githubusercontent.com/raw/notanumber/xapian-haystack/pull...>. I installed it in a test instance and it seems fine.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
On May 5, 2020, at 10:33 AM, Mark Sapiro <mark@msapiro.net> wrote:
On 5/5/20 9:33 AM, Mark Dadgar wrote:
Thanks. I guess I wait for them to roll the fix in and the ubuntu maintainers to pick it up.
You don't want to wait because I think currently your search index is not being updated due to the exception. I think you should just apply the patch at <https://patch-diff.githubusercontent.com/raw/notanumber/xapian-haystack/pull...>. I installed it in a test instance and it seems fine.
Good point. I just installed it. We will see.
Thank you.
- Mark
mark@pdc-racing.net | 408-348-2878
On 5/5/20 8:55 AM, Mark Dadgar wrote:
[ERROR/MainProcess] Failed indexing 1 - 1000 (retry 5/5): Term too long (> 245): XSUBJECT95251413
It isn't clear from looking at it and I never would have noticed had I not happened to look at the raw email with less, but the XSUBJECT95251413 in the above is the entire long word. This is how less displays it
XSUBJECT9<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>5<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>2<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>5<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>1<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>4<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>1<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>3<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
Each of those <U+200B> represents a unicode zero width space.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
On May 5, 2020, at 2:27 PM, Mark Sapiro <mark@msapiro.net> wrote:
On 5/5/20 8:55 AM, Mark Dadgar wrote:
[ERROR/MainProcess] Failed indexing 1 - 1000 (retry 5/5): Term too long (> 245): XSUBJECT95251413
It isn't clear from looking at it and I never would have noticed had I not happened to look at the raw email with less, but the XSUBJECT95251413 in the above is the entire long word. This is how less displays it
XSUBJECT9<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>5<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>2<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>5<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>1<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>4<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>1<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>3<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
Each of those <U+200B> represents a unicode zero width space.
That’s gnarly.
BTW, the patch did not fix it. I haven’t had a chance to look more deeply into it yet.
- Mark
mark@pdc-racing.net | 408-348-2878
On 5/5/20 2:39 PM, Mark Dadgar wrote:
On May 5, 2020, at 2:27 PM, Mark Sapiro <mark@msapiro.net> wrote:
This is how less displays it
XSUBJECT9<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>5<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>2<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>5<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>1<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>4<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>1<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>3<U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B><U+200B>
Each of those <U+200B> represents a unicode zero width space.
That’s gnarly.
BTW, the patch did not fix it. I haven’t had a chance to look more deeply into it yet.
Actually, if I take the above as a Unicode string and encode it as UTF-8, each of the <U+200B> code points becomes three hex bytes \xe2\x80\x8b and the length of the result is 247 bytes, so I suspect that the 'word' in the error message has been truncated.
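For anyone following along, the byte count can be checked with a few lines of Python. This is only a sketch: the run lengths of zero-width spaces after each digit are read off the less output quoted earlier and should be treated as an assumption, not as data taken from the actual message.

```python
ZWSP = "\u200b"  # Unicode zero-width space

# Digits of the term and the number of zero-width spaces after each one.
# The run lengths are an assumption read off the `less` output in the thread.
digits = "95251413"
runs = [7, 7, 7, 7, 7, 7, 21, 14]

term = "XSUBJECT" + "".join(d + ZWSP * n for d, n in zip(digits, runs))

print(ZWSP.encode("utf-8"))        # b'\xe2\x80\x8b'
print(term.replace(ZWSP, ""))      # XSUBJECT95251413
print(len(term))                   # 93 characters
print(len(term.encode("utf-8")))   # 247 bytes
```

With those run lengths there are 16 visible ASCII characters plus 77 zero-width spaces at three bytes each, which gives the 247 bytes mentioned above and explains why a term that looks 16 characters long exceeds the 245 limit.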
As to the patch not working, there is a different patch at <https://github.com/alexsilva/xapian-haystack/commit/a53523d2d0d13929a0729d48...> which may work.
There is one issue with this patch. Namely it calls force_str() in the section
# https://trac.xapian.org/wiki/FAQ/UniqueIds#Workingroundthetermlengthlimit
# Working round the term length limit
word = force_str(word)
word_length = len(word)
but xapian_backend.py imports force_text <https://github.com/notanumber/xapian-haystack/blob/master/xapian_backend.py#...>. They are actually synonymous <https://docs.djangoproject.com/en/3.0/ref/utils/#module-django.utils.encodin...>, but just applying the patch without changing force_str to force_text won't work.
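As an illustration of the general idea of working around a term length limit, here is a minimal, self-contained sketch. It is not the code from either patch, and shorten_term is a hypothetical helper; it shows one common approach, which is to cut an over-long term and append a hash of the full term so that different long terms still map to different index terms. The 245 limit is taken from the error message.

```python
import hashlib

MAX_TERM_BYTES = 245  # limit quoted in the error message


def shorten_term(term, limit=MAX_TERM_BYTES):
    """Return UTF-8 bytes for `term`, never longer than `limit` bytes.

    Hypothetical helper: over-long terms are cut and a SHA-1 hex digest of
    the full term is appended so distinct long terms stay distinct.
    """
    data = term.encode("utf-8")
    if len(data) <= limit:
        return data
    digest = hashlib.sha1(data).hexdigest().encode("ascii")  # 40 bytes
    # Cutting bytes can split a multi-byte character; drop any partial tail.
    head = data[: limit - len(digest)].decode("utf-8", errors="ignore")
    return head.encode("utf-8") + digest


long_term = "XSUBJECT" + "9" + "\u200b" * 100
print(len(long_term.encode("utf-8")))  # 309
print(len(shorten_term(long_term)))    # at most 245
```

Note that the length check has to be done on the encoded bytes, not on the number of characters, since each zero-width space counts as three bytes.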
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
On May 5, 2020, at 4:01 PM, Mark Sapiro <mark@msapiro.net> wrote:
As to the patch not working, there is a different patch at <https://github.com/alexsilva/xapian-haystack/commit/a53523d2d0d13929a0729d48...> which may work.
There is one issue with this patch. Namely it calls force_str() in the section
# https://trac.xapian.org/wiki/FAQ/UniqueIds#Workingroundthetermlengthlimit
# Working round the term length limit
word = force_str(word)
word_length = len(word)
but xapian_backend.py imports force_text <https://github.com/notanumber/xapian-haystack/blob/master/xapian_backend.py#...>. They are actually synonymous <https://docs.djangoproject.com/en/3.0/ref/utils/#module-django.utils.encodin...>, but just applying the patch without changing force_str to force_text won't work.
The second patch did not work either. Ahh, well.
- Mark
mark@pdc-racing.net | 408-348-2878
On 5/12/20 3:02 PM, Mark Dadgar wrote:
The second patch did not work either. Ahh, well.
I think you need to do something like this:
$ mm/bin/django-admin shell
Python 3.7.1 (default, Dec 14 2018, 13:17:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from hyperkitty.models import Email
>>> emails = Email.objects.all()
>>> for email in emails:
...     if email.content.find('\u200b'*5) >= 0:
...         print('Found in ml {}, hash {}'.format(
...             email.mailinglist.name,
...             email.message_id_hash))
...
This will print the mailing list and message id hash for messages found that contain at least 5 U+200B Unicodes in a row. You may need to adjust the '\u200b'*5 if it doesn't find anything.
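If guessing the right multiple of '\u200b' is a nuisance, a regular expression can report the longest run directly. The sketch below is self-contained and uses a made-up sample string; longest_zwsp_run is a hypothetical helper name, not part of HyperKitty.

```python
import re

ZWSP_RUN = re.compile("\u200b+")  # one or more consecutive zero-width spaces


def longest_zwsp_run(text):
    """Length of the longest run of consecutive U+200B characters in text."""
    return max((len(m.group()) for m in ZWSP_RUN.finditer(text)), default=0)


sample = "Ticket 7" + "\u200b" * 7 + "3" + "\u200b" * 3
print(longest_zwsp_run(sample))        # 7
print(longest_zwsp_run("plain text"))  # 0
```

In the Django shell session above, the test inside the loop could then become something like `if longest_zwsp_run(email.content) >= 2:`, with the threshold chosen to taste.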
Once you know the list and hash you can go to something like https://example.com/hyperkitty/lists/LIST_NAME/message/HASH/ to see the message in the archive and delete it.
Before deleting it, you might want to 'download' it. I'd be interested in seeing that to see if I can understand the issue better and why the patches don't work to solve it.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
On May 12, 2020, at 3:46 PM, Mark Sapiro <mark@msapiro.net> wrote:
On 5/12/20 3:02 PM, Mark Dadgar wrote:
The second patch did not work either. Ahh, well.
I think you need to do something like this:
$ mm/bin/django-admin shell
Python 3.7.1 (default, Dec 14 2018, 13:17:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from hyperkitty.models import Email
>>> emails = Email.objects.all()
>>> for email in emails:
...     if email.content.find('\u200b'*5) >= 0:
...         print('Found in ml {}, hash {}'.format(
...             email.mailinglist.name,
...             email.message_id_hash))
...
Finally getting around to looking into this now that I’ve migrated from sqlite3 to postgres (writeup on that coming soon for the archives).
Here’s what I get when I attempt the above commands:
root@mail:/usr/lib/mailman3/bin# sudo -u list django-admin shell
Python 3.8.2 (default, Apr 27 2020, 15:53:34)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from hyperkitty.models import Email
Traceback (most recent call last):
  File "/usr/lib/python3.8/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<console>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/hyperkitty/models/__init__.py", line 25, in <module>
    from .category import ThreadCategory
  File "/usr/lib/python3/dist-packages/hyperkitty/models/category.py", line 61, in <module>
    class ThreadCategory(models.Model):
  File "/usr/lib/python3/dist-packages/django/db/models/base.py", line 103, in __new__
    app_config = apps.get_containing_app_config(module)
  File "/usr/lib/python3/dist-packages/django/apps/registry.py", line 252, in get_containing_app_config
    self.check_apps_ready()
  File "/usr/lib/python3/dist-packages/django/apps/registry.py", line 134, in check_apps_ready
    settings.INSTALLED_APPS
  File "/usr/lib/python3/dist-packages/django/conf/__init__.py", line 79, in __getattr__
    self._setup(name)
  File "/usr/lib/python3/dist-packages/django/conf/__init__.py", line 60, in _setup
    raise ImproperlyConfigured(
django.core.exceptions.ImproperlyConfigured: Requested setting INSTALLED_APPS, but settings are not configured. You must either define the environment variable DJANGO_SETTINGS_MODULE or call settings.configure() before accessing settings.
I am in no way a django expert.
Thoughts?
- Mark
mark@pdc-racing.net | 408-348-2878
On 6/9/20 10:49 AM, Mark Dadgar wrote:
Here’s what I get when I attempt the above commands:
root@mail:/usr/lib/mailman3/bin# sudo -u list django-admin shell
Python 3.8.2 (default, Apr 27 2020, 15:53:34)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from hyperkitty.models import Email
Traceback (most recent call last):
...
  File "/usr/lib/python3/dist-packages/django/conf/__init__.py", line 60, in _setup
    raise ImproperlyConfigured(
django.core.exceptions.ImproperlyConfigured: Requested setting INSTALLED_APPS, but settings are not configured. You must either define the environment variable DJANGO_SETTINGS_MODULE or call settings.configure() before accessing settings.
The django-admin command that is being run doesn't itself point to your settings. In my case, I run django-admin via a wrapper that contains
#!/bin/bash
. /opt/mailman/mm/venv/bin/activate
cd /opt/mailman/mm
export PYTHONPATH=/opt/mailman/mm
export DJANGO_SETTINGS_MODULE=settings
django-admin $@
The critical things are the exports of PYTHONPATH and DJANGO_SETTINGS_MODULE. The former is the path to where your settings module lives and the latter is its name without the .py extension.
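The same two settings can also be applied from inside Python, which may help when experimenting without a wrapper script. This is a sketch under assumptions: point_at_settings is a hypothetical helper, and /opt/mailman/mm is simply the directory used in the wrapper above; substitute wherever your own settings.py lives.

```python
import os
import sys


def point_at_settings(settings_dir, settings_module="settings"):
    """Mimic the wrapper's two exports for the current Python process."""
    if settings_dir not in sys.path:
        sys.path.insert(0, settings_dir)  # roughly what PYTHONPATH does
    # Keep an existing value if the environment already defines one.
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", settings_module)


point_at_settings("/opt/mailman/mm")
print(os.environ["DJANGO_SETTINGS_MODULE"])

# With Django installed, the next step would be:
#   import django
#   django.setup()
# after which `from hyperkitty.models import Email` should import cleanly.
```

Either way, the ImproperlyConfigured error goes away only when Python can both find the settings module on its path and knows its name.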
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
On Jun 9, 2020, at 11:04 AM, Mark Sapiro <mark@msapiro.net> wrote:
The django-admin command that is being run doesn't itself point to your settings. In my case, I run django-admin via a wrapper that contains
#!/bin/bash
. /opt/mailman/mm/venv/bin/activate
cd /opt/mailman/mm
export PYTHONPATH=/opt/mailman/mm
export DJANGO_SETTINGS_MODULE=settings
django-admin $@
The critical things are the exports of PYTHONPATH and DJANGO_SETTINGS_MODULE. The former is the path to where your settings module lives and the latter is its name without the .py extension.
OK, making progress:
mark@mail:/usr/local/bin$ !!
./mm3-admin shell
Python 3.8.2 (default, Apr 27 2020, 15:53:34)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from hyperkitty.models import Email
>>> emails = Email.objects.all()
>>> for email in emails:
...     if email.content.find('\u200b'*5) >= 0:
...         print('Found in ml {}, hash {}'.format(
...             email.mailinglist.name,
...             email.message_id_hash))
...
./mm3-admin: line 6: 702058 Killed django-admin $@
Now what?
- Mark
mark@pdc-racing.net | 408-348-2878
On 6/9/20 11:29 AM, Mark Dadgar wrote:
OK, making progress:
mark@mail:/usr/local/bin$ !!
./mm3-admin shell
Python 3.8.2 (default, Apr 27 2020, 15:53:34)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from hyperkitty.models import Email
>>> emails = Email.objects.all()
>>> for email in emails:
...     if email.content.find('\u200b'*5) >= 0:
...         print('Found in ml {}, hash {}'.format(
...             email.mailinglist.name,
...             email.message_id_hash))
...
./mm3-admin: line 6: 702058 Killed django-admin $@
Now what?
Run it as user list.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
On Jun 9, 2020, at 11:34 AM, Mark Sapiro <mark@msapiro.net> wrote:
./mm3-admin: line 6: 702058 Killed django-admin $@
Now what?
Run it as user list.
Same.
root@mail:/usr/local/bin# sudo -u list ./mm3-admin shell
Python 3.8.2 (default, Apr 27 2020, 15:53:34)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from hyperkitty.models import Email
>>> emails = Email.objects.all()
>>> for email in emails:
...     if email.content.find('\u200b'*5) >= 0:
...         print('Found in ml {}, hash {}'.format(
...             email.mailinglist.name,
...             email.message_id_hash))
...
./mm3-admin: line 6: 704575 Killed django-admin $@
- Mark
mark@pdc-racing.net | 408-348-2878
On 6/9/20 11:57 AM, Mark Dadgar wrote:
On Jun 9, 2020, at 11:34 AM, Mark Sapiro <mark@msapiro.net> wrote:
./mm3-admin: line 6: 702058 Killed django-admin $@
Now what?
Run it as user list.
Same.
It appears that the reference to line 6 is line 6 of your mm3-admin file, which is the django-admin $@ command. So that command is being killed by the OS for some reason, probably out of memory.
Try this instead.
import pytz
import datetime
start = datetime.datetime(2020, 1, 1).replace(tzinfo=pytz.UTC)
end = datetime.datetime(2020, 2, 1).replace(tzinfo=pytz.UTC)
# These are examples. The args are year, month, day.
# Pick dates for start and end that overlap the start of this issue.
from hyperkitty.models import Email
emails = Email.objects.filter(date__range=(start,end))
for email in emails:
    if email.content.find('\u200b'*5) >= 0:
        print('Found in ml {}, hash {}'.format(
            email.mailinglist.name,
            email.message_id_hash))
This will make a much smaller query set than all the emails and should work.
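When it isn't obvious which month the problem started in, the same date-range idea can be applied one calendar month at a time so that no single query set gets large. The sketch below only generates the (start, end) pairs; month_windows is a hypothetical helper, and it uses the standard library's datetime.timezone.utc in place of pytz.UTC, which should behave the same for this purpose.

```python
import datetime

UTC = datetime.timezone.utc


def month_windows(first_year, first_month, last_year, last_month):
    """Yield (start, end) pairs, one per calendar month, as aware datetimes."""
    year, month = first_year, first_month
    while (year, month) <= (last_year, last_month):
        start = datetime.datetime(year, month, 1, tzinfo=UTC)
        # Step to the first day of the following month.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
        end = datetime.datetime(year, month, 1, tzinfo=UTC)
        yield start, end


windows = list(month_windows(2019, 11, 2020, 2))
print(len(windows))  # 4
print(windows[0])
```

In the Django shell, each pair could be fed to `Email.objects.filter(date__range=(start, end))` in a loop around the existing check. Django's `QuerySet.iterator()` may be another way to keep memory use down, though it was not tried in this thread.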
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
On Jun 9, 2020, at 1:23 PM, Mark Sapiro <mark@msapiro.net> wrote:
On Jun 9, 2020, at 11:34 AM, Mark Sapiro <mark@msapiro.net> wrote:
./mm3-admin: line 6: 702058 Killed django-admin $@
Now what?
Run it as user list.
Same.
It appears that the reference to line 6 is line 6 of your mm3-admin file, which is the django-admin $@ command. So that command is being killed by the OS for some reason, probably out of memory.
Try this instead.
import pytz
import datetime
start = datetime.datetime(2020, 1, 1).replace(tzinfo=pytz.UTC)
end = datetime.datetime(2020, 2, 1).replace(tzinfo=pytz.UTC)
# These are examples. The args are year, month, day.
# Pick dates for start and end that overlap the start of this issue.
from hyperkitty.models import Email
emails = Email.objects.filter(date__range=(start,end))
for email in emails:
    if email.content.find('\u200b'*5) >= 0:
        print('Found in ml {}, hash {}'.format(
            email.mailinglist.name,
            email.message_id_hash))
This will make a much smaller query set than all the emails and should work.
I was just writing another email - I had a suspicion on this and I was right: this is the Linux OOM killer in action.
Jun 9 13:15:32 mail kernel: [1051406.859205] Out of memory: Killed process 712833 (python3) total-vm:2351156kB, anon-rss:2274112kB, file-rss:2236kB, shmem-rss:4kB, UID:38 pgtables:4616kB oom_score_adj:0
Jun 9 13:15:32 mail kernel: [1051407.042975] oom_reaper: reaped process 712833 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:4kB
I set up a quick swapfile (there are 581,000 emails in this archive) and it came back with some hashes. I am tracking them down now.
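To put the kernel message in perspective, the kB figures can be pulled out of the log line and converted. This is a small self-contained sketch that parses the line exactly as quoted above; the regular expression is an ad hoc assumption about this message format, not a general syslog parser.

```python
import re

line = ("Jun 9 13:15:32 mail kernel: [1051406.859205] Out of memory: "
        "Killed process 712833 (python3) total-vm:2351156kB, "
        "anon-rss:2274112kB, file-rss:2236kB, shmem-rss:4kB, UID:38 "
        "pgtables:4616kB oom_score_adj:0")

# Collect every "name:<number>kB" field into a dict of integers.
fields = {k: int(v) for k, v in re.findall(r"([\w-]+):(\d+)kB", line)}

print(fields["anon-rss"])                        # 2274112
print(round(fields["anon-rss"] / 1024 ** 2, 2))  # 2.17 (GiB)
```

So the Python process was holding a little over 2 GiB of anonymous memory when it was killed, which fits with the whole query set being loaded at once.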
- Mark
mark@pdc-racing.net | 408-348-2878
On Jun 9, 2020, at 1:26 PM, Mark Dadgar <mark@pdc-racing.net> wrote:
I set up a quick swapfile (there are 581,000 emails in this archive) and it came back with some hashes. I am tracking them down now.
It turns out that the offending email was actually a spam email that snuck through the filters somehow and then a couple of smart-ass list members replied to it a couple of times to propagate the goodness.
I downloaded a copy (Mark, I will send it to you directly) and then deleted the thread.
THANK YOU for all the help.
- Mark
mark@pdc-racing.net | 408-348-2878
On 6/9/20 1:33 PM, Mark Dadgar wrote:
It turns out that the offending email was actually a spam email that snuck through the filters somehow and then a couple of smart-ass list members replied to it a couple of times to propagate the goodness.
I downloaded a copy (Mark, I will send it to you directly) and then deleted the thread.
Thanks. I'll see about defending against this.
THANK YOU for all the help.
I'm glad we finally got it resolved.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
On Jun 9, 2020, at 1:57 PM, Mark Sapiro <mark@msapiro.net> wrote:
THANK YOU for all the help.
I'm glad we finally got it resolved.
I may have spoken too soon:
[ERROR/MainProcess] Failed indexing 1 - 1000 (retry 5/5): Term too long (> 245): XSUBJECT95251413 (pid 729081): Term too long (> 245): XSUBJECT95251413
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/haystack/management/commands/update_index.py", line 97, in do_update
    backend.update(index, current_qs, commit=commit)
  File "/usr/local/lib/python3.8/dist-packages/xapian_backend.py", line 492, in update
    database.replace_document(document_id, document)
xapian.InvalidArgumentError: Term too long (> 245): XSUBJECT95251413
"XSUBJECT95251413" has some serious weirdness embedded in it.
Not sure how to search for that in the django shell?
- Mark
mark@pdc-racing.net | 408-348-2878
On 6/9/20 9:29 PM, Mark Dadgar wrote:
On Jun 9, 2020, at 1:57 PM, Mark Sapiro <mark@msapiro.net> wrote:
THANK YOU for all the help.
I'm glad we finally got it resolved.
I may have spoken too soon:
[ERROR/MainProcess] Failed indexing 1 - 1000 (retry 5/5): Term too long (> 245): XSUBJECT95251413 (pid 729081): Term too long (> 245): XSUBJECT95251413
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/haystack/management/commands/update_index.py", line 97, in do_update
    backend.update(index, current_qs, commit=commit)
  File "/usr/local/lib/python3.8/dist-packages/xapian_backend.py", line 492, in update
    database.replace_document(document_id, document)
xapian.InvalidArgumentError: Term too long (> 245): XSUBJECT95251413
"XSUBJECT95251413" has some serious weirdness embedded in it.
Not sure how to search for that in the django shell?
I suspect the issue here is the 'weirdness' is in the Subject: rather than the message body. The script we were using only looked at the body.
The message you sent me had a Subject: which visually looked like
[PDC] MEGAMillions- Ticket №7383246 9
but the № and the digits were interspersed with strings of Unicode zero-width spaces, so the decoded Subject: was actually
'[PDC] MEGAMillions- Ticket
№7\u200b\u200b\u200b\u200b\u200b\u200b\u200b3\u200b\u200b\u200b\u200b\u200b\u200b\u200b8\u200b\u200b\u200b\u200b\u200b\u200b\u200b3\u200b\u200b\u200b\u200b\u200b\u200b\u200b2\u200b\u200b\u200b\u200b\u200b\u200b\u200b4\u200b\u200b\u200b\u200b\u200b\u200b\u200b6\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b9\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b\u200b'
where all the \u200b unicodes are the zero width spaces.
I suspect your current issue is with the Subject:. Actually, looking back at the thread, the prior one probably was too, but we also found messages that had this in the body.
To alter the specific script to look at the Subject: just change the line
if email.content.find('\u200b'*5) >= 0:
to
if email.subject.find('\u200b'*5) >= 0:
There's probably a reply there somewhere that has the Subject: but didn't quote the bad stuff in the body.
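Rather than trying several multiples of '\u200b', another option is to look for any "word" whose UTF-8 encoding is longer than the limit in the error message, in whichever field is being checked. The following is a rough, self-contained sketch: longest_word_bytes is a hypothetical helper, the sample subject is made up, and Xapian's own tokenization (and the XSUBJECT prefix it adds) will differ from a plain whitespace split, so treat the threshold as approximate.

```python
MAX_TERM_BYTES = 245  # limit quoted in the Xapian error


def longest_word_bytes(text):
    """UTF-8 byte length of the longest whitespace-separated word in text."""
    # str.split() does not treat U+200B as whitespace, so a run of
    # zero-width spaces stays glued to the characters around it.
    return max((len(w.encode("utf-8")) for w in text.split()), default=0)


subject = "[PDC] Ticket №7" + "\u200b" * 90 + "3"
print(longest_word_bytes(subject))                   # 275
print(longest_word_bytes(subject) > MAX_TERM_BYTES)  # True
```

In the shell loop, the condition could then be something like `if max(longest_word_bytes(email.subject), longest_word_bytes(email.content)) > 200:`, using a threshold somewhat below 245 to leave room for the term prefix.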
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
On Jun 10, 2020, at 11:33 AM, Mark Sapiro <mark@msapiro.net> wrote:
I suspect the issue here is the 'weirdness' is in the Subject: rather than the message body. The script we were using only looked at the body. [snip] There's probably a reply there somewhere that has the Subject: but didn't quote the bad stuff in the body.
That fixed it. It took a couple of passes with various multiples of zero-width spaces, but the archive finally completely indexed.
Thanks!
- Mark
mark@pdc-racing.net | 408-348-2878
Mark Dadgar wrote:
BTW, Digital Ocean is pretty awesome. If anyone wants to try it, ping me for a referral code good for $100 in credit to play with.
DigitalOcean also hosts the server that supports all the @python.org email including hundreds of Mailman 2 and 3 lists. You can also get a credit with DigitalOcean by following the link at the bottom of <https://mail.python.org/mailman/listinfo/>.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
On 5/13/20 3:24 PM, Mark Sapiro wrote:
DigitalOcean also hosts the server that supports all the @python.org email including hundreds of Mailman 2 and 3 lists. You can also get a credit with DigitalOcean by following the link at the bottom of <https://mail.python.org/mailman/listinfo/>.
How can I get the Mailman developers to promote my Mailman hosting services?
-- Please let me know if you need further assistance.
Thank you for your business. We appreciate our clients. Brian Carpenter EMWD.com
-- EMWD's Knowledgebase: https://clientarea.emwd.com/index.php/knowledgebase
EMWD's Community Forums http://discourse.emwd.com/
On May 13, 2020, at 12:24 PM, Mark Sapiro <mark@msapiro.net> wrote:
Mark Dadgar wrote:
BTW, Digital Ocean is pretty awesome. If anyone wants to try it, ping me for a referral code good for $100 in credit to play with.
DigitalOcean also hosts the server that supports all the @python.org email including hundreds of Mailman 2 and 3 lists. You can also get a credit with DigitalOcean by following the link at the bottom of <https://mail.python.org/mailman/listinfo/>.
Yes! Use the Python.org <http://python.org/> link instead. That’s better use of the referral credit!
- Mark
mark@pdc-racing.net | 408-348-2878
participants (3)
- Brian Carpenter
- Mark Dadgar
- Mark Sapiro