"search this list" searches more than just this list
Hi there,
We run the following versions:
mailman-core: 3.3.9 mailman-core-api: 3.1 mailman-core-python: 3.11.6 hyperkitty: 1.3.8
We have several lists on our server. The following example list names are good enough for the purpose of my question:
xx@domain.tld
yy-xx@domain.tld
zz-xx@domain.tld
When I visit the archives for xx@domain.tld and enter a search phrase in “search this list”, the results include messages from the other two lists.
It looks like “search this list” searches for the keyword in any list whose name contains the name of "this list".
Is there a way to really search only “this list” when I go to the archives for xx@domain.tld?
I have tried changing the URL for the results to have a ‘^’ in front of the list name, but that gives an error :-)
Thanks in advance!
Marco van Tol
Marco van Tol writes:
We run the following versions:
mailman-core: 3.3.9 mailman-core-api: 3.1 mailman-core-python: 3.11.6 hyperkitty: 1.3.8
It looks like “search this list” searches for keyword in any listname that matches the one from "this list".
The code says it should match the mailing list name exactly. However, keyword search is not a database function, it's a function of the full-text indexer. HyperKitty uses Django's Haystack function to implement full-text indexed search, which in turn has several backends available (whoosh, xapian, elastic at least).
Which search backend are you using?
On 15 Jan 2024, at 16:37, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
[...]
Which search backend are you using?
Hi, thank you for your message. I dug into what we have, and found this:
HAYSTACK_CONNECTIONS = {
'default': {
'ENGINE': 'haystack.backends.whoosh_backend.WhooshEngine',
Marco van Tol
On 1/15/24 08:25, Marco van Tol wrote:
Hi, thank you for your message. I dug into what we have, and found this:
HAYSTACK_CONNECTIONS = { 'default': { 'ENGINE': 'haystack.backends.whoosh_backend.WhooshEngine',
I have tested and I can confirm that with two lists in the same domain named test and x-test, and with Whoosh as the backend, searching the test list returns hits from both lists, but with Xapian as the backend, only hits from the test list are returned.
Apparently when Whoosh is given a list name like test@example.com, it finds hits from all lists whose names contain test@example.com. This is not the case with Xapian.
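A toy illustration of the difference described above. This is not Whoosh or Xapian code, and the tokenising rule is only an assumption about why a search restricted to test@example.com could also match x-test@example.com: if the list name is indexed as analysed text and split on punctuation, every token of the shorter name also occurs in the longer one. The helper names are hypothetical.

```python
import re


def tokens(name: str) -> set[str]:
    """Split a list address on non-alphanumeric characters (assumed analyser behaviour)."""
    return {t for t in re.split(r"[^0-9a-z]+", name.lower()) if t}


def tokenised_match(query: str, list_name: str) -> bool:
    # Matches when every token of the query occurs in the list name.
    return tokens(query) <= tokens(list_name)


def exact_match(query: str, list_name: str) -> bool:
    # Matches only the identical address.
    return query.lower() == list_name.lower()


LISTS = ["test@example.com", "x-test@example.com"]

loose = [name for name in LISTS if tokenised_match("test@example.com", name)]
strict = [name for name in LISTS if exact_match("test@example.com", name)]
print(loose)   # both lists match under token matching
print(strict)  # only test@example.com matches exactly
```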
-- 
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
Mark Sapiro writes:
I have tested and I can confirm that with two lists in the same domain named test and x-test, and with Whoosh as the backend, searching the test list returns hits from both lists, but with Xapian as the backend, only hits from the test list are returned.
Wow! Thanks, Mark!
Steve
Yes, thank you! I’ll have a look into changing our backend to Xapian.
Hopefully it will be trivial :)
Marco van Tol
On 16 Jan 2024, at 09:20, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
[...]
Mailman-users mailing list -- mailman-users@mailman3.org To unsubscribe send an email to mailman-users-leave@mailman3.org https://lists.mailman3.org/mailman3/lists/mailman-users.mailman3.org/ Archived at: https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/message/...
On 18 Jan 2024, at 12:17, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Marco van Tol writes:
(Sorry for the top-quote)
Awwwwww....
Here I was hoping for a success report! ;-)
Heh, sorry :)
It took me a while to find time to try it, but I have just now.
I realize I might just need to be patient to wait for the next cron iteration, but here’s what I currently see happening.
I use containers based on those made by maxking, which luckily already install xapian.
So in settings.py I changed what was there to what was suggested by the documentation pointed to by Odhiambo, adapted for the paths the container can use:
HAYSTACK_CONNECTIONS = {
    'default': {
        'PATH': "/opt/mailman-web-data/xapian_index",
        'ENGINE': 'xapian_backend.XapianEngine'
    },
}
When I restart the container the path gets mkdir’ed, but nothing appears in it, and searches give an error.
The logs have this key error message:
ERROR 2024-01-22 15:27:34,540 25 django.request Internal Server Error: /hyperkitty/search
Traceback (most recent call last):
  File "/usr/lib/python3.11/site-packages/xapian_backend.py", line 1170, in _database
    database = xapian.Database(self.path)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/xapian/__init__.py", line 3665, in __init__
    _xapian.Database_swiginit(self, _xapian.new_Database(*args))
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
xapian.DatabaseNotFoundError: Couldn't detect type of database
During handling of the above exception, another exception occurred:
[...]
  File "/usr/lib/python3.11/site-packages/xapian_backend.py", line 1172, in _database
    raise InvalidIndexError('Unable to open index at %s' % self.path)
xapian_backend.InvalidIndexError: Unable to open index at /opt/mailman-web-data/xapian_index
I’ll dig a bit further to see if I can fix this, but figured I sort of owed you an update :-)
Marco van Tol
On 22 Jan 2024, at 16:30, Marco van Tol <mvantol@ripe.net> wrote:
[...]
Turns out I was just one ./manage.py rebuild_index away, so all good :-)
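The earlier "Unable to open index" error is consistent with the new backend starting from an empty directory: changing ENGINE does not by itself build an index. A minimal sketch of a pre-flight check follows. The helper index_is_populated is hypothetical and not part of HyperKitty or Haystack; only the path and the rebuild_index command come from this thread.

```python
from pathlib import Path


def index_is_populated(index_path: str) -> bool:
    """Return True if the directory exists and contains at least one entry."""
    path = Path(index_path)
    return path.is_dir() and any(path.iterdir())


# Path taken from the HAYSTACK_CONNECTIONS settings quoted in the thread.
index_path = "/opt/mailman-web-data/xapian_index"
if not index_is_populated(index_path):
    print(f"{index_path} is missing or empty; run ./manage.py rebuild_index")
```

This only tells you whether something has been written to the directory, not whether the index is complete or valid.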
Thanks!
Marco van Tol
On 22 Jan 2024, at 17:19, Stephen J. Turnbull <turnbull.stephen.fw@u.tsukuba.ac.jp> wrote:
Marco van Tol writes:
Turns out I was just one ./manage.py rebuild_index away, so all good :-)
Wonderful news! Thanks for following up, we do love to hear success stories.
Small unfortunate twist, this leads to:
[ERROR/MainProcess] Failed indexing 204001 - 205000 (retry 5/5): Term too long (> 245): ...
I have seen comments on that from back in 2020, but what’s the latest advice on how to deal with this?
Thanks so much in advance!
Marco van Tol
On 1/23/24 05:49, Marco van Tol wrote:
Small unfortunate twist, this leads to:
[ERROR/MainProcess] Failed indexing 204001 - 205000 (retry 5/5): Term too long (> 245): ...
I have seen comments on that from back in 2020, but what’s the latest advise on how to deal with this?
This issue is discussed at <https://github.com/notanumber/xapian-haystack/pull/181>. There are two patches for this issue. They are somewhat different approaches, but either one is OK. One patch is the one in the above PR <https://github.com/notanumber/xapian-haystack/pull/181/files>. The other is at <https://github.com/alexsilva/xapian-haystack/commit/a53523d2d0d13929a0729d487e7af79b57ee17a6>.
I use the one from the PR, but it requires a substitution s/force_text/str/.
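For context, force_text was removed in Django 4.0 (force_str is its replacement), which is why the older patch needs adjusting. The substitution above could be applied to the patched source with a sketch like the following. The function name and the sample line are hypothetical; the real patch text is not reproduced in this thread.

```python
import re


def replace_force_text(source: str) -> str:
    """Apply the s/force_text/str/ substitution to whole-word occurrences."""
    return re.sub(r"\bforce_text\b", "str", source)


sample = "term = force_text(value)"  # hypothetical line, for illustration only
print(replace_force_text(sample))  # term = str(value)
```

A blanket substitution would also rewrite an import such as `from django.utils.encoding import force_text` into an invalid `import str`, so any such import line has to be removed by hand.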
On 23 Jan 2024, at 19:09, Mark Sapiro <mark@msapiro.net> wrote:
[...]
I use the one from the PR, but it requires a substitution s/force_text/str/.
Hi, thanks! Just to double check, do you mean str or force_str in the substitution?
I found a comment on this topic regarding Django 4 where they advise using force_str. https://stackoverflow.com/questions/70382084/import-error-force-text-from-dj...
Thanks!
Marco van Tol
On 24 Jan 2024, at 12:24, Marco van Tol <mvantol@ripe.net> wrote:
[...]
Just to double check, do you mean str or force_str in the substitution?
Hm, never mind, sorry. I now see the current file already has str() rather than force_text() in it.
Marco
On Wed, Jan 17, 2024 at 11:13 AM Marco van Tol <mvantol@ripe.net> wrote:
Yes, thank you! I’ll have a look into changing our backend to Xapian.
Hopefully it will be trivial :)
With the virtualenv install method, it is trivial actually:
https://docs.mailman3.org/en/latest/install/virtualenv.html#setting-up-fullt...
-- Best regards, Odhiambo WASHINGTON, Nairobi,KE +254 7 3200 0004/+254 7 2274 3223 "Oh, the cruft.", egrep -v '^$|^.*#' ¯\_(ツ)_/¯ :-) [How to ask smart questions: http://www.catb.org/~esr/faqs/smart-questions.html]
On 25 Jan 2024, at 14:16, Marco van Tol <mvantol@ripe.net> wrote:
[...]
Okay, so, I got a bit further, but something still gets stuck.
Here’s what I did. Keep in mind I’m using containers that are built from a CI/CD pipeline, so I updated the pipeline to apply the patch attached to this email to /usr/lib/python3.11/site-packages/xapian_backend.py.
Before, I had 2 list servers with the “Term too long” issue; 1 got resolved by this, and the other did not. I opened a shell in the newly deployed container to confirm the patch was applied in it.
The other attachment to this email is a copy/paste of the full error from ./manage.py rebuild_index.
Is there something else special in the email that makes it choke and evades the xapian patch?
Thank you very much in advance!
I tried to change to ‘hash’, but the code in that bit of the function has not been tested enough.
For example, hole = sha224(hole.encode('utf8')).hexdigest() fails because the bytes object hole does not have an encode() method.
When I change it to hole = sha224(hole).hexdigest(), the next error is:
text = text[:match.start()] + hole + text[match.end():]
TypeError: can't concat str to bytes
The ‘hash’ part of that function needs some debugging.
Marco
Marco van Tol writes:
I tried to change to ‘hash’, but the code in that bit of the function has not been tested enough.
[...]
The ‘hash’ part of that function needs some debugging.
This is all deep in Django/Haystack/Xapian. You will get better advice there.
On 1/26/24 02:44, Marco van Tol wrote:
On 25 Jan 2024, at 14:16, Marco van Tol <mvantol@ripe.net> wrote:
[...]
The other attachment to this email is a copy/paste of the full error from ./manage.py rebuild_index.
The message above with the attached "full error" never got to the list. What is the error report?
I tried to change to ‘hash’, but the code in that bit of the function has not been tested enough.
[...]
When I change it to hole = sha224(hole).hexdigest(), the following error is:
That's only part of it. You need
hole = sha224(hole).hexdigest().encode('utf8')
text = text[:match.start()] + hole + text[match.end():]
TypeError: can't concat str to bytes
The ‘hash’ part of that function needs some debugging.
Yes, presumably because no one sets XAPIAN_LONG_TERM_METHOD=hash in the environment. Do you have a reason for this?
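Both errors quoted above are consistent with text and hole being bytes at that point in the patched function. That is an assumption based on the error messages, not a reading of the xapian-haystack source. Under that assumption, a self-contained sketch of the corrected splice looks like this; hash_term and the sample data are hypothetical.

```python
from hashlib import sha224


def hash_term(hole: bytes) -> bytes:
    """Replace an over-long term (bytes) with its hex SHA-224 digest, also as bytes."""
    # hexdigest() returns str, so encode it before splicing it back into bytes.
    return sha224(hole).hexdigest().encode("utf8")


text = b"prefix " + b"x" * 300 + b" suffix"
start, end = 7, 307  # span of the long term within text
patched = text[:start] + hash_term(text[start:end]) + text[end:]
print(patched)
```

A SHA-224 hex digest is always 56 characters, so the replacement term stays well under the 245 limit reported in the log.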
On 1/26/24 13:55, Mark Sapiro wrote:
Yes, presumably because no one sets XAPIAN_LONG_TERM_METHOD=hash in the environment. Do you have a reason for this?
Actually, it's not an environment setting. It's a Django setting
XAPIAN_LONG_TERM_METHOD = 'hash'
but in any case, I don't know why anyone would want that. Rather than truncating a long term in the index, it replaces that term with a sha224 hash of the term, which raises the question of how or why one would search for the hash rather than the term.
On 26 Jan 2024, at 22:55, Mark Sapiro <mark@msapiro.net> wrote:
[...]
The message above with the attached "full error" never got to the list. What is the error report?
Hm, I see. Not sure why. The reply I got from mail.mailman3.org at 2024-01-25 13:16:25.486 UTC was: “250 2.0.0 Ok: 12772 bytes queued as 55AFD105C02”
I’m pasting it at the bottom of this email. Sorry it didn’t come through.
I tried to change to ‘hash’, but the code in that bit of the function has not been tested enough.
[...]
That's only part of it. You need
hole = sha224(hole).hexdigest().encode('utf8')
I ended up changing it to this, which fixed that bit, and led to the next issue. :)
text = text[:match.start()] + hole + text[match.end():]
TypeError: can't concat str to bytes
The ‘hash’ part of that function needs some debugging.
Yes, presumably because no one sets XAPIAN_LONG_TERM_METHOD=hash in the environment. Do you have a reason for this?
I wanted to check and see if I could avoid the “Term too long (>245)" issue this way, but I haven’t gotten to the point where xapian is successful.
Right now I’m back at the original issue as I see no other solution than to go back to whoosh.
Thanks!
Marco van Tol
Paste:
Indexing 194620 emails
[ERROR/MainProcess] Failed indexing 156001 - 157000 (retry 5/5): Term too long (> 245): XSUBJECThttp://www.google.com/url?q=%68%74%74%70%73%3a%2f%2f%68%64%72%65%64%74%75%62... (pid 32): Term too long (> 245): XSUBJECThttp://www.google.com/url?q=%68%74%74%70%73%3a%2f%2f%68%64%72%65%64%74%75%62...
Traceback (most recent call last):
  File "/usr/lib/python3.11/site-packages/haystack/management/commands/update_index.py", line 119, in do_update
    backend.update(index, current_qs, commit=commit)
  File "/usr/lib/python3.11/site-packages/xapian_backend.py", line 98, in wrapper
    func(self, *args, **kwargs)
  File "/usr/lib/python3.11/site-packages/xapian_backend.py", line 505, in update
    database.replace_document(document_id, document)
xapian.InvalidArgumentError: Term too long (> 245): XSUBJECThttp://www.google.com/url?q=%68%74%74%70%73%3a%2f%2f%68%64%72%65%64%74%75%62...
[ERROR/MainProcess] Error updating hyperkitty using default
Traceback (most recent call last):
  File "/usr/lib/python3.11/site-packages/haystack/management/commands/update_index.py", line 297, in handle
    self.update_backend(label, using)
  File "/usr/lib/python3.11/site-packages/haystack/management/commands/update_index.py", line 342, in update_backend
    max_pk = do_update(
             ^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/haystack/management/commands/update_index.py", line 119, in do_update
    backend.update(index, current_qs, commit=commit)
  File "/usr/lib/python3.11/site-packages/xapian_backend.py", line 98, in wrapper
    func(self, *args, **kwargs)
  File "/usr/lib/python3.11/site-packages/xapian_backend.py", line 505, in update
    database.replace_document(document_id, document)
xapian.InvalidArgumentError: Term too long (> 245): XSUBJECThttp://www.google.com/url?q=%68%74%74%70%73%3a%2f%2f%68%64%72%65%64%74%75%62...
Traceback (most recent call last):
  File "/opt/mailman-web/./manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/usr/lib/python3.11/site-packages/django/core/management/__init__.py", line 446, in execute_from_command_line
    utility.execute()
  File "/usr/lib/python3.11/site-packages/django/core/management/__init__.py", line 440, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/lib/python3.11/site-packages/django/core/management/base.py", line 402, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/lib/python3.11/site-packages/django/core/management/base.py", line 448, in execute
    output = self.handle(*args, **options)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/haystack/management/commands/rebuild_index.py", line 65, in handle
    call_command("update_index", **update_options)
  File "/usr/lib/python3.11/site-packages/django/core/management/__init__.py", line 198, in call_command
    return command.execute(*args, **defaults)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/django/core/management/base.py", line 448, in execute
    output = self.handle(*args, **options)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/haystack/management/commands/update_index.py", line 297, in handle
    self.update_backend(label, using)
  File "/usr/lib/python3.11/site-packages/haystack/management/commands/update_index.py", line 342, in update_backend
    max_pk = do_update(
             ^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/haystack/management/commands/update_index.py", line 119, in do_update
    backend.update(index, current_qs, commit=commit)
  File "/usr/lib/python3.11/site-packages/xapian_backend.py", line 98, in wrapper
    func(self, *args, **kwargs)
  File "/usr/lib/python3.11/site-packages/xapian_backend.py", line 505, in update
    database.replace_document(document_id, document)
xapian.InvalidArgumentError: Term too long (> 245): XSUBJECThttp://www.google.com/url?q=%68%74%74%70%73%3a%2f%2f%68%64%72%65%64%74%75%62...
On 1/29/24 03:09, Marco van Tol wrote:
Right now I’m back at the original issue as I see no other solution than to go back to whoosh.
Thanks!
Marco van Tol
Paste:
Indexing 194620 emails [ERROR/MainProcess] Failed indexing 156001 - 157000 (retry 5/5): Term too long (> 245): XSUBJECThttp://www.google.com/url?q=%68%74%74%70%73%3a%2f%2f%68%64%72%65%64%74%75%62... (pid 32): Term too long (> 245): XSUBJECThttp://www.google.com/url?q=%68%74%74%70%73%3a%2f%2f%68%64%72%65%64%74%75%62...
I'm not sure why the patch you are using doesn't avoid this, but you could try the other patch at <https://github.com/alexsilva/xapian-haystack/commit/a53523d2d0d13929a0729d487e7af79b57ee17a6> instead. If that fails, you could always find the offending message in the database maybe with a query like
SELECT * FROM hyperkitty_email WHERE subject LIKE 'http://www.google.com/url?q=\%68\%74\%74%';
and modify or delete it - it's probably spam anyway.
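To show how the escaped pattern in that query behaves, here is a self-contained sketch against an in-memory SQLite database. SQLite is used only so the example runs anywhere; the sample rows are invented, and only the table and column names (hyperkitty_email, subject) come from the query above. PostgreSQL and MySQL treat the backslash as the default LIKE escape character, whereas SQLite needs an explicit ESCAPE clause.

```python
import sqlite3

# In-memory stand-in for the archive database, with invented sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hyperkitty_email (id INTEGER PRIMARY KEY, subject TEXT)")
conn.executemany(
    "INSERT INTO hyperkitty_email (subject) VALUES (?)",
    [
        ("http://www.google.com/url?q=%68%74%74%70%73",),
        ("http://www.google.com/url?q=x68y74z74",),
        ("Weekly meeting notes",),
    ],
)

# A backslash-escaped % is a literal percent sign; the final % is a wildcard.
pattern = r"http://www.google.com/url?q=\%68\%74\%74%"
rows = conn.execute(
    "SELECT id, subject FROM hyperkitty_email WHERE subject LIKE ? ESCAPE '\\'",
    (pattern,),
).fetchall()
print(rows)  # only the first row has literal %68%74%74 after "q="
```

Running a SELECT first, as here, lets you confirm which message matches before modifying or deleting anything.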
On 29 Jan 2024, at 18:29, Mark Sapiro <mark@msapiro.net> wrote:
[...]
I'm not sure why the patch you are using doesn't avoid this, but you could try the other patch at <https://github.com/alexsilva/xapian-haystack/commit/a53523d2d0d13929a0729d487e7af79b57ee17a6> instead.
[...]
I tried the other patch, which looked very promising until something that had come in over SMTP some day threw a spanner in the wheels:
[ERROR/MainProcess] Failed indexing 287001 - 288000 (retry 5/5): Term too long (> 245): XTEXTº@[åeèúp¢i*h{õimô;]>ò&žôyþiýã#dzç8"¹ë=
;æmyš€vqe.âés:æä>üzúõœ'âú·ž[]kzñ-µ€3æfdñù£8çô<bœkkñ/ãžæjïw¿òþp-ùšã7/'ûvksqé (pid 404): Term too long (> 245): XTEXTº@[
åeèúp¢i*h{õimô;]>ò&žôyþiýã#dzç8"¹ë=;æmyš€vqe.âés:æä>üzúõœ'âú·ž[]kzñ-µ€3æfdñù£8çô<bœkkñ/ãžæjïw¿òþp-ùšã7/'
ûvksqé
I agree that this is very likely the result of some spam, but the main point is that something that comes in over SMTP shouldn’t cause manual work on the listserver side.
I’ll try it with a maximum length of 122 which should never explode into more than 244 bytes, unless I overlook something.
I really wish Xapian would have a reliable fix for this.
Will keep you posted.
Marco van Tol
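A note on the arithmetic behind the limit of 122 mentioned above: 122 characters fit in 244 bytes only if every character takes at most 2 bytes in UTF-8. That holds for accented Latin letters, but a character such as € takes 3 bytes, and any field prefix such as XTEXT adds to the term as well. Whether the patch measures characters or bytes is not shown in this thread, so the following is only a sketch of a byte-based cut, with a hypothetical helper name.

```python
def truncate_utf8(term: str, max_bytes: int) -> str:
    """Cut term so its UTF-8 encoding is at most max_bytes, without splitting a character."""
    encoded = term.encode("utf-8")
    if len(encoded) <= max_bytes:
        return term
    # errors="ignore" drops a trailing partial multi-byte sequence left by the slice.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


ascii_term = "a" * 122
latin_term = "é" * 122  # 2 bytes per character in UTF-8
euro_term = "€" * 122   # 3 bytes per character in UTF-8

for term in (ascii_term, latin_term, euro_term):
    cut = truncate_utf8(term, 244)
    print(len(term), len(term.encode("utf-8")), len(cut.encode("utf-8")))
```

Measuring the encoded length rather than the character count keeps the result under the limit regardless of which characters an incoming message contains.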
On 5 Feb 2024, at 15:45, Marco van Tol <mvantol@ripe.net> wrote:
[...]
I’ll try it with a maximum length of 122 which should never explode into more than 244 bytes, unless I overlook something.
This ended up working. :-)
Summary: Patch from https://github.com/alexsilva/xapian-haystack/commit/a53523d2d0d13929a0729d48...
But with TERM_LENGTH_LIMIT = 122
And str() instead of force_str()
Thanks so much for all your efforts!
Marco van Tol
participants (4)
- Marco van Tol
- Mark Sapiro
- Odhiambo Washington
- Stephen J. Turnbull