First time running runjobs monthly since I installed it a few days ago, and it's been maxing out I/O for the last 8 hours. There doesn't appear to be any logging going on. I am using Xapian, by the way.
Is this normal? Is this going to happen every month, or is the first month special?
-- Paul Tomblin
Paul Tomblin via Mailman-users writes:
First time running runjobs monthly since I installed it a few days ago, and it's been maxing out I/O for the last 8 hours. There doesn't appear to be any logging going on. I am using Xapian, by the way.
Is this normal? Is this going to happen every month, or is the first month special?
Full-text indexing is a pretty slow operation, and it's I/O bound. If you've migrated a substantial archive, that will be a full-archive index. Anyway, my experience is that for a multi-terabyte archive on a 4 vCPU 16 GB dedicated Linode it was still chugging away a couple weeks later, using all the I/O it could. (The client said "OK, we're satisfied, we'll call if there are problems" before it completed. It may still be at it for all I know. :-)
I think that normally there's a partial reindex once a month (because of the asynchronous nature of email, referenced messages can appear after the current message gets indexed). But if you didn't manually trigger a full archive index at the initial migration, the first monthly will do the whole thing. Unless you're literally archiving terabytes per month, later monthlies should take much less time (but they'll use all the I/O you give them). I don't think Xapian allows you to throttle the indexer, and I don't think any shell's ulimit can throttle I/O, but I gather the Linux kernel can do it to some extent.
Steve
-- GNU Mailman consultant (installation, migration, customization) Sirius Open Source https://www.siriusopensource.com/ Software systems consulting in Europe, North America, and Japan
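On the point about throttling the indexer: one option on Linux is to launch the job through util-linux's ionice (together with nice) so that it runs in the idle I/O class. Below is a minimal Python sketch that only builds such a command line. The throttled_command helper is hypothetical, and the "django-admin runjobs monthly" invocation is an assumption about how the job is started (many installs wrap it in their own cron script). Whether the idle class has any effect depends on the I/O scheduler the kernel is using.

```python
import shlex


def throttled_command(command, io_class=3, niceness=19):
    """Prefix `command` with ionice and nice (hypothetical helper).

    io_class 3 is ionice's "idle" class: the process is meant to get disk
    time only when nothing else wants it. niceness 19 is the lowest CPU
    priority. Neither setting limits the total amount of I/O performed.
    """
    return ["ionice", "-c", str(io_class), "nice", "-n", str(niceness), *command]


# Assumption: the monthly jobs are started with django-admin; adjust to
# whatever your cron entry actually runs.
monthly = throttled_command(["django-admin", "runjobs", "monthly"])

# Print a shell-quoted line suitable for pasting into a crontab entry.
print(shlex.join(monthly))
```

The sketch deliberately does not execute anything; it only shows the shape of the wrapped command so it can be checked before it goes into cron.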
On Sun, Mar 1, 2026, at 5:43 AM, Stephen J. Turnbull wrote:
Paul Tomblin via Mailman-users writes:
Full-text indexing is a pretty slow operation, and it's I/O bound. If you've migrated a substantial archive, that will be a full-archive
Is 431,000+ messages a substantial archive?
after the current message gets indexed). But if you didn't manually trigger a full archive index at the initial migration, the first
I did a "update_index_one_list" on each archive after I brought it into hyperkitty. I was impressed with the full text search.
-- Paul Tomblin
On 3/1/26 05:10, Paul Tomblin via Mailman-users wrote:
I did a "update_index_one_list" on each archive after I brought it into hyperkitty. I was impressed with the full text search.
The HyperKitty update_and_clean_index job runs monthly and does the same thing as the hourly update_index job, plus removing index entries for messages that have been deleted from HyperKitty. It is very long-running, and there are lock contention issues between it and the hourly update_index job. Its purpose is to drop entries for messages that have been removed. It's not critical that it runs at all. If you don't want to run it, you can apply a patch like:
```
--- a/hyperkitty/jobs/update_and_clean_index.py
+++ b/hyperkitty/jobs/update_and_clean_index.py
@@ -31,7 +31,7 @@ from hyperkitty.search_indexes import update_index
 
 class Job(BaseJob):
     help = "Update the full-text index and clean old entries"
-    when = "monthly"
+    when = "never"
 
     def execute(self):
         run_with_lock(update_index, remove=True)
```
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
On Sun, Mar 1, 2026, at 3:38 PM, Mark Sapiro wrote:
On 3/1/26 05:10, Paul Tomblin via Mailman-users wrote:
The HyperKitty update_and_clean_index job runs monthly and does the same thing as the hourly update_index job plus removing index entries for messages that have been deleted from HyperKitty. It is very long running
Does that mean that removing the cron job for the monthly task would do the same thing as the patch?
Its purpose is to drop entries for messages that have been removed.
How do messages get removed? Is that an automatic thing or if not, is it something a user can do or only a list owner?
-- Paul Tomblin
On 3/1/26 13:03, Paul Tomblin via Mailman-users wrote:
On Sun, Mar 1, 2026, at 3:38 PM, Mark Sapiro wrote:
On 3/1/26 05:10, Paul Tomblin via Mailman-users wrote:
The HyperKitty update_and_clean_index job runs monthly and does the same thing as the hourly update_index job plus removing index entries for messages that have been deleted from HyperKitty. It is very long running
Does that mean that removing the cron job for the monthly task would do the same thing as the patch?
No. Removing the cron job would not run any monthly jobs and there is another one, namely the hyperkitty empty_threads job which removes empty threads.
Its purpose is to drop entries for messages that have been removed.
How do messages get removed? Is that an automatic thing or if not, is it something a user can do or only a list owner?
Only by list owners or site admins, via the "Delete this message" link in the message view or the "Delete this thread" link in the thread view.
And I suspect, but am not certain, that the only thing that results in an empty thread is deleting all the messages in a thread one by one without deleting the thread itself.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
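The difference between the patch and removing the monthly cron entry comes down to how jobs are selected by their "when" attribute. The following is a toy Python model of that selection, not the actual django-extensions or HyperKitty code: the class names, the JOBS registry, and the jobs_for function are all hypothetical, and only the job names and their schedules come from the messages above.

```python
class BaseJob:
    """Stand-in for a scheduled job; `when` names the schedule it belongs to."""
    when = None


class UpdateIndex(BaseJob):
    when = "hourly"


class UpdateAndCleanIndex(BaseJob):
    # Patched as in the diff above; the unpatched value is "monthly".
    when = "never"


class EmptyThreads(BaseJob):
    when = "monthly"


# Hypothetical registry mapping job names to job classes.
JOBS = {
    "update_index": UpdateIndex,
    "update_and_clean_index": UpdateAndCleanIndex,
    "empty_threads": EmptyThreads,
}


def jobs_for(when, jobs=JOBS):
    """Return the names of the jobs whose schedule matches `when`."""
    return sorted(name for name, job in jobs.items() if job.when == when)


# With the patch, a monthly run still picks up empty_threads but skips
# the long-running index cleanup.
print(jobs_for("monthly"))  # ['empty_threads']
```

In this model, dropping the monthly cron entry corresponds to never calling jobs_for("monthly") at all, which skips empty_threads as well as the index cleanup.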
Paul Tomblin writes:
On Sun, Mar 1, 2026, at 5:43 AM, Stephen J. Turnbull wrote:
Paul Tomblin via Mailman-users writes:
Full-text indexing is a pretty slow operation, and it's I/O bound. If you've migrated a substantial archive, that will be a full-archive
Is 431,000+ messages a substantial archive?
I would think that would be measured in hours, not days, unless a substantial proportion of the messages are the size of Congress's annual appropriations bill.
Of course, when it comes to how long the job will be pinning I/O, the important unit is "terabytes", not "messages".
after the current message gets indexed). But if you didn't manually trigger a full archive index at the initial migration, the first
I did a "update_index_one_list" on each archive after I brought it into hyperkitty. I was impressed with the full text search.
Yes, Xapian is quite good for my purposes, especially compared to "Whoosh" for performance. I can't speak to the other backends supported by Django Haystack, though I'd like to try them as well.
-- GNU Mailman consultant (installation, migration, customization) Sirius Open Source https://www.siriusopensource.com/ Software systems consulting in Europe, North America, and Japan