Hello, I am new to this list. Last week I upgraded my old Mailman 2.1 site (running since 2011) to Mailman 3.2.1; it wasn't quite straightforward, and part of the blame goes to the people who chose sqlite as the default database backend for the mailman3-full Debian package, but I finally have it running with PostgreSQL, which I should've been using from the beginning.
Now, however, I have noticed that indexing my archives in hyperkitty (using "manage.py update_index_one_list" for each list in sequence) takes quite a lot of time while using only one CPU (out of 4). Hence my question - is it safe to run several such processes in parallel, and if so, will it have any benefits? I have no idea how locking on the database is done in this code, maybe I'll end up running the jobs in sequence anyway due to exclusive locks… Any informed answers will be appreciated.
On Tue, May 28, 2019, at 5:56 AM, Szymon Sokół wrote:
Hello, I am new to this list. Last week I upgraded my old Mailman 2.1 site (running since 2011) to Mailman 3.2.1; it wasn't quite straightforward, and part of the blame goes to the people who chose sqlite as the default database backend for the mailman3-full Debian package, but I finally have it running with PostgreSQL, which I should've been using from the beginning.
The Debian package uses sqlite3 to ease the initial installation process; the README.Debian for the package does mention that sqlite3 is *not* recommended. [1]
Please feel free to open an issue with Debian if you think that is something which needs to be more visible to users.
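For anyone else moving off sqlite: pointing Mailman core at PostgreSQL is a small change in mailman.cfg. A minimal sketch, assuming a local PostgreSQL server; the database name, user, and password below are placeholders you would replace with your own:

```ini
# Sketch of a mailman.cfg [database] section for PostgreSQL.
# "mailman", "mmpass", and the database name are placeholders.
[database]
class: mailman.database.postgresql.PostgreSQLDatabase
url: postgresql://mailman:mmpass@localhost/mailman
```

Remember that HyperKitty/Postorius (Django) have their own database settings, configured separately in the Django settings file.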
Now, however, I have noticed that indexing my archives in hyperkitty (using "manage.py update_index_one_list" for each list in sequence) takes quite a lot of time while using only one CPU (out of 4). Hence my question - is it safe to run several such processes in parallel, and if so, will it have any benefits?
We haven't tried, but my assumption is that the default search backend is Python-based and will be single-process by default. One could think about splitting the emails to index into 4 parts and then running the indexing jobs in parallel, but that would be a lot more work.
Instead, I suggest you try one of the other search backends that we support. [2] Again, the default one works for small sites and scales okay when you have a few emails, but indexing takes a long time once you have more.
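For illustration, switching the Haystack backend is done in the Django settings file. A minimal sketch, assuming a locally running Elasticsearch; the exact engine path depends on which haystack/elasticsearch versions you have installed, and the URL and index name below are placeholders:

```python
# Sketch of a Haystack connection pointing at Elasticsearch instead of the
# default (Whoosh) backend. The engine path, URL, and index name are
# assumptions to adapt to your installed versions and host.
HAYSTACK_CONNECTIONS = {
    "default": {
        "ENGINE": "haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine",
        "URL": "http://127.0.0.1:9200/",
        "INDEX_NAME": "hyperkitty",
    },
}
```

After changing the backend you would need to rebuild the index (e.g. with manage.py rebuild_index) so the new engine has the data.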
We recently tested Elasticsearch on the server hosting this very list and the differences were *massive*.
This last part of the info does need to go somewhere in the docs, but, if you need help setting up one of the other search backends, let us know here.
I have no idea how locking on the database is done in this code, maybe I'll end up running the jobs in sequence anyway due to exclusive locks… Any informed answers will be appreciated.
Mailman-users mailing list -- mailman-users@mailman3.org
To unsubscribe send an email to mailman-users-leave@mailman3.org
https://lists.mailman3.org/mailman3/lists/mailman-users.mailman3.org/
--
thanks,
Abhilash Raj (maxking)
On 5/28/19 3:44 AM, Szymon Sokół wrote:
Now, however, I have noticed that indexing my archives in hyperkitty (using "manage.py update_index_one_list" for each list in sequence) takes quite a lot of time while using only one CPU (out of 4). Hence my question - is it safe to run several such processes in parallel, and if so, will it have any benefits? I have no idea how locking on the database is done in this code, maybe I'll end up running the jobs in sequence anyway due to exclusive locks… Any informed answers will be appreciated.
As Abhilash says, we really haven't tried. You could try running more than one "manage.py update_index_one_list" for different lists in parallel and see how it goes.
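If you want to experiment with that, one way is to fan the per-list jobs out across a small worker pool. A sketch only: index_one_list below is a stand-in for actually invoking "python manage.py update_index_one_list <list>" (e.g. via subprocess.run), and the list addresses are hypothetical:

```python
# Sketch: run several per-list indexing jobs concurrently.
# index_one_list is a placeholder; in a real setup it would shell out, e.g.
#   subprocess.run(["python", "manage.py", "update_index_one_list", name], check=True)
from concurrent.futures import ThreadPoolExecutor


def index_one_list(name):
    # Placeholder for the real manage.py invocation.
    return f"indexed {name}"


def index_in_parallel(list_names, workers=4):
    # Threads are fine here because the real work happens in subprocesses;
    # map() preserves the input order of the results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(index_one_list, list_names))
```

Whether this actually helps depends on whether the search backend serializes writes; with Whoosh's single on-disk index it may well not.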
I suspect it depends on the Haystack backend. Whoosh tends to be used because it's pure Python and installable with pip, but, again as Abhilash says, we have recently converted to Elasticsearch. There are some issues with that, both because Elasticsearch runs as a separate service on the host and because we are using Elasticsearch 5, for which Haystack support is still developmental (the current Elasticsearch version is 7). But, for example, indexing 180,000+ messages on mail.python.org takes 15 minutes elapsed time (about 14 minutes CPU) with Elasticsearch, which is an order of magnitude faster than the multiple hours it took with Whoosh.
On the other hand, this only affects search in HyperKitty, and you didn't have archive search in Mailman 2.1, so if it takes an extra day to get it working in HyperKitty, it's probably not a big deal.
You could just migrate all your lists and when done, run
manage.py rebuild_index &
and just let it run in the background until it's done.
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
participants (3)
- Abhilash Raj
- Mark Sapiro
- Szymon Sokół