On 5/28/19 3:44 AM, Szymon Sokół wrote:
Now, however, I have noticed that indexing my archives in hyperkitty (using "manage.py update_index_one_list" for each list in sequence) takes quite a lot of time while using only one CPU (out of 4). Hence my question - is it safe to run several such processes in parallel, and if so, will it have any benefits? I have no idea how locking on the database is done in this code, maybe I'll end up running the jobs in sequence anyway due to exclusive locks… Any informed answers will be appreciated.
As Abhilash says, we really haven't tried. You could try running more than one "manage.py update_index_one_list" for different lists in parallel and see how it goes.
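An untested sketch of that experiment, in case it helps. It assumes manage.py is in the current directory and that update_index_one_list takes the list address as its only argument. The list addresses, the worker count, and the helper names build_command and index_lists are made up for illustration. It says nothing about whether the backend copes with concurrent writers; it only launches the jobs side by side and collects their exit codes.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def build_command(list_address, manage_py="manage.py"):
    """Build the argv for indexing one list (argument form is an assumption)."""
    return [sys.executable, manage_py, "update_index_one_list", list_address]


def index_lists(list_addresses, workers=2, runner=subprocess.call):
    """Run one indexing job per list, at most `workers` at a time.

    Threads are enough here: each thread only waits on a child process,
    and the child processes are what can use the extra CPUs.
    Returns a dict mapping list address -> exit code from `runner`.
    """
    addresses = list(list_addresses)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(lambda addr: runner(build_command(addr)), addresses))
    return dict(zip(addresses, codes))


# Hypothetical usage; replace the addresses with your own lists:
#   results = index_lists(["list-a@example.com", "list-b@example.com"], workers=4)
#   failed = [addr for addr, code in results.items() if code != 0]
```

Comparing the elapsed time against a run with workers=1 would show whether the parallel jobs actually help or just queue up behind a lock.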
I suspect it depends on the Haystack backend. Whoosh tends to be used because it's pure Python and installable with pip, but again, as Abhilash says, we have recently converted to Elasticsearch. There are some issues with that: Elasticsearch runs as a separate service on the host, and we are using Elasticsearch 5, for which Haystack support is still developmental (the current Elasticsearch version is 7). Still, as an example, indexing 180,000+ messages on mail.python.org takes 15 minutes elapsed time (about 14 minutes CPU) with Elasticsearch, which is an order of magnitude faster than the multiple hours it took with Whoosh.
On the other hand, this only affects search in HyperKitty, and you didn't have archive search in Mailman 2.1, so if it takes an extra day to get it working in HyperKitty, it's probably not a big deal.
You could just migrate all your lists and when done, run
manage.py rebuild_index &
and just let it run in the background until it's done.
-- 
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan