Speed/progress of update_index_one_list?
Yesterday, I ran "mailman-web update_index_one_list" against a mailing list. The command output "Indexing 328077 emails" and that is the last I've heard from it.
The process is still running but, looking in ~mailman/web/logs, I can't see anything happening related to that list.
Is there any way to determine how far this process has got, or what it is actually doing? (Without stopping it)
For future indexing, what makes the most difference to the speed? Does it benefit from having multiple cores if I resize the AWS instance to a larger processor? What is the bottleneck for this process?
Thank you.
Regards
Philip
On Jan 13, 2022, at 11:46 PM, Philip Colmer <philip.colmer@linaro.org> wrote:
Yesterday, I ran "mailman-web update_index_one_list" against a mailing list. The command output "Indexing 328077 emails" and that is the last I've heard from it.
The process is still running but, looking in ~mailman/web/logs, I can't see anything happening related to that list.
Is there any way to determine how far this process has got, or what it is actually doing? (Without stopping it)
For future indexing, what makes the most difference to the speed? Does it benefit from having multiple cores if I resize the AWS instance to a larger processor? What is the bottleneck for this process?
Which indexer are you using? Whoosh is interminably slow. Think “days to generate a list index” slow.
I recommend Xapian. It is ridiculously fast.
- Mark
mark@pdc-racing.net | 408-348-2878
On Fri, 14 Jan 2022 at 07:50, Mark Dadgar <mark@pdc-racing.net> wrote:
Which indexer are you using? Whoosh is interminably slow. Think “days to generate a list index” slow.
Yeah, we're using Whoosh. Good to see it lives up to its name :)
I recommend Xapian. It is ridiculously fast.
Thank you. I had been put off using Xapian because the documentation, from a MM3-perspective, seemed sparse and confusing, but I'll stick with it to try and get it working if Whoosh is the root problem here.
Philip
I have now switched to Xapian but, in a way, my original questions still stand. Is there a way of monitoring the progress of "update_index_one_list"? What can I do to the specification of the server to make that process go (much) faster?
By the way, just in case anyone else is looking to use Xapian, installing and setting up Xapian for HyperKitty can be boiled down to:
- Download the git repo at https://github.com/notanumber/xapian-haystack
- Run the "install_xapian.sh" script (supplying the version number of Xapian)
- pip install xapian-haystack
- Edit the MM3 settings.py file to switch to using Xapian
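For the last step, a minimal sketch of what the settings.py change might look like, assuming the xapian-haystack backend is importable as `xapian_backend` (as that project's README describes). The base directory and the `xapian_index` directory name are placeholders for illustration, not values taken from this thread:

```python
import os

# Hypothetical base directory; a real settings.py usually defines BASE_DIR
# already, in which case reuse that value instead.
BASE_DIR = "/opt/mailman/web"

HAYSTACK_CONNECTIONS = {
    "default": {
        # Backend class provided by the xapian-haystack package.
        "ENGINE": "xapian_backend.XapianEngine",
        # Directory where the Xapian index is stored (placeholder name).
        "PATH": os.path.join(BASE_DIR, "xapian_index"),
    },
}
```

After switching backends the index has to be rebuilt, for example with the update_index_one_list command discussed in this thread.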
Regards
Philip
On Fri, 14 Jan 2022 at 08:38, Philip Colmer <philip.colmer@linaro.org> wrote:
I have now switched to Xapian but, in a way, my original questions still stand. Is there a way of monitoring the progress of "update_index_one_list"? What can I do to the specification of the server to make that process go (much) faster?
I have now found the "-v" option to increase the verbosity of the command but, beyond giving me a worker PID for the index process, I don't seem to be getting anything useful. I've also emailed the Xapian mailing list to see if they have any suggestions regarding server specification.
Philip Colmer writes:
I have now switched to Xapian but, in a way, my original questions still stand. Is there a way of monitoring the progress of "update_index_one_list"? What can I do to the specification of the server to make that process go (much) faster?
I see you've already gone to the Xapian lists, which I think is the right place to ask. We appreciate that! But I hope everyone will feel free to ask such questions here (just don't get too upset if we suggest going to the upstream channels in lieu of providing answers ourselves :-).
If you do get an answer, feel free to bring it back and ask for Mailman automation for it (as a tracker issue, please, so it doesn't get lost). As I've mentioned before, it's That Time of Year when we start to prepare for GSoC. No guarantees and eventually the students choose their own projects, but suggestions are welcome.
By the way, just in case anyone else is looking to use Xapian, installing and setting up Xapian for HyperKitty can be boiled down to:
Thanks for the hints!
Steve
On 1/14/22 12:38 AM, Philip Colmer wrote:
I have now switched to Xapian but, in a way, my original questions still stand. Is there a way of monitoring the progress of "update_index_one_list"? What can I do to the specification of the server to make that process go (much) faster?
HyperKitty does the indexing by calling the Haystack update_index command. See https://django-haystack.readthedocs.io/en/master/management_commands.html#up...
This command has options such as --verbosity and --workers. I think verbosity is set from the option provided to update_index_one_list and, if set to 2, will give some progress info.
The --workers option could increase parallelism, but HyperKitty unconditionally passes it as 0.
--
Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan
We've now completed the indexing of our lists, some of which are quite large (100K+ messages per list). I wanted to share our findings of using Xapian to do this.
Firstly, using "-v 2", e.g. "mailman-web update_index_one_list -v 2 <email address>", causes the command to print out the progress being made in batches of 1,000 messages.
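As a rough sketch of how that progress output could be watched over a long run, the wrapper below launches the same command and prefixes each output line with the elapsed time. The function names and the example list address are hypothetical; only the command line itself comes from the message above:

```python
import subprocess
import time


def build_command(list_address, verbosity=2):
    """Return the argv for indexing one list with progress output enabled."""
    return ["mailman-web", "update_index_one_list",
            "-v", str(verbosity), list_address]


def run_with_timestamps(argv):
    """Run argv and echo every output line prefixed with elapsed seconds."""
    start = time.monotonic()
    with subprocess.Popen(argv, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True) as proc:
        for line in proc.stdout:
            elapsed = time.monotonic() - start
            print(f"[{elapsed:8.1f}s] {line.rstrip()}")
    # Popen's context manager waits for the process, so returncode is set.
    return proc.returncode


if __name__ == "__main__":
    # Hypothetical list address; replace with a real one.
    run_with_timestamps(build_command("mylist@example.com"))
```

With timestamps on each batch line, it becomes possible to see whether the per-batch time stays roughly constant or grows as the index gets larger.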
Secondly, Xapian is single-threaded. It only allows one writer at a time. Therefore, we focussed on speed of the hard drives and, on the advice received from the Xapian list, memory. Since we were using AWS EC2 with EBS storage, I elected to move the EBS storage from gp2 to gp3, which has a higher base I/O figure, and I selected an r6i.8xlarge as this guaranteed 10 Gbps throughput to the EBS storage. It also delivered 32 vCPU and 256 GB RAM. It was possibly overkill but it did the job (see below) and we've now switched back to a t3a.xlarge.
Speed-wise, with the above configuration, the system indexed a list of nearly 85K messages in 46 minutes.
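To put that figure in context, here is a back-of-the-envelope calculation. It assumes indexing time scales linearly with message count on the same hardware, which may not hold in practice, and it treats "nearly 85K" as exactly 85,000. The 328,077 figure is the message count reported at the start of this thread:

```python
def estimate_minutes(total_messages, sample_messages=85_000, sample_minutes=46):
    """Linearly extrapolate indexing time from one measured run."""
    rate_per_minute = sample_messages / sample_minutes
    return total_messages / rate_per_minute


# Throughput of the measured run, in messages per minute.
print(round(85_000 / 46))                # 1848

# Estimated time for the 328,077-message list, in minutes.
print(round(estimate_minutes(328_077)))  # 178
```

Under those assumptions the large list would take roughly three hours, compared with the "days" reported for Whoosh earlier in the thread.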
I hope that is helpful information to anyone else who finds themselves migrating large Mailman 2 archives.
Regards
Philip
Mailman-users mailing list -- mailman-users@mailman3.org To unsubscribe send an email to mailman-users-leave@mailman3.org https://lists.mailman3.org/mailman3/lists/mailman-users.mailman3.org/
participants (4): Mark Dadgar, Mark Sapiro, Philip Colmer, Stephen J. Turnbull