Hyperkitty fulltext search performance
Hi,
I'm in the process of importing a large number of mbox files (~30,000) for a few hundred mailing lists. So far around 300,000 mails (~12GB) have been imported from two mailing lists. The hyperkitty_import process took ~8 hours and created a 10GB MySQL database. The update_index_one_list runs for the two lists took a total of ~72 hours and created ~2GB of index in /var/lib/mailman3/web/fulltext_index, which is consistent with the fact that there are a lot of attachments (probably 10GB out of 12GB).
When I search for one word via the full text search web interface, it takes around 30 seconds to complete (even when repeated twice) and I can see the process grow to use up to 6GB of resident memory. It is running on a recent physical machine with decent IO and an Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz.
Is this consistent with other people's experience?
Cheers
-- Loïc Dachary, Artisan Logiciel Libre
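To put the figures above side by side, here is a small sketch that turns them into average throughput. It uses only the numbers quoted in the message; the function name is made up for illustration, and the result is a rough average, not a benchmark.

```python
# Figures quoted in the message above.
MAILS_IMPORTED = 300_000
IMPORT_HOURS = 8    # hyperkitty_import
INDEX_HOURS = 72    # update_index_one_list, both lists combined


def mails_per_second(mails: int, hours: float) -> float:
    """Average throughput over a run of the given duration."""
    return mails / (hours * 3600)


import_rate = mails_per_second(MAILS_IMPORTED, IMPORT_HOURS)
index_rate = mails_per_second(MAILS_IMPORTED, INDEX_HOURS)

print(f"import: {import_rate:.1f} mails/s")
print(f"index:  {index_rate:.1f} mails/s")
print(f"indexing took {import_rate / index_rate:.0f}x as long as importing")
```

On these numbers the import ran at roughly 10 mails per second and the indexing at roughly 1 mail per second, so indexing took about nine times as long as importing.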
After a little digging it turns out the mailman3-web Debian GNU/Linux package I'm using is configured to use Whoosh as the haystack backend, and Whoosh is not fit for volumes greater than a few hundred megabytes. I should switch to something else from the list of supported backends: https://django-haystack.readthedocs.io/en/master/backend_support.html
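For readers wondering what such a switch might look like, here is a minimal sketch of a Django settings fragment that points haystack at Xapian. It assumes the separate xapian-haystack package and the Xapian Python bindings are installed. The index directory is a hypothetical path picked next to the Whoosh index location mentioned above, and where this fragment belongs depends on how the installation lays out its Django settings.

```python
# Sketch of a Django settings override; not the Debian package's shipped file.
import os

# Hypothetical index directory, chosen next to the Whoosh index path
# mentioned in the thread. Adjust to the local layout.
XAPIAN_INDEX_DIR = os.path.join("/var/lib/mailman3/web", "xapian_index")

HAYSTACK_CONNECTIONS = {
    "default": {
        # Engine path provided by the xapian-haystack package.
        "ENGINE": "xapian_backend.XapianEngine",
        "PATH": XAPIAN_INDEX_DIR,
    },
}

# For comparison, haystack's bundled Whoosh engine is
# "haystack.backends.whoosh_backend.WhooshEngine".
```

An index written by one backend cannot be read by another, so the index would have to be rebuilt after the change, for example by re-running update_index_one_list for each list.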
For our servers running Postorius/Hyperkitty, we use Xapian. I believe the mailman-users (MM3) list also uses Xapian. It is a lot quicker. For our Affinity/Empathy servers, we use Elasticsearch, which we are very happy with. I believe Elasticsearch does have a higher memory requirement than Xapian.
-- Brian Carpenter Harmonylists.com Emwd.com
Brian is correct about this list. We started with Whoosh, then went to Elasticsearch and ultimately to Xapian. Each step gave better performance and a smaller footprint than the one before.
-- Mark Sapiro <mark@msapiro.net>, San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan
On Oct 4, 2020, at 8:13 AM, Mark Sapiro <mark@msapiro.net> wrote:
Brian is correct about this list. We started with Whoosh, then went to Elasticsearch and ultimately to Xapian. Each step gave better performance and a smaller footprint than the one before.
+1
I have ~700K posts in my archive and Xapian is FAST.
- Mark
mark@pdc-racing.net | 408-348-2878
participants (4)
- Brian Carpenter
- Loïc Dachary
- Mark Dadgar
- Mark Sapiro