HyperKitty full-text search performance

Loïc Dachary

Oct. 4, 2020

11:01 a.m.

Hi,

I'm in the process of importing a large number of mbox files (~30,000) for a few hundred mailing lists. So far around 300,000 mails (~12GB) have been imported from two mailing lists. The hyperkitty_import process took ~8 hours and created a 10GB MySQL database. The update_index_one_list runs for the two lists took a total of ~72 hours and created ~2GB worth of index in /var/lib/mailman3/web/fulltext_index, which is consistent with the fact that there are a lot of attachments (probably 10GB out of 12GB).
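For scale, the approximate figures above can be turned into average throughput. This is only back-of-the-envelope arithmetic on the rounded numbers quoted in this message, not a measurement; the function name is made up for the illustration.

```python
# Approximate figures quoted in the message (rounded by the author).
MAILS = 300_000      # mails imported so far
IMPORT_HOURS = 8     # duration of hyperkitty_import
INDEX_HOURS = 72     # total duration of update_index_one_list for both lists


def mails_per_second(mails: int, hours: float) -> float:
    """Average throughput in mails per second over the given duration."""
    return mails / (hours * 3600)


import_rate = mails_per_second(MAILS, IMPORT_HOURS)
index_rate = mails_per_second(MAILS, INDEX_HOURS)

print(f"import: {import_rate:.1f} mails/s")
print(f"index:  {index_rate:.1f} mails/s")
print(f"indexing took about {INDEX_HOURS / IMPORT_HOURS:.0f}x as long as importing")
```

In other words, indexing ran at roughly one mail per second, about an order of magnitude slower than the import itself.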

When I search for one word via the full-text search web interface, it takes around 30 seconds to complete (even when the same search is repeated) and I can see the process grow to use up to 6GB of resident memory. It is running on a recent physical machine with decent IO and an Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz.

Is this consistent with other people's experience?

Cheers

-- Loïc Dachary, Artisan Logiciel Libre


Loïc Dachary

October 2020

12:59 p.m.

After a little digging, it turns out the mailman3-web Debian GNU/Linux package I'm using is configured to use Whoosh as the Haystack backend, and Whoosh is not fit for volumes greater than a few hundred megabytes. I should switch to something else from the list of supported backends: https://django-haystack.readthedocs.io/en/master/backend_support.html
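To make the switch concrete, here is a minimal sketch of what changing the Haystack backend looks like in the Django settings used by mailman3-web. It assumes the xapian-haystack package (with the Xapian Python bindings) as the replacement, which is only one of the supported options; the Xapian index path is hypothetical, and the Whoosh entry is a guess at what a default install ships, using the index directory mentioned earlier.

```python
# Fragment of a Django settings module (for example the mailman3-web settings).

# Whoosh entry, roughly as a default install might configure it (assumption).
WHOOSH_CONNECTION = {
    "ENGINE": "haystack.backends.whoosh_backend.WhooshEngine",
    "PATH": "/var/lib/mailman3/web/fulltext_index",
}

# Xapian entry, assuming xapian-haystack is installed.
# The PATH below is a hypothetical location for the new index.
XAPIAN_CONNECTION = {
    "ENGINE": "xapian_backend.XapianEngine",
    "PATH": "/var/lib/mailman3/web/xapian_index",
}

# Haystack reads the backend to use from the "default" connection.
HAYSTACK_CONNECTIONS = {
    "default": XAPIAN_CONNECTION,
}
```

An index written by one backend cannot be read by another, so after changing the setting the index has to be built again, for instance with Haystack's rebuild_index management command.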


Brian Carpenter

3:07 p.m.

On 10/4/20 8:59 AM, Loïc Dachary wrote:

...

For our servers running Postorius/Hyperkitty, we use Xapian. I believe the mailman-users (MM3) list also uses Xapian. It is a lot quicker. For our Affinity/Empathy servers, we use Elasticsearch, which we are very happy with. Elasticsearch does have a higher memory requirement, I believe, than Xapian.

-- Brian Carpenter Harmonylists.com Emwd.com

Mark Sapiro

3:13 p.m.

On 10/4/20 8:07 AM, Brian Carpenter wrote:

...


Brian is correct about this list. We started with Whoosh, then went to Elasticsearch and ultimately to Xapian. Each step gave better performance and a smaller footprint than the one before.

-- Mark Sapiro <mark@msapiro.net>, San Francisco Bay Area, California
"The highway is for gamblers, better use your sense" - B. Dylan

Mark Dadgar

4:59 p.m.

On Oct 4, 2020, at 8:13 AM, Mark Sapiro <mark@msapiro.net> wrote:

...


I have ~700K posts in my archive and Xapian is FAST.

Mark

mark@pdc-racing.net | 408-348-2878
