Hello,
This morning our Mailman 3/Hyperkitty/Postorius installation has become unresponsive. When looking at the machine, postgres is pegging all 4 CPUs on the machine.
We did some digging in postgres, and here is the query that is causing problems:
"SELECT hyperkitty_email.date FROM hyperkitty_email WHERE hyperkitty_email.mailinglist_id = 'fooey@rootsweb.com' ORDER BY hyperkitty_email.date ASC LIMIT 1;"
We have found that it doesn't matter what mailinglist_id is being used.
We have also found that this might be an issue with the LIMIT 1 - our investigation is showing that using LIMIT 1 in postgres makes things execute very differently than if you use a limit 2 or more, and that this is probably what is going on.
Basically, every time this gets called, it pegs an entire CPU for 20-30 seconds. It appears to be called a LOT.
So our users are getting 504 gateway timeouts right now.
What has changed:
We have been in the process of importing our 40 million archived emails from our old mailman 2 archives.
Questions:
We have tried doing a grep in the codebase - mailman3, hyperkitty, postorius, django - basically anything that might be calling this code, but we can't find it. Does anyone know what generates this query?
We are positing that it might be our archive imports that are causing this problem. Does anyone know of any sort of limit/threshold that might cause this to go haywire?
Any help you can give, any direction you can point us would be useful. Right now our lists are just down for our users.
Thank you,
Darren
On 05/24/2018 06:21 PM, Darren Smith wrote:
Hello,
This morning our Mailman 3/Hyperkitty/Postorius installation has become unresponsive. When looking at the machine, postgres is pegging all 4 CPUs on the machine.
We did some digging in postgres, and here is the query that is causing problems:
"SELECT hyperkitty_email.date FROM hyperkitty_email WHERE hyperkitty_email.mailinglist_id = 'fooey@rootsweb.com' ORDER BY hyperkitty_email.date ASC LIMIT 1;"
We have found that it doesn't matter what mailinglist_id is being used.
We have also found that this might be an issue with the LIMIT 1 - our investigation is showing that using LIMIT 1 in postgres makes things execute very differently than if you use a limit 2 or more, and that this is probably what is going on.
Basically, every time this gets called, it pegs an entire CPU for 20-30 seconds. It appears to be called a LOT.
So our users are getting 504 gateway timeouts right now.
What has changed:
We have been in the process of importing our 40 million archived emails from our old mailman 2 archives.
Questions:
- We have tried doing a grep in the codebase - mailman3, hyperkitty, postorius, django - basically anything that might be calling this code, but we can't find it. Does anyone know what generates this query? It looks like the query is to get the date from the last email for a mailinglist. AFAIK Hyperkitty checks this when new emails get it (maybe even during import)
It could also relate to the "active" state of mailinglists...
So we have just found the culprit, or at least ONE of the culprits. The calls that are being made to populate the date browse in the left hand side of hyperkitty for a given mailing list was causing this query to be run.
I still don't know what changed from yesterday to today that is causing postgres to freak out over this call - it used to work just fine.
On a different note, there appear to be a lot of issues in the system that don't show up until the system is used at scale. As a question, is there any sort of system of load testing that mailman/hyperkitty/postorious/etc. can be put through so that we can find these issues before they bring down our system in production?
And on a related note, we actually are in the process of setting up a testing server that we are going to populate with our live data. We should be able to run certain types of load testing against it. We won't really be able to test the flow of the 150,000 emails per day that go through our system, but testing the UI and what happens to it when it is under load, that we can test.
-Darren
On Thu, May 24, 2018 at 12:57 PM, Simon Hanna <simon@hannaweb.eu> wrote:
On 05/24/2018 06:21 PM, Darren Smith wrote:
Hello,
This morning our Mailman 3/Hyperkitty/Postorius installation has become unresponsive. When looking at the machine, postgres is pegging all 4 CPUs on the machine.
We did some digging in postgres, and here is the query that is causing problems:
"SELECT hyperkitty_email.date FROM hyperkitty_email WHERE hyperkitty_email.mailinglist_id = 'fooey@rootsweb.com' ORDER BY hyperkitty_email.date ASC LIMIT 1;"
We have found that it doesn't matter what mailinglist_id is being used.
We have also found that this might be an issue with the LIMIT 1 - our investigation is showing that using LIMIT 1 in postgres makes things execute very differently than if you use a limit 2 or more, and that this is probably what is going on.
Basically, every time this gets called, it pegs an entire CPU for 20-30 seconds. It appears to be called a LOT.
So our users are getting 504 gateway timeouts right now.
What has changed:
We have been in the process of importing our 40 million archived emails from our old mailman 2 archives.
Questions:
- We have tried doing a grep in the codebase - mailman3, hyperkitty, postorius, django - basically anything that might be calling this code, but we can't find it. Does anyone know what generates this query? It looks like the query is to get the date from the last email for a mailinglist. AFAIK Hyperkitty checks this when new emails get it (maybe even during import)
It could also relate to the "active" state of mailinglists...
Mailman-users mailing list mailman-users@mailman3.org https://lists.mailman3.org/mailman3/lists/mailman-users.mailman3.org/
participants (2)
-
Darren Smith
-
Simon Hanna