Hi,
Hyperkitty 1.3.4.
I am trying to download a complete list mbox by going to all threads view and using the download option. I have tried a couple of tools (Gunzip and Winrar) and both are giving me an unexpected end of file when trying to decompress the gz file.
Here is the list URL I am using: https://lists.hodgsonfamily.org/hyperkitty/list/plextalk@lists.hodgsonfamily...
Any suggestions? Thanks. Andrew.
On 4/25/21 2:20 PM, Andrew Hodgson wrote:
Hi,
Hyperkitty 1.3.4.
I am trying to download a complete list mbox by going to all threads view and using the download option. I have tried a couple of tools (Gunzip and Winrar) and both are giving me an unexpected end of file when trying to decompress the gz file.
Here is the list URL I am using: https://lists.hodgsonfamily.org/hyperkitty/list/plextalk@lists.hodgsonfamily...
Depending on the web server configuration, timeouts can occur when downloading large archive mboxes. Instead of downloading the entire mbox with <https://lists.hodgsonfamily.org/hyperkitty/list/plextalk@lists.hodgsonfamily.org/export/plextalk@lists.hodgsonfamily.org-2021-04.mbox.gz?start=2008-02-19&end=2021-04-25>, do it in pieces by adjusting start and end, e.g.
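For illustration only (the export filename, split points, and output names below are made up; adjust them to the list and date range you need), a short Python sketch that fetches the archive in smaller date-range pieces:
```
# Illustration only: download the export in smaller date-range pieces so
# no single request runs long enough to hit a timeout.
import urllib.request

base = ("https://lists.hodgsonfamily.org/hyperkitty/list/"
        "plextalk@lists.hodgsonfamily.org/export/piece.mbox.gz")
ranges = [("2008-02-19", "2012-12-31"),      # hypothetical split points
          ("2013-01-01", "2016-12-31"),
          ("2017-01-01", "2021-04-25")]

for i, (start, end) in enumerate(ranges, 1):
    url = f"{base}?start={start}&end={end}"
    with urllib.request.urlopen(url) as resp, open(f"piece-{i}.mbox.gz", "wb") as out:
        out.write(resp.read())
```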
Although, I don't think timing out is the issue, and I'm not sure what is, but I think it has something to do with messages in the archive. If I try to get the 3 pieces as above, the first piece with start=2008-02-19&end=2012-12-31 works, but the others don't, and even the smaller range fails the same way.
If I try to get 2013 month by month, all work except December, which gives me an internal server error. What's the traceback from that error?
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
On 25-Apr-21 18:08, Mark Sapiro wrote:
On 4/25/21 2:20 PM, Andrew Hodgson wrote:
Hi,
Hyperkitty 1.3.4.
I am trying to download a complete list mbox by going to all threads view and using the download option. I have tried a couple of tools (Gunzip and Winrar) and both are giving me an unexpected end of file when trying to decompress the gz file.
Here is the list URL I am using: https://lists.hodgsonfamily.org/hyperkitty/list/plextalk@lists.hodgsonfamily...
Depending on the web server configuration, timeouts can occur when downloading large archive mboxes. Instead of downloading the entire mbox with <https://lists.hodgsonfamily.org/hyperkitty/list/plextalk@lists.hodgsonfamily.org/export/plextalk@lists.hodgsonfamily.org-2021-04.mbox.gz?start=2008-02-19&end=2021-04-25>, do it in pieces by adjusting start and end, e.g.
Although, I don't think timing out is the issue, and I'm not sure what is, but I think it has something to do with messages in the archive. If I try to get the 3 pieces as above, the first piece with start=2008-02-19&end=2012-12-31 works, but the others don't, and even the smaller range fails the same way.
If I try to get 2013 month by month, all work except December, which gives me an internal server error. What's the traceback from that error?
The described timeouts are something that hyperkitty ought to be able to avoid. For Apache, the timeout is idle time between blocks of output. Hyperkitty can avoid this by generating the archive in segments (based on size or elapsed time), flushing its output buffer, generating a multi-file archive, and/or using Transfer-Encoding: chunked (chunked doesn't work for HTTP/2). It ought to be able to break the work into blocks of "n" messages and do something to generate output. Besides avoiding timeouts, working in segments allows the GUI to display meaningful progress (e.g. if you're loading with XMLHttpRequest, "onprogress"). It really oughtn't be up to the user to break up the request.
Until then: the Apache directive is "TimeOut" (or "ProxyTimeout"), with a default value of 60 (seconds). It's a server config/virtual host parameter, so if you're running in an environment where you only have .htaccess, you need admin help or you're out of luck.
Other webservers (especially those with accelerators) may have more granular timeouts.
On 4/25/21 4:37 PM, tlhackque via Mailman-users wrote:
The described timeouts are something that hyperkitty ought to be able to avoid. For apache, the timeout is idle time between blocks of output. Hyperkitty can avoid this by generating the archive in segments (based on size, or elapsed time), flushing its output buffer, generating a multi-file archive, and/or using Transfer-Encoding: chunked (chunked doesn't work for http/2). It ought to be able to break the work into blocks of "n" messages & do something to generate output. Besides avoiding timeouts, working in segments allows the GUI to display meaningful progress (e.g. if you're loading with XMLHttpRequest, "onprogress") It really oughtn't be up to the user to break up the request.
It is not the web server that times out. I'm not sure about uwsgi because I don't use it, but the timeouts I see are on servers that use gunicorn as the WSGI interface to Django and the timeout is in a gunicorn worker. This is controlled by the timeout setting in the gunicorn config. <https://docs.gunicorn.org/en/stable/settings.html#timeout>
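For example, a minimal gunicorn.conf.py sketch (assuming gunicorn is started with -c pointing at it; the values are only illustrative):
```
# gunicorn.conf.py (sketch): raise the worker timeout so a long-running
# export is not killed mid-download; 0 disables the timeout entirely.
bind = "127.0.0.1:8000"   # hypothetical bind address
workers = 2
timeout = 300             # seconds a worker may be silent before it is killed
```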
Note that even 300 seconds is not enough to download the entire <https://mail.python.org/archives/list/python-dev@python.org/> archive.
It may be possible to get HyperKitty to chunk the output to avoid this, but it doesn't currently do that. Care to submit an MR?
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
On 25-Apr-21 20:34, Mark Sapiro wrote:
On 4/25/21 4:37 PM, tlhackque via Mailman-users wrote:
[...]
It is not the web server that times out. I'm not sure about uwsgi because I don't use it, but the timeouts I see are on servers that use gunicorn as the WSGI interface to Django and the timeout is in a gunicorn worker. This is controlled by the timeout setting in the gunicorn config. <https://docs.gunicorn.org/en/stable/settings.html#timeout>
Note that even 300 seconds is not enough to download the entire <https://mail.python.org/archives/list/python-dev@python.org/> archive.
It may be possible to get HyperKitty to chunk the output to avoid this, but it doesn't currently do that. Care to submit an MR?
I'm afraid (u)WSGI, Django, and gunicorn are not technologies that I work with.
It sounds as if hyperkitty is compiling the entire archive before sending the first byte.
The gunicorn doc that you pointed to says
Workers silent for more than this many seconds are killed and restarted. Setting it to 0 has the effect of infinite timeouts by disabling timeouts for all workers entirely.
"Silent" sounds like the standard webserver "you have to push some bits, or we assume you're stuck".
My understanding is that gunicorn is a Python persistence server that is run behind a webserver proxy. So the (proxy) webserver (apache, nginx, ...) timeouts also apply and would need to be increased.
Might be interesting to try 0 (gunicorn) / 1200 (webserver) with your python-dev archive, time it and see how much (encoded) data is transferred... (I would expect most mailing list archives to compress nicely, though those with binary attachments won't.)
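Something along these lines could do the measurement (the export URL is just a placeholder):
```
# Sketch: time the download and count the compressed bytes transferred.
import time
import urllib.request

url = "https://example.org/hyperkitty/list/mylist@example.org/export/mylist.mbox.gz"

start = time.monotonic()
total = 0
with urllib.request.urlopen(url) as resp:
    while True:
        chunk = resp.read(64 * 1024)
        if not chunk:
            break
        total += len(chunk)
print(f"{total} compressed bytes in {time.monotonic() - start:.1f} seconds")
```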
But fiddling with timeouts is treating the symptom, not the root cause. The right solution is to stream, segment (or chunk) the output, because in the general case, no timeout is long enough. It'll always be possible to find an archive that's just one byte (or second) longer than any chosen timeout. (See the halting problem.) You want the timeout to catch a lack of progress, not total time that's a function of transaction size. (Webservers may also have limits on transaction size - e.g. mod_security, but they're only useful when the upper bound on a response is knowable.) Thus, the timeout(s) should be roughly independent of the archive size; on the order of time-to-first-byte (which ordinarily is longer than time between segments/chunks).
Also note that streaming requires fewer server resources than compiling a complete archive before sending, since you don't need to create the entire archive in memory (or in a tempfile). You only need enough memory to efficiently buffer the file I/O and to contain the compression tables/output buffer. Except for trivial cases, this will be independent of the archive size. The only downside is that if the comm link is slow, you may hold a reader lock on the source data for longer than necessary with the current scheme.
Seems as though this deserves an issue. I guess I could open one - but you have the experience/test cases.
On Apr 25, 2021, at 7:06 PM, tlhackque via Mailman-users <mailman-users@mailman3.org> wrote:
On 25-Apr-21 20:34, Mark Sapiro wrote:
On 4/25/21 4:37 PM, tlhackque via Mailman-users wrote:
[...]
It is not the web server that times out. I'm not sure about uwsgi because I don't use it, but the timeouts I see are on servers that use gunicorn as the WSGI interface to Django and the timeout is in a gunicorn worker. This is controlled by the timeout setting in the gunicorn config. <https://docs.gunicorn.org/en/stable/settings.html#timeout>
Note that even 300 seconds is not enough to download the entire <https://mail.python.org/archives/list/python-dev@python.org/> archive.
It may be possible to get HyperKitty to chunk the output to avoid this, but it doesn't currently do that. Care to submit an MR?
I'm afraid (u)WSGI, Django, and gunicorn are not technologies that I work with.
It sounds as if hyperkitty is compiling the entire archive before sending the first byte.
The gunicorn doc that you pointed to says
Workers silent for more than this many seconds are killed and restarted. Setting it to 0 has the effect of infinite timeouts by disabling timeouts for all workers entirely.
"Silent" sounds like the standard webserver "you have to push some bits, or we assume you're stuck".
My understanding is that gunicorn is a Python persistence server that is run behind a webserver proxy. So the (proxy) webserver (apache, nginx, ...) timeouts also apply and would need to be increased.
Might be interesting to try 0 (gunicorn) / 1200 (webserver) with your python-dev archive, time it and see how much (encoded) data is transferred... (I would expect most mailing list archives to compress nicely, though those with binary attachments won't.)
For uwsgi, I think the parameter is called harakiri (I don't know why such a name though):
if request takes longer than specified harakiri time (in seconds), the request will be dropped and the corresponding worker recycled.
This should be set to a value long enough to allow downloading the archive.
If you are using an http socket, then you want http-timeout.
Also, to set the timeout in the webserver (nginx):
location / {
    uwsgi_read_timeout 120s;
    uwsgi_send_timeout 120s;
    uwsgi_pass 0.0.0.0:8000;
    include uwsgi_params;
}
Or some other value that you want.
But fiddling with timeouts is treating the symptom, not the root cause. The right solution is to stream, segment (or chunk) the output, because in the general case, no timeout is long enough. It'll always be possible to find an archive that's just one byte (or second) longer than any chosen timeout. (See the halting problem.) You want the timeout to catch a lack of progress, not total time that's a function of transaction size. (Webservers may also have limits on transaction size - e.g. mod_security, but they're only useful when the upper bound on a response is knowable.) Thus, the timeout(s) should be roughly independent of the archive size; on the order of time-to-first-byte (which ordinarily is longer than time between segments/chunks).
Also note that streaming requires fewer server resources than compiling a complete archive before sending, since you don't need to create the entire archive in memory (or in a tempfile). You only need enough memory to efficiently buffer the file I/O and to contain the compression tables/output buffer. Except for trivial cases, this will be independent of the archive size. The only downside is that if the comm link is slow, you may hold a reader lock on the source data for longer than necessary with the current scheme.
Seems as though this deserves an issue. I guess I could open one - but you have the experience/test cases.
HyperKitty doesn't actually create an archive in memory or in a temp file. It uses a streaming response with on-the-fly compression to read from the database and relay to the client for download.
https://gitlab.com/mailman/hyperkitty/-/blob/master/hyperkitty/views/mlist.p...
The problem could be that uwsgi seems to kill an ongoing download, not just an idle worker, and that appears to be known and intentional behavior. I don't see a good way to disable it completely, but perhaps the timeout can be set to a value long enough that it essentially never kills a running worker that is still moving bits.
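Not HyperKitty's actual code, but a rough sketch of that streaming pattern (names and details simplified): each stored message is read, compressed, and handed to the client as it goes, so no complete archive is ever built in memory or on disk.
```
# Rough sketch of streaming an mbox export with on-the-fly gzip
# compression (simplified, hypothetical names).
import gzip
import io
from django.http import StreamingHttpResponse

def _mbox_gz_chunks(emails):
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for email in emails:                         # one stored message at a time
            gz.write(b"From nobody Thu Jan  1 00:00:00 1970\n")
            gz.write(email.as_message().as_bytes())  # real code must also escape "From " body lines
            gz.write(b"\n")
            gz.flush()                               # push compressed bytes out now
            yield buf.getvalue()                     # hand them to the client
            buf.seek(0)
            buf.truncate()
    yield buf.getvalue()                             # gzip trailer written on close

def export_mbox(request, emails):
    response = StreamingHttpResponse(_mbox_gz_chunks(emails),
                                     content_type="application/gzip")
    response["Content-Disposition"] = 'attachment; filename="list.mbox.gz"'
    return response
```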
-- thanks, Abhilash Raj (maxking)
On 26 Apr 2021, at 04:04, Abhilash Raj <maxking@asynchronous.in> wrote:
For uwsgi, I think the parameter is called harakiri (I don't know why such a name though):
if request takes longer than specified harakiri time (in seconds), the request will be dropped and the corresponding worker recycled.
Maybe it's because as a matter of honour it has to kill itself (or be killed) when it falls short of expectations. There's an interesting article here: https://en.wikipedia.org/wiki/Seppuku
Best wishes
Jonathan
Mark Sapiro wrote:
On 4/25/21 2:20 PM, Andrew Hodgson wrote:
Hi,
Hyperkitty 1.3.4.
I am trying to download a complete list mbox by going to all threads view and using the download option. I have tried a couple of tools (Gunzip and Winrar) and both are giving me an unexpected end of file when trying to decompress the gz file.
[...]
Depending on the web server configuration, timeouts can occur when downloading large archive mboxes. Instead of downloading the entire mbox with <https://lists.hodgsonfamily.org/hyperkitty/list/plextalk@lists.hodgsonfamily.org/export/plextalk@lists.hodgsonfamily.org-2021-04.mbox.gz?start=2008-02-19&end=2021-04-25>, do it in pieces by adjusting start and end, e.g.
[...]
Although, I don't think timing out is the issue, and I'm not sure what is, but I think it has something to do with messages in the archive. If I try to get the 3 pieces as above, the first piece with start=2008-02-19&end=2012-12-31 works, but the others don't, and even the smaller range fails the same way.
Yep there is a problem with an imported mbox which seems to be present somewhere in the archive on all the lists which I imported using that method. Here is a recent trace from the download option:
[2021-04-26 08:53:12 +0000] [28181] [ERROR] Error handling request
Traceback (most recent call last):
  File "/opt/mailman/venv/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 180, in handle_request
    for item in respiter:
  File "/opt/mailman/venv/lib/python3.7/site-packages/hyperkitty/views/mlist.py", line 335, in stream_mbox
    msg = email.as_message()
  File "/opt/mailman/venv/lib/python3.7/site-packages/hyperkitty/models/email.py", line 178, in as_message
    msg["Message-ID"] = "<%s>" % self.message_id
  File "/usr/lib/python3.7/email/message.py", line 409, in __setitem__
    self._headers.append(self.policy.header_store_parse(name, val))
  File "/usr/lib/python3.7/email/policy.py", line 145, in header_store_parse
    raise ValueError("Header values may not contain linefeed "
ValueError: Header values may not contain linefeed or carriage return characters
I do still have the original mbox files for the imported lists, which were on Mailman 2.1. The other thing is that when downloading these mbox files, I noticed they aren't the raw messages as we had in Mailman 2.1; they have been modified, with headers stripped and email addresses altered. Is there any way to get the raw mbox back, as in the option we had in Mailman 2.1?
Thanks. Andrew.
On 4/26/21 2:00 AM, Andrew Hodgson wrote:
Yep there is a problem with an imported mbox which seems to be present somewhere in the archive on all the lists which I imported using that method. Here is a recent trace from the download option:
[2021-04-26 08:53:12 +0000] [28181] [ERROR] Error handling request
Traceback (most recent call last):
  File "/opt/mailman/venv/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 180, in handle_request
    for item in respiter:
  File "/opt/mailman/venv/lib/python3.7/site-packages/hyperkitty/views/mlist.py", line 335, in stream_mbox
    msg = email.as_message()
  File "/opt/mailman/venv/lib/python3.7/site-packages/hyperkitty/models/email.py", line 178, in as_message
    msg["Message-ID"] = "<%s>" % self.message_id
  File "/usr/lib/python3.7/email/message.py", line 409, in __setitem__
    self._headers.append(self.policy.header_store_parse(name, val))
  File "/usr/lib/python3.7/email/policy.py", line 145, in header_store_parse
    raise ValueError("Header values may not contain linefeed "
ValueError: Header values may not contain linefeed or carriage return characters
Thanks for the traceback. The issue appears to be with folded headers that should either be unfolded before storing in HyperKitty's database or upon retrieval for this process. This is probably because of outlook.com folding Message-ID: headers. See https://gitlab.com/mailman/mailman/-/issues/844 for one manifestation of this.
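For reference, the failure is easy to reproduce with a header value that still contains a folded line break (the id below is made up):
```
# A made-up folded Message-ID reproduces the ValueError from the traceback:
# modern email policies refuse header values containing CR or LF.
from email.message import EmailMessage

msg = EmailMessage()
try:
    msg["Message-ID"] = "<ABC123\r\n @example.com>"
except ValueError as err:
    print(err)   # Header values may not contain linefeed or carriage return characters
```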
I think if you replace the
msg["Message-ID"] = "<%s>" % self.message_id
line at 178 in /opt/mailman/venv/lib/python3.7/site-packages/hyperkitty/models/email.py with
msg["Message-ID"] = "<%s>" % self.message_id.strip(' <>\r\n')
it will avoid this issue.
Please try it and report.
I do still have the original mbox files for the imported lists which were on Mailman 2.1. The other thing is when downloading these mbox files I noticed they aren't the raw messages as we had in Mailman 2.1 but modified with headers stripped and email addresses modified. Is there any way to get the raw mbox back as in the option we had in Mailman 2.1?
Not from HyperKitty. It doesn't store the raw messages anywhere, but if you enable the prototype archiver for a list, the raw messages will be stored in maildir format in var/archives/prototype/list@example.com/ for each enabled list. Messages imported with hyperkitty_import won't be there, but all posts to the list, at least since enabling the prototype archiver, will be.
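If you do enable it, a small sketch like this (the paths are examples; point them at your actual installation) can rebuild a raw mbox from that maildir with Python's mailbox module:
```
# Sketch: rebuild a raw mbox from the prototype archiver's maildir.
import mailbox

maildir = mailbox.Maildir("var/archives/prototype/list@example.com", create=False)
out = mailbox.mbox("list@example.com.mbox")
out.lock()
try:
    for message in maildir:   # the mailbox module converts Maildir messages to mbox format
        out.add(message)
finally:
    out.flush()
    out.unlock()
```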
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
Mark Sapiro wrote:
On 4/26/21 2:00 AM, Andrew Hodgson wrote:
Yep there is a problem with an imported mbox which seems to be present somewhere in the archive on all the lists which I imported using that method. Here is a recent trace from the download option:
[...]
Thanks for the traceback. The issue appears to be with folded headers that should either be unfolded before storing in HyperKitty's database or upon retrieval for this process. This is probably because of outlook.com folding Message-ID: headers. See https://gitlab.com/mailman/mailman/-/issues/844 for one manifestation of this.
I think if you replace the
msg["Message-ID"] = "<%s>" % self.message_id
line at 178 in /opt/mailman/venv/lib/python3.7/site-packages/hyperkitty/models/email.py with
msg["Message-ID"] = "<%s>" % self.message_id.strip(' <>\r\n')
it will avoid this issue.
I tried this but get the same error, hope I did the modification correctly.
[2021-04-30 19:56:31 +0000] [22104] [ERROR] Error handling request
Traceback (most recent call last):
  File "/opt/mailman/venv/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 180, in handle_request
    for item in respiter:
  File "/opt/mailman/venv/lib/python3.7/site-packages/hyperkitty/views/mlist.py", line 335, in stream_mbox
    msg = email.as_message()
  File "/opt/mailman/venv/lib/python3.7/site-packages/hyperkitty/models/email.py", line 178, in as_message
    msg["Message-ID"] = "<%s>" % self.message_id.strip(' <>\r\n')
  File "/usr/lib/python3.7/email/message.py", line 409, in __setitem__
    self._headers.append(self.policy.header_store_parse(name, val))
  File "/usr/lib/python3.7/email/policy.py", line 145, in header_store_parse
    raise ValueError("Header values may not contain linefeed "
ValueError: Header values may not contain linefeed or carriage return characters
Thanks. Andrew.
On 4/30/21 1:00 PM, Andrew Hodgson wrote:
I tried this but get the same error, hope I did the modification correctly.
[2021-04-30 19:56:31 +0000] [22104] [ERROR] Error handling request
Traceback (most recent call last):
  File "/opt/mailman/venv/lib/python3.7/site-packages/gunicorn/workers/sync.py", line 180, in handle_request
    for item in respiter:
  File "/opt/mailman/venv/lib/python3.7/site-packages/hyperkitty/views/mlist.py", line 335, in stream_mbox
    msg = email.as_message()
  File "/opt/mailman/venv/lib/python3.7/site-packages/hyperkitty/models/email.py", line 178, in as_message
    msg["Message-ID"] = "<%s>" % self.message_id.strip(' <>\r\n')
This is the change I suggested, so you did it correctly.
File "/usr/lib/python3.7/email/message.py", line 409, in __setitem__ self._headers.append(self.policy.header_store_parse(name, val)) File "/usr/lib/python3.7/email/policy.py", line 145, in header_store_parse raise ValueError("Header values may not contain linefeed " ValueError: Header values may not contain linefeed or carriage return characters
I don't understand why that change didn't work, but you could try replacing that line with
msg["Message-ID"] = "<%s>" % re.sub('[ <>\r\n]', '',
self.message_id)
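One possible explanation, illustrated with a made-up message-id: str.strip() only removes characters from the ends of the string, so a line break folded into the middle of the stored message_id survives .strip(), whereas re.sub() removes it wherever it appears.
```
import re

message_id = "ABC123\r\n @example.com"            # made-up id with a folded line break inside
print(repr(message_id.strip(' <>\r\n')))          # inner "\r\n " survives -> header still invalid
print(repr(re.sub('[ <>\r\n]', '', message_id)))  # 'ABC123@example.com'
```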
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
On 4/30/21 3:05 PM, Mark Sapiro wrote:
I don't understand why that change didn't work, but you could try replacing that line with
msg["Message-ID"] = "<%s>" % re.sub('[ <>\r\n]', '', self.message_id)
I tried to convince my MUA to not wrap that line, but I obviously failed. In any case it is one line,
msg["Message-ID"] = "<%s>" % re.sub('[ <>\r\n]', '', self.message_id)
but indented 8 spaces as are the immediately preceding and following lines.
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
Mark Sapiro wrote:
msg["Message-ID"] = "<%s>" % re.sub('[ <>\r\n]', '', self.message_id)
Thanks that did the trick. Will this need to go in the upstream release or is there any way I could re-import those messages into the database with a newer version of the import command from the original mbox from 2.1 to fix this?
Thanks. Andrew.
On 5/1/21 4:58 AM, Andrew Hodgson wrote:
Mark Sapiro wrote:
msg["Message-ID"] = "<%s>" % re.sub('[ <>\r\n]', '', self.message_id)
Thanks that did the trick. Will this need to go in the upstream release or is there any way I could re-import those messages into the database with a newer version of the import command from the original mbox from 2.1 to fix this?
Sorry for not following up sooner. This got deferred and then forgotten :(
I would like to see the offending message from the original mbox so I can develop an appropriate fix.
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
Mark Sapiro wrote:
On 5/1/21 4:58 AM, Andrew Hodgson wrote:
Mark Sapiro wrote:
msg["Message-ID"] = "<%s>" % re.sub('[ <>\r\n]', '', self.message_id)
Thanks that did the trick. Will this need to go in the upstream release or is there any way I could re-import those messages into the database with a newer version of the import command from the original mbox from 2.1 to fix this?
Sorry for not following up sooner. This got deferred and then forgotten :(
I would like to see the offending message from the original mbox so I can develop an appropriate fix.
What is the best way of getting that to you? You can download the mbox from your end of course, or I can revert the change so you can see where the original stopped.
Thanks. Andrew.
I received the mbox from Andrew. The issue is reported at https://gitlab.com/mailman/hyperkitty/-/issues/382 and fixed for the next HyperKitty release by https://gitlab.com/mailman/hyperkitty/-/merge_requests/346
This only fixes hyperkitty_import so that it no longer creates these unexportable messages in the archive. It does nothing about such messages that have already been added to HyperKitty. That issue is reported at https://gitlab.com/mailman/hyperkitty/-/issues/383 and fixed for the next HyperKitty release by https://gitlab.com/mailman/hyperkitty/-/merge_requests/347
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
participants (5)
- Abhilash Raj
- Andrew Hodgson
- Jonathan M
- Mark Sapiro
- tlhackque