Archive import fails: "IndexError: list index out of range"
I am now importing the archives of my old mailman2 installation.
Unfortunately, with some of the larger and older lists, import fails halfway through, with an exception, the last message of which is
IndexError: list index out of range
Below is an example.
This means that the last few years are missing from the archive in several lists.
Are there any suggestions for how I could still get them imported?
Thanks!
Johannes
  File "/usr/lib/python3.8/email/_header_value_parser.py", line 2069, in get_msg_id
    token, value = get_dot_atom_text(value)
  File "/usr/lib/python3.8/email/_header_value_parser.py", line 1334, in get_dot_atom_text
    raise errors.HeaderParseError("expected atom at a start of "
email.errors.HeaderParseError: expected atom at a start of dot-atom-text but found '@localhost>'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/mailman/venv/bin/mailman-web", line 8, in <module>
    sys.exit(main())
  File "/opt/mailman/venv/lib/python3.8/site-packages/mailman_web/manage.py", line 37, in main
    execute_from_command_line(sys.argv)
  File "/opt/mailman/venv/lib/python3.8/site-packages/django/core/management/__init__.py", line 446, in execute_from_command_line
    utility.execute()
  File "/opt/mailman/venv/lib/python3.8/site-packages/django/core/management/__init__.py", line 440, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/opt/mailman/venv/lib/python3.8/site-packages/django/core/management/base.py", line 402, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/opt/mailman/venv/lib/python3.8/site-packages/django/core/management/base.py", line 448, in execute
    output = self.handle(*args, **options)
  File "/opt/mailman/venv/lib/python3.8/site-packages/hyperkitty/management/commands/hyperkitty_import.py", line 387, in handle
    importer.from_mbox(mbfile, report_name)
  File "/opt/mailman/venv/lib/python3.8/site-packages/hyperkitty/management/commands/hyperkitty_import.py", line 212, in from_mbox
    progress_marker.tick(unquote(message.get("message-id", 'n/a')))
  File "/usr/lib/python3.8/email/message.py", line 471, in get
    return self.policy.header_fetch_parse(k, v)
  File "/usr/lib/python3.8/email/policy.py", line 163, in header_fetch_parse
    return self.header_factory(name, value)
  File "/usr/lib/python3.8/email/headerregistry.py", line 607, in __call__
    return self[name](name, value)
  File "/usr/lib/python3.8/email/headerregistry.py", line 202, in __new__
    cls.parse(value, kwds)
  File "/usr/lib/python3.8/email/headerregistry.py", line 535, in parse
    kwds['parse_tree'] = parse_tree = cls.value_parser(value)
  File "/usr/lib/python3.8/email/_header_value_parser.py", line 2126, in parse_message_id
    token, value = get_msg_id(value)
  File "/usr/lib/python3.8/email/_header_value_parser.py", line 2073, in get_msg_id
    token, value = get_obs_local_part(value)
  File "/usr/lib/python3.8/email/_header_value_parser.py", line 1516, in get_obs_local_part
    if (obs_local_part[0].token_type == 'dot' or
IndexError: list index out of range
Johannes Rohr writes:
IndexError: list index out of range
That isn't the problem, it's the earlier failure, I think:
  File "/usr/lib/python3.8/email/_header_value_parser.py", line 2069, in get_msg_id
    token, value = get_dot_atom_text(value)
  File "/usr/lib/python3.8/email/_header_value_parser.py", line 1334, in get_dot_atom_text
    raise errors.HeaderParseError("expected atom at a start of "
email.errors.HeaderParseError: expected atom at a start of dot-atom-text but found '@localhost>'
It appears that HyperKitty is trying to find a domain name (the common use for "dot_atom_text" in mail headers in a Message-ID), but is finding "@localhost>" instead.
First I would try the check_hk_import script which is provided with HyperKitty. (You may also want to use the Mailman 2.1 cleanarch script to check for unescaped 'From ' lines, or the script at https://www.msapiro.net/scripts/cleanarch2 which can do that and check for unparseable Date: headers as well.)
If that fails to identify the problem, try
grep -iE '^(Message-ID|In-Reply-To).*@localhost>' problem_mbox_file | wc -l
to see how often that string appears, then
grep -iE '^(Message-ID|In-Reply-To).*@localhost>' problem_mbox_file | head -15
to see if you can identify the problem text.[1] The problem character before the "@" is probably "<", "@", or ".", but it may be one of these: ()<>[]:;@\,." (note that the double quote is itself a disallowed character). All other ASCII punctuation is allowed in message IDs.
Once we know how the message id(s) is (are) malformed, we can discuss how to deal with them.
Steve
Footnotes: [1] In theory we should also check References but those usually have multiple continuations, which is annoying to deal with in grep.
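The grep checks above, extended to folded References headers as the footnote suggests, can be sketched in Python. This is only an illustration, not part of HyperKitty or of check_hk_import: the function name find_suspect_ids and the regular expressions are assumptions made for this example, and the pattern is a rough approximation of the RFC 5322 msg-id syntax rather than a full parser. It reads the mbox with the standard library's mailbox module, which leaves headers unparsed, so malformed IDs do not raise.

```python
import mailbox
import re

# Rough approximation of RFC 5322 "atext"; an assumption for this sketch.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
# <dot-atom-text@something-without-brackets-or-spaces>
VALID_ID = re.compile(rf"<{ATEXT}+(?:\.{ATEXT}+)*@[^\s<>@]+>")
# Anything that looks like an angle-bracketed token.
ANY_ID = re.compile(r"<[^<>\s]*>")


def find_suspect_ids(mbox_path):
    """Return (message index, header name, token) for IDs that look malformed."""
    suspects = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for key, msg in box.iteritems():
            for name in ("Message-ID", "In-Reply-To", "References"):
                # get_all returns raw values; folded headers keep their
                # continuation lines, which the token regex simply skips over.
                for value in msg.get_all(name, []):
                    for token in ANY_ID.findall(str(value)):
                        if not VALID_ID.fullmatch(token):
                            suspects.append((key, name, token))
    finally:
        box.close()
    return suspects
```

Calling find_suspect_ids("problem_mbox_file") and printing the result gives a list to inspect by hand. The message index is the position in the mbox, counting from 0.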
Am 08.01.23 um 08:04 schrieb Stephen J. Turnbull:
[...] First I would try the check_hk_import script which is provided with HyperKitty. (You may also want to use the Mailman 2.1 cleanarch script to check for unescaped 'From ' lines, or the script at https://www.msapiro.net/scripts/cleanarch2 which can do that and check for unparseable Date: headers as well.)
Thanks a lot, Steve! Well, I was lazy and simply deleted all the messages that had been successfully imported and the one where the failure occurred, and after that, the import finished just fine.
Johannes
On 1/7/23 23:04, Stephen J. Turnbull wrote:
Johannes Rohr writes:
IndexError: list index out of range
That isn't the problem, it's the earlier failure, I think:
  File "/usr/lib/python3.8/email/_header_value_parser.py", line 2069, in get_msg_id
    token, value = get_dot_atom_text(value)
  File "/usr/lib/python3.8/email/_header_value_parser.py", line 1334, in get_dot_atom_text
    raise errors.HeaderParseError("expected atom at a start of "
email.errors.HeaderParseError: expected atom at a start of dot-atom-text but found '@localhost>'
It appears that HyperKitty is trying to find a domain name (the common use for "dot_atom_text" in mail headers in a Message-ID), but is finding "@localhost>" instead.
Actually, this occurs after the message has been converted to an email.message with policy=policy.default. Getting the Message-ID attempts to parse it and in this case, the dot_atom_text it's looking for is the 'local part' which doesn't exist. I.e. <xxx@localhost> would be OK, but <@localhost> is not.
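A minimal sketch of that behaviour, assuming only the standard library: with policy.default the header is parsed when it is fetched, so the failure happens on access rather than when the message is loaded. The helper name message_id_or_raw is made up for this example. It catches a broad Exception because the exact error differs between Python versions (Python 3.8 raises the IndexError shown in the traceback, and other versions may not raise at all), and on failure it falls back to policy.compat32, which returns the header text unparsed.

```python
import email
from email import policy


def message_id_or_raw(raw_bytes):
    """Return the Message-ID as text, falling back to the unparsed header."""
    try:
        msg = email.message_from_bytes(raw_bytes, policy=policy.default)
        # With policy.default, parsing of the header happens here, on access.
        value = msg["message-id"]
    except Exception:
        # Which exception is raised (if any) depends on the Python version,
        # so re-read the message with the legacy policy, which does not
        # parse structured headers.
        legacy = email.message_from_bytes(raw_bytes, policy=policy.compat32)
        value = legacy["message-id"]
    return None if value is None else str(value)
```

For example, message_id_or_raw(b"Message-ID: <@localhost>\n\nbody\n") returns the header text instead of propagating a parser error.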
First I would try the check_hk_import script which is provided with HyperKitty.
HyperKitty's contrib/check_hk_import script didn't find this issue. The GitLab branch was updated yesterday, but that update is not yet released. An up to date script which does find this issue is at <https://www.msapiro.net/scripts/check_hk_import>.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
I've got a message that keeps getting re-queued to the /hyperkitty/spool/ folder.
It is generating a similar error as above. Is there a way to purge such a message so it doesn't keep re-queuing?
"During handling of the above exception (expected atom at a start of dot-atom-text but found '[1db292f04b54496ab40544b0989a2122-JFBVALKQOJXWILKNK4YVA7CPNZSVMZLUPRCW2YLJNRKGK43UPRCXQ32TNV2HA===@microsoft.com]>')"
Thanks again for your help.
-- Dan
Never mind! It looks like just moving it out of that spool directory stopped the re-queuing. The message appeared to be spammy anyway.
I spoke too soon. The message is back in the spool folder. Is it in the database? Is it possible to purge this message from being imported to Hyperkitty via Mailman shell commands?
Thanks in advance.
-- Dan
What version? What OS? Any more relevant backtrace details in mailmanweb.log?
--Jered
----- On Jan 4, 2024, at 2:35 PM, Dan Caballero dancab@caltech.edu wrote:
I spoke too soon. The message is back in the spool folder. Is it in the database? Is it possible to purge this message from being imported to Hyperkitty via Mailman shell commands?
Thanks in advance.
-- Dan
We're running GNU Mailman 3.3.8 (Tom Sawyer) on Debian Linux.
Below is what mailman.log showed repeatedly. I removed the message from the hyperkitty spool directory again just before lunch today and it hasn't come back. That may have done the trick. This is the first time I have encountered this kind of issue. I looked at the .pck via mailman qfile and it definitely looked like some kind of junk mail from Microsoft, so we don't need to recover it.
"Jan 04 10:54:10 2024 (31780) archiving failed, re-queuing (mailing-list departmental_directory.caltech.edu, message <[1db292f04b54496ab40544b0989a2122-JFBVALKQOJXWILKNK4YVA7CPNZSVMZLUPRCW2YLJNRKGK43UPRCXQ32TNV2HA===@microsoft.com]>)
Jan 04 11:25:11 2024 (31780) archiving failed, re-queuing (mailing-list departmental_directory.caltech.edu, message <[569decf8204e4b869f4501ed8fbe379d-JFBVALKQOJXWILKCJQZFA7CPNZSVMZLUPRCW2YLJNRKGK43UPRCXQ32TNV2HA===@microsoft.com]>)
Jan 04 11:26:03 2024 (31780) archiving failed, re-queuing (mailing-list departmental_directory.caltech.edu, message <[569decf8204e4b869f4501ed8fbe379d-JFBVALKQOJXWILKCJQZFA7CPNZSVMZLUPRCW2YLJNRKGK43UPRCXQ32TNV2HA===@microsoft.com]>)
Jan 04 11:27:52 2024 (31780) archiving failed, re-queuing (mailing-list departmental_directory.caltech.edu, message <[569decf8204e4b869f4501ed8fbe379d-JFBVALKQOJXWILKCJQZFA7CPNZSVMZLUPRCW2YLJNRKGK43UPRCXQ32TNV2HA===@microsoft.com]>)
Jan 04 11:29:01 2024 (31780) archiving failed, re-queuing (mailing-list departmental_directory.caltech.edu, message <[569decf8204e4b869f4501ed8fbe379d-JFBVALKQOJXWILKCJQZFA7CPNZSVMZLUPRCW2YLJNRKGK43UPRCXQ32TNV2HA===@microsoft.com]>)
Jan 04 11:31:02 2024 (31780) archiving failed, re-queuing (mailing-list departmental_directory.caltech.edu, message <[569decf8204e4b869f4501ed8fbe379d-JFBVALKQOJXWILKCJQZFA7CPNZSVMZLUPRCW2YLJNRKGK43UPRCXQ32TNV2HA===@microsoft.com]>)
Jan 04 11:33:08 2024 (31780) archiving failed, re-queuing (mailing-list departmental_directory.caltech.edu, message <[569decf8204e4b869f4501ed8fbe379d-JFBVALKQOJXWILKCJQZFA7CPNZSVMZLUPRCW2YLJNRKGK43UPRCXQ32TNV2HA===@microsoft.com]>)"
Dan Caballero writes:
I've got a message that keeps getting re-queued to the /hyperkitty/spool/ folder. [...] "During handling of the above exception (expected atom at a start of dot-atom-text but found '[1db292f04b54496ab40544b0989a2122-JFBVALKQOJXWILKNK4YVA7CPNZSVMZLUPRCW2YLJNRKGK43UPRCXQ32TNV2HA===@microsoft.com]>')"
This is arguably a HyperKitty bug. The email package is correctly refusing to parse that as a Message-ID (it's been illegal syntax since RFC 822, if I remember correctly), but something, presumably HyperKitty, should catch the error and sanitize the value into valid message-id syntax. On the other hand, I had never heard of a message triggering that exception until a couple of months ago, so maybe it's the responsibility of spam filters.
I think when I ran into it, just moving it out of the HyperKitty archive queue fixed the queue. The client then put in a filter (it was indeed spam) in the MTA, and I haven't heard from them again.
In a later post you say that message came back, but I suspect it isn't the same message, but a new one likely from the same source.
I would say filter messages whose Message-IDs contain '[' into either /dev/null or a holding pen. Even Microsoft software knows how to generate syntactically valid message IDs, so I'm pretty sure that was not an @microsoft.com sender!
Steve
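The idea of sanitizing a malformed value to message-id syntax could look roughly like the sketch below. It is not a description of how Mailman or HyperKitty handle the problem. The function name sanitize_message_id, the fallback domain invalid.example, and the regular expression (a rough approximation of the RFC 5322 msg-id syntax) are all assumptions made for illustration. A value that already looks valid is returned unchanged; anything else is replaced by a deterministic ID derived from a hash of the original text, so the same bad header always maps to the same replacement.

```python
import hashlib
import re

# Rough approximation of RFC 5322 "atext"; an assumption for this sketch.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
VALID_ID = re.compile(rf"<{ATEXT}+(?:\.{ATEXT}+)*@[^\s<>@]+>")


def sanitize_message_id(raw_value, fallback_domain="invalid.example"):
    """Return raw_value if it looks like a valid msg-id, else a stable substitute."""
    value = raw_value.strip()
    if VALID_ID.fullmatch(value):
        return value
    # Hash the original text so repeated imports produce the same substitute.
    digest = hashlib.sha256(value.encode("utf-8", "replace")).hexdigest()[:32]
    return f"<{digest}@{fallback_domain}>"
```

The same VALID_ID check, used on its own, would also serve as the test for a filter that diverts such messages to a holding pen.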
On 1/4/24 9:42 PM, Stephen J. Turnbull wrote:
Dan Caballero writes:
I've got a message that keeps getting re-queued to the /hyperkitty/spool/ folder. [...] "During handling of the above exception (expected atom at a start of dot-atom-text but found '[1db292f04b54496ab40544b0989a2122-JFBVALKQOJXWILKNK4YVA7CPNZSVMZLUPRCW2YLJNRKGK43UPRCXQ32TNV2HA===@microsoft.com]>')"
This is arguably a HyperKitty bug. The email package is correctly refusing to parse that as a Message-ID (it's been illegal syntax since RFC 822 if I remember correctly), but something, presumably HyperKitty, should catch the error and do something to sanitize it to message-id syntax. On the other hand, I never heard of a message trigger that exception before a couple months ago, so maybe it's the responsibility of spam filters.
This is <https://gitlab.com/mailman/mailman/-/issues/1065> fixed in Mailman 3.3.9 by <https://gitlab.com/mailman/mailman/-/merge_requests/1099>.
In a later post you say that message came back, but I suspect it isn't the same message, but a new one likely from the same source.
I agree.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
Participants: Dan Caballero, Jered Floyd, Johannes Rohr, Mark Sapiro, Stephen J. Turnbull