Hello.
When running hyperkitty_import to import Mailman 2.1 archives, I frequently get errors like
django.db.utils.OperationalError: (1366, "Incorrect string value: '\\xE1\\x90\\xA7\\x0A\\x0AO...' for column mailman3web.hyperkitty_email.content at row 1")
Is there a known cause for this? Is it fixed in some release more recent than 0+20180916?
Thanks.
-Dave
-- Dave Hall Binghamton University kdhall@binghamton.edu
On 5/6/23 11:27, Dave Hall via Mailman-users wrote:
Hello.
When running hyperkitty_import to import Mailman 2.1 archives, I frequently get errors like
django.db.utils.OperationalError: (1366, "Incorrect string value: '\\xE1\\x90\\xA7\\x0A\\x0AO...' for column mailman3web.hyperkitty_email.content at row 1")
Is there a known cause for this? Is it fixed in some release more recent than 0+20180916?
I'm not sure what's going on here. My first thought was that the database is MySQL or MariaDB, the failing text contains a 4-byte UTF-8 encoding, and the column charset is defined as utf8 rather than utf8mb4. However, '\\xE1\\x90\\xA7\\x0A\\x0A' is the 3-byte UTF-8 encoding of CANADIAN SYLLABICS FINAL MIDDLE DOT ('\\xE1\\x90\\xA7') followed by two newlines, so I'm not sure what the issue is.
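One way to check this reasoning is to decode the bytes from the error message and count how many bytes each character needs. This is only a sketch: the helper name `describe_utf8` is made up for illustration, and the byte string is the fragment from the error with the trailing `O...` dropped.

```python
import unicodedata


def describe_utf8(raw: bytes):
    """Return (code point, name, UTF-8 byte length) for each character in raw."""
    text = raw.decode("utf-8")
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), len(ch.encode("utf-8")))
        for ch in text
    ]


# Bytes quoted in the MySQL error (variable name is hypothetical).
fragment = b"\xe1\x90\xa7\x0a\x0a"
rows = describe_utf8(fragment)
for row in rows:
    print(row)

# MySQL's legacy "utf8" charset stores at most 3 bytes per character,
# so only characters that need 4 bytes would require utf8mb4.
needs_utf8mb4 = any(width == 4 for _, _, width in rows)
print("needs utf8mb4:", needs_utf8mb4)
```

If no character in a failing message needs 4 bytes, the column charset alone probably does not explain the error, and other settings such as the connection charset are worth checking.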
When you encounter this error, there should also be a message like "Message <Message-ID> failed to import, skipping" or perhaps "Failed adding message <Message-ID>: <error message>".
Using that Message-ID value, you should be able to find the original message in the input mbox. I would like to see that mbox message. You can remove any personal info. I only really need the message body and any Content-*: headers.
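If the mbox is large, the lookup can be scripted with Python's standard `mailbox` module. The following is a minimal sketch, not part of Mailman or HyperKitty; the function names `find_message` and `content_headers` are invented for this example.

```python
import mailbox


def find_message(mbox_path, message_id):
    """Return the first message in the mbox whose Message-ID matches, else None."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            # str() guards against Header objects returned for non-ASCII headers.
            if str(msg.get("Message-ID", "")).strip() == message_id:
                return msg
    finally:
        box.close()
    return None


def content_headers(msg):
    """Return the Content-* headers of the message and of each MIME part."""
    return [
        (name, str(value))
        for part in msg.walk()
        for name, value in part.items()
        if name.lower().startswith("content-")
    ]
```

Calling `content_headers(find_message(path, "<the-message-id>"))` would list the Content-Type and Content-Transfer-Encoding values for the top level and every part, which is the information requested above.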
-- Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan
On 2023-05-07 07:39 Mark Sapiro writes:
On 5/6/23 11:27, Dave Hall via Mailman-users wrote:
django.db.utils.OperationalError: (1366, "Incorrect string value: '\\xE1\\x90\\xA7\\x0A\\x0AO...' for column mailman3web.hyperkitty_email.content at row 1")
Is there a known cause for this?
Almost always nonconforming user clients. You write "frequently", but I'm guessing it's maybe 1% or 2% of all the messages, right? I suspect it's a particular author or institution using such a client.
Is it fixed in some release more recent than 0+20180916?
This is almost surely not something we can fix. Based on the error, it's in third party code (Django) that we import. It's possible we could provide a more explanatory message, but the only thing we can sensibly do is hold the message for an admin to edit it by hand. As long as the text in question and the Content-Type header aren't too sensitive, we can help with identifying charset (often UTF-8 as Mark suggests but why someone would be encoding FINAL MIDDLE DOT is a mystery to me).
I'm not sure what's going on here. My first thought was that the database is MySQL or MariaDB, the failing text contains a 4-byte UTF-8 encoding, and the column charset is defined as utf8 rather than utf8mb4. However, '\\xE1\\x90\\xA7\\x0A\\x0A' is the 3-byte UTF-8 encoding of CANADIAN SYLLABICS FINAL MIDDLE DOT ('\\xE1\\x90\\xA7') followed by two newlines, so I'm not sure what the issue is.
I'm guessing home-made client, possibly Asian or a spammer, which provides no content-type header. Even today I see raw 8-bit encoded text without an encoding spec from legitimate Japanese and Chinese sources, and of course with spammers all bets are off -- they might even do it deliberately to crash filters.
-- University of Tsukuba Faculty of Policy and Planning Sciences Tennodai 1-1-1, Tsukuba 305-8573 JAPAN tel/fax: +81-29-853-5091 turnbull@sk.tsukuba.ac.jp https://turnbull.sk.tsukuba.ac.jp/
I have located an email with a similar error in a V2.1 .mbox file. The post was by a faculty member in my organization using Thunderbird 60.2.1 on a Mac - OSX 10.13. The list was used to correspond with students taking a programming course.
The message body is multi-part MIME with two sections:
This is a multi-part message in MIME format.
--------------4BAB6CD38F4927471D366565
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
and
--------------4BAB6CD38F4927471D366565
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 8bit
The hexadecimal string \\xEF\\xBB\\xBF\\xEF\\xBB\\xBF appears in one place in each section.
I'm not sure exactly what this means. I did check my MariaDB (as Mark hinted) and found that it is set to utf8 rather than utf8mb4. I'd be willing to change this, but would I also have to do something to update the Mailman databases, since those tables are already created?
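For reference, EF BB BF is the UTF-8 encoding of U+FEFF, the byte order mark, so the sequence quoted above is two of them back to back. A short sketch to confirm this, using only the standard library (the variable names are illustrative):

```python
import codecs

# The sequence reported in the message, as raw bytes.
sequence = b"\xef\xbb\xbf\xef\xbb\xbf"

# codecs.BOM_UTF8 is the UTF-8 encoding of U+FEFF (byte order mark).
bom_count = sequence.count(codecs.BOM_UTF8)

# Decode with plain "utf-8" so the marks are kept rather than stripped.
decoded = sequence.decode("utf-8")
code_points = [f"U+{ord(ch):04X}" for ch in decoded]
print(bom_count, code_points)
```

Like the character in the original error, each mark is a 3-byte sequence, so it does not by itself need utf8mb4.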
Thanks.
-Dave
On Mon, 8 May 2023, at 15:21, Dave Hall via Mailman-users wrote:
I did check my MariaDB (as Mark hinted) and found that it is set to utf8 rather than utf8mb4. I'd be willing to change this, but would I also have to do something to update the Mailman databases, since those tables are already created?
I found this mentioned in a post by thoralf from 2020. Quoting from https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/message/...
I've checked - database, table and columns were utf8mb4 already, whereas the connection to the db apparently not. Adding 'charset': 'utf8mb4' to the options-dict in the databases-variable in /etc/mailman3/mailman-web.py fixed the issue - django passes this on to mysqlclient. case closed :)
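As a rough illustration of the change described in that quote, the database entry in a Django settings file such as /etc/mailman3/mailman-web.py might look like the fragment below. Only the `'charset': 'utf8mb4'` option comes from the quoted post; the database name is taken from the error message earlier in the thread, and the user, password and host values are placeholders.

```python
# Fragment of a Django settings file; all values except OPTIONS["charset"]
# are placeholders for illustration.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "mailman3web",      # database name seen in the error message
        "USER": "mailman3web",      # placeholder
        "PASSWORD": "change-me",    # placeholder
        "HOST": "localhost",        # placeholder
        "OPTIONS": {
            # Passed through to the MySQL driver so the connection uses utf8mb4.
            "charset": "utf8mb4",
        },
    }
}
```

The web service would need to be restarted for the new connection setting to take effect.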
I can confirm this (running MM3 on Debian 11).
-- -- Andreas
:-)
Hello.
This is a follow-on to my earlier post about hyperkitty_import Unicode Errors. I am progressing through a migration from Mailman 3.2.1 to Mailman 3.3.3 based on Debian packages. Since I have 3.2.1 instances running in production, I've created new databases for the 3.3.3 instances. I copied the Mailman3 DB to the new database and allowed the install process to upgrade the DB Schema. For the Mailman3-Web DB, due to the codepage issues, I did not initially copy the content. My thought was that I would try to re-import my v2.1 archives into a new database that has the correct Unicode settings.
However, this places me at square one with respect to users, permissions, etc.
What I would really like to do is to drop all of the previous archives and start fresh. Looking at the table list for the database, it would appear that all of the archive content is stored in tables with names that start with 'hyperkitty_'. I'm wondering if I could just wipe the content of these tables, or exclude them from a backup, and then start fresh.
Alternatively, if I restore the existing archives to the new database with the correct Unicode settings, could I re-run the hyperkitty_import tool and get it to pick up all of the messages it skipped due to the Unicode problem?
Please advise.
Thanks.
-Dave
-- Dave Hall Binghamton University kdhall@binghamton.edu
On 5/16/23 08:49, Dave Hall via Mailman-users wrote:
What I would really like to do is to drop all of the previous archives and start fresh. Looking at the table list for the database, it would appear that all of the archive content is stored in tables with names that start with 'hyperkitty_'. I'm wondering if I could just wipe the content of these tables, or exclude them from a backup, and then start fresh.
It is correct that all the HyperKitty tables start with hyperkitty_. I *think* it would be safe to just drop the content of these tables, but I won't guarantee that there aren't django_mailman3_ tables with foreign keys into the hyperkitty tables.
As a superuser logged in to hyperkitty, you should see a Delete Archive button at the top level overview for a list. It would be safer to delete the archives that way.
Alternatively, if I restore the existing archives to the new database with the correct Unicode settings, could I re-run the hyperkitty_import tool and get it to pick up all of the messages it skipped due to the Unicode problem?
This will work too. You will need to specify the --since option to hyperkitty_import because the default for since is the date of the newest message in the archive. The import will not reimport a message whose Message-ID: is already in the archive.
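A re-import along these lines could be scripted as below. This is a hedged sketch: the helper `build_import_command` is hypothetical, the `django-admin` entry point and the date format passed to `--since` are assumptions that may differ on a Debian install, and the list address and mbox path are placeholders. Only the `hyperkitty_import` command and its `--since` option come from the advice above; `-l` is the list-address option.

```python
import shlex


def build_import_command(list_address, mbox_path, since="1990-01-01",
                         manage_cmd=("django-admin",)):
    """Build an argv list for re-running hyperkitty_import with an explicit --since."""
    return [
        *manage_cmd, "hyperkitty_import",
        "-l", list_address,
        # A date older than anything in the mbox, so skipped messages are retried.
        "--since", since,
        mbox_path,
    ]


cmd = build_import_command("mylist@example.com", "/path/to/mylist.mbox")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Choosing a `since` date earlier than the oldest message makes the import consider every message again; those whose Message-ID is already archived are skipped, as noted above.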
-- Mark Sapiro <mark@msapiro.net>
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan
participants (5): Andreas Schamanek, Dave Hall, kdhall@binghamton.edu, Mark Sapiro, Stephen J Turnbull