Henrik Rasmussen writes:
It seems that this issue is still unsolved. Is there any estimated date for a fix? https://gitlab.com/mailman/hyperkitty/issues/155 .
Problem-oriented executive summary:
Why do you think this is a problem that Mailman should solve? What is your desired treatment of this data?
tl;dr for Unicode wonks and wonkabees:
My personal take is similar to Simon's.
Unicode NULs are rather dangerous, because any time you convert to an ASCII-compatible encoding and store it where a C routine can get at it, you risk data corruption when a C string function decides that the NUL is end-of-string and ignores the tail of the string. To deal with NULs embedded in an array of C chars, you have to forgo the whole suite of str*[1], *printf, etc. stdlib functions and either (A) dive down to the mem* functions, keeping accurate char[] lengths on the side, or (B) create a complete suite of array-handling functions that deal with this for you.
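To make the truncation concrete, here's a minimal Python sketch that uses ctypes to look at one byte array the way C would. The payload is made up for illustration. The buffer's `.value` attribute follows C string semantics (stop at the first NUL), while `.raw` plus a separately kept length corresponds to strategy (A).

```python
import ctypes

# Hypothetical payload with an embedded NUL (not taken from the bug report).
payload = b"Subject: hello\x00 world"

# Copy the bytes into a C char array; ctypes appends a terminating NUL.
buf = ctypes.create_string_buffer(payload)

# C-string view: everything up to (not including) the first NUL.
as_c_string = buf.value

# Strategy (A): read all the bytes, using a length kept on the side.
as_raw_bytes = buf.raw[:len(payload)]

print(as_c_string)
print(as_raw_bytes)
```

The tail " world" is still in memory, but anything that treats the buffer as a C string never sees it.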
I agree with Simon's guess that you probably have corrupted data or metadata that you're passing to HyperKitty. (By "corrupted metadata" I mean that you've got binary data labelled as "text", or that you have text encoded in something like UTF-16, which normally includes 0x00 *bytes* as a component of non-NUL *characters*, but labelled with an ASCII-compatible encoding such as ISO 8859-1. The latter is a common hack to insert binary data, or "any-encoded" text, into text databases.) This guess rests on the long tradition of avoiding NULs in string data, which stems from both their standard interpretation in ASCII (memory padding which is to be *ignored*[2]) and the danger that they pose to C programs that use the C stdlib to handle "character" data.
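Here's a small Python sketch of the mislabelling case, assuming text that was encoded as UTF-16 (little-endian, no BOM) and then decoded as ISO 8859-1. The sample string is made up; the point is only that perfectly ordinary characters turn into NUL-laden text once the label is wrong.

```python
# Ordinary text with no NUL characters in it.
text = "Hi"

# In UTF-16-LE each of these characters is two bytes, one of which is 0x00.
utf16_bytes = text.encode("utf-16-le")

# Decoding with the wrong (ASCII-compatible) label turns every 0x00 byte
# into a NUL character.
mislabelled = utf16_bytes.decode("iso-8859-1")
nul_count = mislabelled.count("\x00")

# Decoding with the right label recovers the original text.
correct = utf16_bytes.decode("utf-16-le")

print(utf16_bytes, nul_count, repr(mislabelled), repr(correct))
```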
It's true that both the Mailman 3 suite and Django are written in Python, which has implemented NUL-handling strategy (B) for us. However, the Mailman suite eventually calls into a few C (or C++) libraries, especially the database backends, as in this issue. We can't do anything about those backends, of course. Filtering the incoming data is something we could do, I guess, but as Simon points out that would be pretty expensive, and silently stripping erroneous NULs would likely put corrupt data into the databases.
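For illustration only, here's a sketch of what a check that is loud rather than silent might look like in Python. The names `find_nuls` and `reject_nuls` are hypothetical; nothing like this exists in Mailman or HyperKitty as far as this post is concerned. It also shows that a Python str carries NULs without any trouble, which is why the problem only surfaces further down the stack.

```python
from typing import List


def find_nuls(text: str) -> List[int]:
    """Return the index of every NUL character in text."""
    return [i for i, ch in enumerate(text) if ch == "\x00"]


def reject_nuls(text: str) -> str:
    """Return text unchanged, or raise if it contains any NUL.

    Raising (instead of stripping) keeps possibly-corrupt data out of
    the database and tells the operator where to look.
    """
    positions = find_nuls(text)
    if positions:
        raise ValueError(f"NUL character(s) at offsets {positions}")
    return text


# Python itself is happy to hold a NUL inside a str.
sample = "a\x00b"
print(len(sample), find_nuls(sample))
```

Note that this scans every character of every string, which is the cost Simon was pointing at.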
If you're really sure that you have valid ASCII (or Unicode) NULs embedded in your text data, you could try configuring an alternative backend (both for Mailman core and for Django) that handles NULs as valid characters. I don't know if there are any, let alone whether we support any, though.
Footnotes:
[1] Including the strn* functions! They protect against buffer *overruns*, but NUL is still end-of-string.
[2] ECMA-48, "Control Functions for Coded Character Sets", is I believe identical to ISO 6429, except that ECMA standards are free to read and ISO charges about $100 for this one. Here's what it says about NUL:
    8.3.88 NUL - NULL
    Notation: (C0)
    Representation: 00/00
    NUL is used for media-fill or time-fill. NUL characters may be
    inserted into, or removed from, a data stream without affecting
    the information content of that stream, but such action may affect
    the information layout and/or the control of equipment.
Unicode doesn't say how to interpret NUL at all, but does recommend ISO 6429 as one good way to interpret control characters not otherwise dealt with in Unicode (and it's the only way that Unicode mentions).