24 Apr
2018
5:19 a.m.
Mark Sapiro writes:
UnicodeDecodeError: 'gb2312' codec can't decode byte 0x87 in position 2: illegal multibyte sequence
I'm pretty sure 0x87 is reserved by the later GBK standard (implemented as an encoding in GB18030) as a prefix for extensions.
This is a common problem with Chinese and Japanese encodings: junky email clients label text in modern encodings with the "traditional" names of older encodings that have smaller repertoires. I wonder if we shouldn't just promote all gb* to gb18030 and all Shift JIS to the most recent version, as in the WHATWG Encoding Standard.
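To make the promotion idea concrete, here is a rough Python sketch, not a proposed patch. The names PROMOTIONS, promoted_charset and decode_promoted are made up for illustration, and mapping Shift JIS to cp932 is only an assumption about what "most recent version" would mean in practice. It relies on nothing beyond the codecs that ship with Python.

```python
import codecs

# Hypothetical promotion table: canonical codec name -> superset codec.
# The Shift JIS entry is an assumption; another superset could be chosen.
PROMOTIONS = {
    "gb2312": "gb18030",
    "gbk": "gb18030",
    "shift_jis": "cp932",
}


def promoted_charset(declared):
    """Return the codec to decode with, given the declared charset name."""
    try:
        # codecs.lookup() normalizes aliases such as "GB2312" or "sjis".
        canonical = codecs.lookup(declared).name
    except LookupError:
        # Unknown charset: leave it alone and let the caller deal with it.
        return declared
    return PROMOTIONS.get(canonical, canonical)


def decode_promoted(data, declared):
    """Decode bytes using the promoted codec instead of the declared one."""
    return data.decode(promoted_charset(declared))


if __name__ == "__main__":
    # Lead byte 0x87 is outside the GB2312 range but valid in GBK/GB18030.
    sample = b"\x87\x40"
    print(repr(decode_promoted(sample, "gb2312")))
```

Since the promoted codecs are meant to be supersets of the declared ones, text that really is GB2312 or plain Shift JIS should decode the same way as before; the difference only shows up for bytes the declared codec rejects.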
If nobody says "that's really stupid", I'll write up an RFE later this week. I think there's already a module on PyPI we can use to implement it.
Steve