Re: Is the binaryness/textness of a data format a property? from Julian Bradfield via Unicode on 2020-03-21 (Unicode Mail List Archive)

From: Julian Bradfield via Unicode <unicode_at_unicode.org>
Date: Sat, 21 Mar 2020 20:38:24 +0000 (GMT)

On 2020-03-21, Eli Zaretskii via Unicode <unicode_at_unicode.org> wrote:
>> Date: Sat, 21 Mar 2020 11:13:40 -0600
>> From: Doug Ewell via Unicode <unicode_at_unicode.org>
>>
>> Adam Borowski wrote:
>>
>> > Also, UTF-8 can carry more than Unicode -- for example, U+D800..U+DFFF
>> > or U+110000..U+7FFFFFFF (or possibly even up to 2³⁶ or 2⁴²), which has
>> > its uses but is not well-formed Unicode.
>>
>> I'd be interested in your elaboration on what these uses are.
>
> Emacs uses some of that for supporting charsets that cannot be mapped
> into Unicode. GB18030 is one example of such charsets. The internal
> representation of characters in Emacs is UTF-8, so it uses 5-byte
> UTF-8-like sequences to represent such characters.

My own Unicode adaptation of XEmacs (now more than 10 years old) does the
same, even for charsets that can be mapped into Unicode. To ensure complete
backward compatibility, it distinguishes "legacy" charsets from Unicode
and converts between them only when requested.
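
To make "UTF-8-like sequences" concrete, here is a minimal sketch in Python
of the original 1- to 6-byte UTF-8 bit layout, applied to any integer up to
0x7FFFFFFF without the well-formedness checks that Unicode requires. It is
an illustration only, not the code used by Emacs or XEmacs, whose internal
formats may differ in detail; the name encode_extended_utf8 is hypothetical.

```python
def encode_extended_utf8(cp: int) -> bytes:
    """Encode cp with the original 1..6-byte UTF-8 bit layout.

    Hypothetical helper: unlike a conforming UTF-8 encoder, it accepts
    surrogates (U+D800..U+DFFF) and values above U+10FFFF.
    """
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("value out of range for a 6-byte sequence")
    if cp < 0x80:
        return bytes([cp])  # ASCII is encoded as a single byte
    # (largest value, sequence length, lead-byte prefix)
    for limit, length, prefix in (
        (0x7FF, 2, 0xC0),
        (0xFFFF, 3, 0xE0),
        (0x1FFFFF, 4, 0xF0),
        (0x3FFFFFF, 5, 0xF8),
        (0x7FFFFFFF, 6, 0xFC),
    ):
        if cp <= limit:
            break
    out = bytearray(length)
    # Fill continuation bytes (10xxxxxx) from the last byte backwards.
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (cp & 0x3F)
        cp >>= 6
    out[0] = prefix | cp  # remaining high bits go into the lead byte
    return bytes(out)


if __name__ == "__main__":
    for value in (0x20AC, 0xD800, 0x110000, 0x200000):
        print(hex(value), encode_extended_utf8(value).hex(" "))
```

For values that are valid Unicode scalar values the result matches ordinary
UTF-8; for the others (a surrogate, 0x110000, or anything needing 5 or 6
bytes) a strict UTF-8 decoder will reject the bytes, which is the sense in
which such data "is not well-formed Unicode".
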
Received on Sat Mar 21 2020 - 15:38:47 CDT
