Re: ASCII control codes in sequences of multibyte character sets

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sat, 31 Aug 2013 20:19:49 +0100

On Fri, 30 Aug 2013 22:23:14 +0200
Steffen "Daode" Nurpmeso <sdaoden_at_gmail.com> wrote:

> Hello character plus experts,
> I'm wondering whether there are any multibyte character sets known
> which use the numerical values of ASCII control characters that
> are vital to Unix/POSIX (plus) as part of multibyte sequences?
> In particular U+000A and U+000D?

Infamously, UTF-16, as implied by Doug's mention of SCSU.

If you count fixed length (>1) character sets as multibyte, you can add
UCS-2 and UTF-32.

UTF-16 does have the property that every character occupies a multiple
of 2 bytes, so it is well behaved in this respect if one knows to work
with aligned pairs of bytes rather than single bytes, and if one knows
the endianness. Also, at present, U+0A00 and U+0D00 (what U+000A and
U+000D become when read with the wrong byte order) are unassigned.
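
A minimal Python sketch of the alignment point, assuming big-endian
UTF-16 without a BOM. The function names are hypothetical and chosen
only for illustration. A byte-wise search for 0x0A also hits the low
byte of U+010A, while a search over aligned 16-bit code units does not:

```python
# U+010A contains the byte 0x0A in UTF-16 but is not a line feed.
text = "\u010Aa\nb"
data = text.encode("utf-16-be")  # bytes: 01 0A 00 61 00 0A 00 62

def naive_lf_offsets(buf):
    """Byte offsets of every 0x0A byte, ignoring code-unit alignment."""
    return [i for i, b in enumerate(buf) if b == 0x0A]

def aligned_lf_offsets(buf, byteorder="big"):
    """Byte offsets of aligned 16-bit code units equal to U+000A."""
    return [
        i for i in range(0, len(buf) - 1, 2)
        if int.from_bytes(buf[i:i + 2], byteorder) == 0x000A
    ]

print(naive_lf_offsets(data))    # includes a false hit inside U+010A
print(aligned_lf_offsets(data))  # only the real line feed

# Reading a little-endian line feed as big-endian yields U+0A00.
misread = "\n".encode("utf-16-le").decode("utf-16-be")
print(ascii(misread))
```

The aligned search is only correct when the byte order passed in matches
the data, which is the second condition stated above.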

Note that the old belief that U+FFFE would not occur externally to an
application has been decreed a fallacy, so an apparent U+FEFF or U+FFFE
at the start of a file from an external source only indicates the
endianness if one knows that the file is encoded in the UTF-16 encoding
scheme, as opposed to the UTF-16LE or UTF-16BE encoding scheme.
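
To illustrate the distinction, here is a small sketch using Python's
built-in codecs, which is only one implementation's behaviour and not a
statement about what any particular file means. The same four bytes
decode differently depending on which encoding scheme is assumed:

```python
# Bytes FF FE, then "A" as a little-endian 16-bit code unit.
data = b"\xff\xfeA\x00"

# UTF-16 encoding scheme: the leading bytes are taken as a BOM that
# selects little-endian, and they are not part of the text.
as_utf16 = data.decode("utf-16")

# UTF-16LE encoding scheme: there is no BOM, so the leading code unit
# is the character U+FEFF and stays in the text.
as_utf16le = data.decode("utf-16-le")

# UTF-16BE encoding scheme: the same bytes read as U+FFFE, U+4100.
as_utf16be = data.decode("utf-16-be")

print(ascii(as_utf16), ascii(as_utf16le), ascii(as_utf16be))
```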

For UTF-32, reversing the bytes of a C0 control character other than
U+0000 would yield an invalid byte sequence, since the resulting value
lies above U+10FFFF.
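
A short Python check of this, as a sketch. The helper name
reversed_value is hypothetical. U+0000 is left out because its four
bytes are all zero and reversing them changes nothing:

```python
UNICODE_MAX = 0x10FFFF

def reversed_value(cp):
    """Value obtained by reading a 4-byte big-endian code point as little-endian."""
    return int.from_bytes(cp.to_bytes(4, "big"), "little")

# C0 controls U+0001..U+001F: the swapped value is always out of range.
all_invalid = all(reversed_value(cp) > UNICODE_MAX for cp in range(0x01, 0x20))
print(hex(reversed_value(0x0A)), all_invalid)

# Python's own UTF-32 decoder rejects the byte-swapped line feed.
try:
    "\n".encode("utf-32-be").decode("utf-32-le")
    rejected = False
except UnicodeDecodeError:
    rejected = True
print(rejected)
```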

Richard.
Received on Sat Aug 31 2013 - 14:22:36 CDT
