Re: ASCII control codes in sequences of multibyte character sets

From: Steffen <sdaoden_at_gmail.com>
Date: Mon, 02 Sep 2013 13:54:09 +0200

Richard Wordingham <richard.wordingham_at_ntlworld.com> wrote:
 |> Hello character plus experts,
 |> i'm wondering whether there are any multibyte character sets known
 |> which use the numerical values of ASCII control characters that
 |> are vital to Unix/POSIX (plus) as part of multibyte sequences?
 |> In particular U+000A and U+000D?
 |
 |Infamously, UTF-16, as implied by Doug's mention of SCSU.
 |
 |If you count fixed length (>1) character sets as multibyte, you can add
 |UCS-2 and UTF-32.

Yes, but no :), i would count those as multi-octet rather than
multibyte character sets. They cannot be handled by Unix tools in
general anyway, of course, because of the NUL bytes that are almost
certain to occur; these interfere with the usual ISO C / POSIX
string handling, which treats NUL as a terminator.
And yes, i'm learning on this list.
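
A minimal Python sketch of the NUL problem, for illustration only. The
helper c_string_view is hypothetical; it merely imitates what a
NUL-terminated C string function such as strlen(3) would see of a
buffer.

```python
def c_string_view(data: bytes) -> bytes:
    """Return the part of data that precedes the first NUL byte,
    i.e. what NUL-terminated C string handling would treat as the string."""
    end = data.find(b"\x00")
    return data if end == -1 else data[:end]


text = "hi\n"
utf8 = text.encode("utf-8")        # 68 69 0A
utf16 = text.encode("utf-16-le")   # 68 00 69 00 0A 00
utf32 = text.encode("utf-32-le")   # 68 00 00 00 69 00 00 00 0A 00 00 00

print(c_string_view(utf8))    # the whole text survives
print(c_string_view(utf16))   # cut off after the first byte
print(c_string_view(utf32))   # cut off after the first byte
```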

 |Richard.

--steffen

attached mail follows:


On Fri, 30 Aug 2013 22:23:14 +0200
Steffen "Daode" Nurpmeso <sdaoden_at_gmail.com> wrote:

> Hello character plus experts,
> i'm wondering whether there are any multibyte character sets known
> which use the numerical values of ASCII control characters that
> are vital to Unix/POSIX (plus) as part of multibyte sequences?
> In particular U+000A and U+000D?

Infamously, UTF-16, as implied by Doug's mention of SCSU.

If you count fixed length (>1) character sets as multibyte, you can add
UCS-2 and UTF-32.

UTF-16 does have the property that every character occupies a multiple
of 2 bytes, so it is well behaved in this respect if one knows to work
with aligned pairs of bytes rather than single bytes, and if one knows
the endianness. Also, at present, U+0A00 and U+0D00 (what a
byte-swapped U+000A and U+000D would be read as) are unassigned.
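
As a rough sketch in Python of the difference between splitting on
bytes and splitting on aligned 2-byte units (split_lines_utf16 is a
hypothetical helper, and the byte order is assumed to be known in
advance):

```python
def split_lines_utf16(data: bytes, byteorder: str = "big") -> list:
    """Split UTF-16 data at U+000A, comparing aligned 2-byte units only."""
    lf = (0x000A).to_bytes(2, byteorder)
    lines, start = [], 0
    for i in range(0, len(data) - 1, 2):
        if data[i:i + 2] == lf:
            lines.append(data[start:i])
            start = i + 2
    lines.append(data[start:])
    return lines


# U+010A, U+000A, U+0A0A: the byte 0x0A occurs four times,
# but only one of those occurrences is a line feed.
data = "\u010a\n\u0a0a".encode("utf-16-be")   # 01 0A 00 0A 0A 0A

naive = data.split(b"\n")            # byte-wise split, tears characters apart
aligned = split_lines_utf16(data)    # unit-wise split, characters stay whole

print(len(naive), len(aligned))      # 5 2
print([line.decode("utf-16-be") for line in aligned])
```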

Note that the old belief that U+FFFE would not occur externally to an
application has been decreed a fallacy, so an apparent U+FEFF or U+FFFE
at the start of a file from an external source only indicates the
endianness if one knows that the file is encoded in the UTF-16 encoding
scheme, as opposed to the UTF-16LE or UTF-16BE encoding scheme.
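
To illustrate with Python's codecs (a sketch only; the sample bytes are
made up), the same leading FF FE is a byte order mark under the UTF-16
encoding scheme, but an ordinary character under UTF-16LE or UTF-16BE:

```python
raw = b"\xff\xfe" + "A".encode("utf-16-le")   # FF FE 41 00

as_utf16 = raw.decode("utf-16")        # BOM consumed, selects little-endian
as_utf16le = raw.decode("utf-16-le")   # U+FEFF kept as a character
as_utf16be = raw.decode("utf-16-be")   # U+FFFE, then U+4100

for name, value in (("UTF-16", as_utf16),
                    ("UTF-16LE", as_utf16le),
                    ("UTF-16BE", as_utf16be)):
    print(name, [f"U+{ord(ch):04X}" for ch in value])
```

All three decodes succeed, so the bytes alone do not say which reading
was intended.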

For UTF-32, reversing the bytes of a C0 control character (other than
U+0000) would yield an invalid byte sequence.
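
A small Python check of this, as a sketch (both helper names are
hypothetical). A byte-swapped C0 control lands at 0x01000000 or above,
far beyond the last code point U+10FFFF; U+0000 is the one exception,
since it swaps to itself.

```python
def swapped_code_point(cp: int) -> int:
    """Value obtained by reading the 4 UTF-32 bytes of cp in the wrong order."""
    return int.from_bytes(cp.to_bytes(4, "big"), "little")


def decodes_when_swapped(ch: str) -> bool:
    """True if UTF-32LE bytes of ch still decode when read as UTF-32BE."""
    try:
        ch.encode("utf-32-le").decode("utf-32-be")
        return True
    except UnicodeDecodeError:
        return False


print(hex(swapped_code_point(0x0A)))   # 0xa000000
print(decodes_when_swapped("\n"))      # False
print(decodes_when_swapped("\r"))      # False
```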

Richard.
Received on Mon Sep 02 2013 - 06:55:58 CDT
