RE: ASCII control codes in sequences of multibyte character sets

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Fri, 30 Aug 2013 21:11:29 +0000

Steffen,

Sure. You encounter this problem with any multi-byte EBCDIC-based
character encoding, and in fact with any single-byte EBCDIC-based
encoding as well. The EBCDIC control code that corresponds to a line
feed is either 0x15 or 0x25, depending on the revision. But you wouldn't
ordinarily run into EBCDIC-based data in a Unix environment these days.
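
To make the EBCDIC point concrete, here is a minimal sketch, assuming
Python 3 and its bundled cp037 codec (one EBCDIC code page, chosen only
for illustration; other EBCDIC code pages may assign these two controls
differently). It shows that a line feed does not come out as byte 0x0A:

```python
# Assumption: cp037 stands in for "an EBCDIC-based encoding" here.
EBCDIC_CODEC = "cp037"

# Encode the ASCII line feed and carriage return into EBCDIC.
lf_byte = "\n".encode(EBCDIC_CODEC)
cr_byte = "\r".encode(EBCDIC_CODEC)

# Decode the two candidate "newline" byte values mentioned above.
decoded_15 = b"\x15".decode(EBCDIC_CODEC)
decoded_25 = b"\x25".decode(EBCDIC_CODEC)

print("LF encodes to 0x%s" % lf_byte.hex())
print("CR encodes to 0x%s" % cr_byte.hex())
print("0x15 decodes to U+%04X" % ord(decoded_15))
print("0x25 decodes to U+%04X" % ord(decoded_25))
```

A tool that splits such data on byte 0x0A would therefore not find the
line boundaries it expects.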

All well-behaved multi-byte character encodings of any current interest
avoid the control code values 0x00..0x1F in the byte sequences of
multi-byte graphic characters. And essentially all Unix protocols treat
0x0D and 0x0A in a uniform way based on their well-established ASCII
usage. (Although it is theoretically possible, even if generally a bad
idea, to use other sets of control functions for the 0x00..0x1F control
codes in controlled contexts like terminal display control.)
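
The claim about control code values can be spot-checked. The following
is a rough sketch, assuming Python 3 and the codecs it ships with; the
helper name multibyte_uses_c0 and the choice of codecs are illustrative
only. It encodes every BMP code point a codec supports and reports
whether any multi-byte result contains a byte in 0x00..0x1F. UTF-16LE is
included purely as a contrast, since it is not byte-oriented in this
sense.

```python
C0_CONTROLS = frozenset(range(0x20))  # byte values 0x00..0x1F


def multibyte_uses_c0(codec, limit=0x10000):
    """Return True if any multi-byte encoded character contains a C0 byte."""
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points cannot be encoded on their own
        try:
            encoded = chr(cp).encode(codec)
        except UnicodeEncodeError:
            continue  # character not representable in this codec
        if len(encoded) > 1 and C0_CONTROLS.intersection(encoded):
            return True
    return False


# Hypothetical selection of codecs; all are part of the standard library.
CODECS = ("utf-8", "shift_jis", "euc_jp", "euc_kr", "gbk", "big5", "utf-16-le")

for codec in CODECS:
    print(codec, multibyte_uses_c0(codec))
```

For example, U+010A encodes in UTF-16LE as the bytes 0x0A 0x01, which is
why byte-oriented Unix tools cannot treat UTF-16 data as text without
conversion.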

See Section 5.8 in the Unicode Standard for lots of information on
this problem in general:

http://www.unicode.org/versions/Unicode6.2.0/ch05.pdf

For way more information than you might actually want on all the
multi-byte character sets of possible interest here, you could start from:

http://en.wikipedia.org/wiki/ISO/IEC_2022

and follow the links to the discussions of the various East Asian
character sets. Or you could get yourself a copy of Ken Lunde's
excellent book, CJKV Information Processing.

--Ken

> Hello character plus experts,
> I'm wondering whether there are any multibyte character sets known
> which use the numerical values of ASCII control characters that
> are vital to Unix/POSIX (plus) as part of multibyte sequences?
> In particular U+000A and U+000D?
> Thank you very much in advance (and don't forget to have a nice
> weekend, will ya?)
>
> --steffen
Received on Fri Aug 30 2013 - 16:13:51 CDT
