Re: 32'nd bit & UTF-8

From: Arcane Jill
Date: Fri Jan 21 2005 - 07:32:38 CST

    -----Original Message-----
    From: Philippe Verdy
    Sent: 21 January 2005 13:06
    To: Arcane Jill
    Subject: Re: 32'nd bit & UTF-8

    >Arcane Jill wrote:
    >> The existence of wchar_t does not imply UTF-32. It does imply UTF-16.

    That was a typo of course. It should have read "It does NOT imply UTF-16".

    > I like this definition, but what is interesting here are the phrases
    > "character set" and "supported by the compilation environment".
    > "character set": the definition implies that this is necessarily a
    > *coded* character set, because it makes an equation between what it
    > calls a "character" and an "integer character constant". Unfortunately,
    > the definition of "character" is weak. It does not have the same
    > meaning as the "abstract character" defined in Unicode/ISO/IEC, so it
    > could map to Unicode's "code units". This would make UTF-16 suitable.
    > But if it needs to match with "abstract characters", then there's no
    > choice for a C++ compiler: the integer datatype representing "wchar_t"
    > must be able to contain at least as many distinct values as the ISO/IEC
    > 10646 repertoire, and must contain the value 0.

    Well, wchar_t on Windows is 16 bits wide, and hence /not/ able to contain as
    many distinct values as the ISO/IEC 10646 repertoire. Gotta be code units then.
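    To make the "code units" point concrete, here is a minimal sketch, not
    taken from any standard or library. It uses char16_t as a portable
    stand-in for a 16-bit wchar_t, and to_utf16 is a hypothetical helper name
    chosen for illustration. It assumes its argument is a valid Unicode scalar
    value (at most 0x10FFFF and not itself a surrogate).

```cpp
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Assumption: cp is a valid scalar value (<= 0x10FFFF, not a surrogate);
// no validation is done here.
std::u16string to_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Fits in a single 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Needs a surrogate pair: two 16-bit code units for one character.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    }
    return out;
}
```

    With this sketch, U+0041 comes out as one unit, while U+1D11E comes out as
    the two units 0xD834 0xDD1E, so a 16-bit element holds a code unit rather
    than a whole character.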

    > The definition also does not say that the value 0 will necessarily be
    > the same as a NULL character (U+0000). This depends on the "supported
    > character set" in compile-time locales. There may as well exist a
    > supported encoded charset that maps U+0000 to the integer value -2
    > (because there's no requirement that integer values match ISO/IEC 10646
    > codepoints). The definition relates only to the "null character" i.e.
    > the one that "\0" maps to in string or character constants, but makes
    > no assumption about whether this null matches the ISO10646 NULL (U+0000)
    > character.

    It is fortunate, then, that C was never implemented on the ZX80 or ZX81, for
    which '\0' would have been the SPACE character (U+0020). On the ZX80/81, every
    space would have terminated a string!

    Fun, eh?
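    The ZX80/81 scenario can be sketched roughly as follows. The mapping is a
    toy one for illustration only: SPACE gets code 0 as described above, while
    the codes used for A to Z and the fallback value 15 are assumptions, not a
    faithful copy of the real machine's table. The names to_toy_code,
    encode_toy and c_length are hypothetical, and the letter test assumes an
    ASCII-like host where 'A' to 'Z' are contiguous.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Toy character mapping (an assumption for illustration):
// SPACE -> 0, 'A'..'Z' -> 38..63, anything else -> 15 (placeholder).
unsigned char to_toy_code(char c) {
    if (c == ' ') return 0;
    if (c >= 'A' && c <= 'Z') return static_cast<unsigned char>(38 + (c - 'A'));
    return 15;
}

// Re-encode a host string into the toy character set, one byte per character.
std::string encode_toy(const std::string& text) {
    std::string out;
    for (char c : text) out.push_back(static_cast<char>(to_toy_code(c)));
    return out;
}

// Length as a C string function sees it: it stops at the first 0 byte,
// which in the toy character set is the first SPACE.
std::size_t c_length(const std::string& encoded) {
    return std::strlen(encoded.c_str());
}
```

    Encoding "HELLO WORLD" this way yields 11 bytes, but c_length reports 5,
    because the space between the words is indistinguishable from the
    terminator.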
