Re: 32'nd bit & UTF-8

From: Arcane Jill (arcanejill@ramonsky.com)
Date: Fri Jan 21 2005 - 02:42:51 CST


    -----Original Message-----
    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On
    Behalf Of Hans Aberg
    Sent: 20 January 2005 20:47
    To: Antoine Leca; unicode@unicode.org
    Subject: Re: 32'nd bit & UTF-8

    > That already seems to have happened with GNU GCC, which fixes wchar_t to
    > 32 bits.

    And Microsoft Visual C++, which fixes wchar_t to SIXTEEN bits.

    The existence of wchar_t does not imply UTF-32. It does not imply UTF-16. It does
    not even imply Unicode. It's just a type. I quote from a version of stddef.h I
    found on the internet somewhere:

    "wchar_t: Integer type whose range of values can represent distinct
    wide-character codes for all members of the largest character set specified
    among the locales supported by the compilation environment: the null character
    has the code value 0 and each member of the portable character set has a code
    value equal to its value when used as the lone character in an integer
    character constant."

    Width is not specified, nor is encoding, nor is character set.

    gcc is but one platform. There are others.
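
    To see what a given compiler actually chose, something like the following
    C sketch may help. The helper names wchar_bits and report_wchar are my own
    invention for illustration, not part of any library. The only standard
    pieces are sizeof, CHAR_BIT, WCHAR_MAX, and the __STDC_ISO_10646__ macro,
    which an implementation defines when its wchar_t values are ISO 10646 code
    points. Even then, the width tells you nothing certain about UTF-16 versus
    UTF-32.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: width of wchar_t in bits on this implementation. */
static int wchar_bits(void)
{
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}

/* Hypothetical helper: print what this compiler chose for wchar_t.
   The width alone does not identify the encoding or the character set. */
static void report_wchar(void)
{
    /* The C standard only promises that WCHAR_MAX is at least 127. */
    assert(WCHAR_MAX >= 127);

    printf("wchar_t is %d bits wide\n", wchar_bits());
#ifdef __STDC_ISO_10646__
    printf("__STDC_ISO_10646__ is defined: wchar_t holds ISO 10646 code points\n");
#else
    printf("__STDC_ISO_10646__ is not defined: the wchar_t encoding is unspecified\n");
#endif
}
```

    Calling report_wchar() from main on two different compilers is a quick way
    to confirm that the answers differ from platform to platform.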

    >> Hmmm... I don't recall that the Unicode Standard ever specifies that the
    >> Byte Order Mark is *required* to be used anywhere for any purpose. Can you
    >> point me to the place in the standard where this is stated?
    >
    > Several posters have claimed that, most recently Arcane Jill. Check with
    > them. There is supposed to be a difference between a UTF-8 encoding not
    > requiring a BOM and a UTF-8 process requiring it.

    I am not a Unicode expert here. Like you, I'm a programmer. I mostly lurk here.
    I've just lurked here long enough to pick up a few things, but folk like ...
    well folks who actually get their names into Unicode documents ... /those/ guys
    are the real experts, the people to whom to listen. But anyway, from TUS,
    Chapter 15:

    "Systems that use the byte order mark must recognize when an initial U+FEFF
    signals the byte order. In those cases, it is not part of the textual
    content and should be removed before processing, because otherwise it may
    be mistaken for a legitimate zero width no-break space."

    Of course, Chapter 15 (Special Areas and Format Characters) is not Chapter 3
    (Conformance), so maybe it's not a conformance requirement as such? I don't
    know enough to know the difference.
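
    For what it's worth, here is roughly what "recognize an initial U+FEFF and
    remove it" might look like for UTF-8 input, as a minimal C sketch. The
    function name utf8_bom_length is hypothetical. The one fact it relies on
    is that U+FEFF encodes in UTF-8 as the three bytes EF BB BF. It only looks
    at the very start of the buffer, since a U+FEFF anywhere else is ordinary
    content.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns how many leading bytes to skip.
   3 if the buffer starts with the UTF-8 form of U+FEFF (EF BB BF),
   0 otherwise. Only the initial position is examined. */
static size_t utf8_bom_length(const unsigned char *buf, size_t len)
{
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };

    /* A null pointer is only acceptable for an empty buffer. */
    assert(buf != NULL || len == 0);

    if (len >= sizeof bom && memcmp(buf, bom, sizeof bom) == 0)
        return sizeof bom;
    return 0;
}
```

    A caller would then process buf + utf8_bom_length(buf, len) onwards.
    Whether a process is obliged to do this at all is exactly the question I
    can't answer.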

    Jill


