RE: Wide Characters in Windows and UTF16

From: Rick Cameron (Rick.Cameron@businessobjects.com)
Date: Thu Aug 12 2004 - 12:36:48 CDT

    Hi, Markus

    Hardly misleading! You can, of course, view UTF-16 data in memory as an
    array of 16-bit code units. But you can also view it as an array of bytes.
    This may not be a good idea, but it is occasionally necessary.

    When a UTF-16 string is treated as an array of bytes, it's supremely
    important to know the byte order. The OP asked about byte order, and seemed
    to me to be referring to data in memory. Hence my answer.
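
    For instance, here's a minimal C sketch (using uint16_t instead of WCHAR
    so it compiles anywhere) that dumps the in-memory bytes of a two-code-unit
    string; on a little-endian machine such as x86 it prints "41 00 42 00",
    on a big-endian one "00 41 00 42":

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
            uint16_t s[] = { 0x0041, 0x0042 };  /* "AB" as UTF-16 code units */
            const unsigned char *bytes = (const unsigned char *)s;

            /* Reinterpret the code units as raw bytes; the order in which
               the two bytes of each unit appear depends entirely on the
               CPU's byte order. */
            for (size_t i = 0; i < sizeof s; ++i)
                printf("%02X ", bytes[i]);
            putchar('\n');
            return 0;
        }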

    Cheers

    - rick

    -----Original Message-----
    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
    Behalf Of Markus Scherer
    Sent: August 12, 2004 9:19
    To: unicode
    Subject: Re: Wide Characters in Windows and UTF16

    Rick Cameron wrote:
    > Microsoft Windows uses little-endian byte order on all platforms.
    > Thus, on Windows UTF-16 code units are stored in little-endian byte
    > order in memory.
    >
    > I believe that some Linux systems are big-endian and some
    > little-endian. I think Linux follows the native byte order of the
    > CPU. Presumably UTF-16 would be big-endian or little-endian
    > accordingly.

    This is somewhat misleading. For internal processing, where we are talking
    about the UTF-16 encoding form (quite different from the external encoding
    _scheme_ of the same name), we don't have strings of bytes but strings of
    16-bit units (WCHAR in Windows). Program code operating on such strings
    could not care less what endianness the CPU uses. Endianness is only an
    issue when the text gets byte-serialized, as is done for the external
    encoding schemes (and usually by a conversion service).
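
    A minimal sketch of what such byte serialization amounts to, writing
    code units out as the UTF-16BE encoding scheme (utf16be_serialize is an
    illustrative name, not a real conversion API):

        #include <stdint.h>
        #include <stddef.h>

        /* Serialize 16-bit code units as UTF-16BE bytes. The shifts fix
           the byte order explicitly, so the output is identical no matter
           what endianness the CPU uses. */
        void utf16be_serialize(const uint16_t *units, size_t count,
                               unsigned char *out /* holds 2*count bytes */)
        {
            for (size_t i = 0; i < count; ++i) {
                out[2*i]     = (unsigned char)(units[i] >> 8);   /* high byte */
                out[2*i + 1] = (unsigned char)(units[i] & 0xFF); /* low byte  */
            }
        }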

    markus


