Re: UTF-16 inside UTF-8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Nov 04 2003 - 12:29:20 EST


    From: "David E. Hollingsworth" <deh@fastanimals.com>
    > I believe this is described pretty well in sections 3.8 & 3.9 (plus
    > conformance requirement C12b) of Unicode 4.0.
    >
    > Surrogate pairs are for UTF-16 only. For UTF-8 & UTF-32, surrogates
    > (pairs or otherwise) are ill-formed code unit sequences, and
    > conformant processes must treat them as erroneous.

    Well, it is indeed forbidden to encode surrogates in UTF-8, but not in
    CESU-8, where encoding the two surrogates of a pair as separate 3-byte
    sequences is the only allowed method to encode characters outside the BMP.
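
    To make the difference concrete, here is a minimal Python sketch of a
    CESU-8 encoder. The function name cesu8_encode is only illustrative (it is
    not a standard library call), and the sketch assumes its input contains no
    unpaired surrogates:

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (hypothetical helper, for illustration only)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code points are encoded exactly as in UTF-8 (1 to 3 bytes).
            # A lone surrogate raises UnicodeEncodeError here.
            out += ch.encode("utf-8")
        else:
            # Supplementary code points are first split into a UTF-16
            # surrogate pair, then each surrogate gets its own 3-byte form.
            v = cp - 0x10000
            high = 0xD800 + (v >> 10)
            low = 0xDC00 + (v & 0x3FF)
            for s in (high, low):
                out += bytes([
                    0xE0 | (s >> 12),
                    0x80 | ((s >> 6) & 0x3F),
                    0x80 | (s & 0x3F),
                ])
    return bytes(out)


# U+10400 is outside the BMP: 4 bytes in UTF-8, 6 bytes in CESU-8.
print("\U00010400".encode("utf-8").hex(" "))   # f0 90 90 80
print(cesu8_encode("\U00010400").hex(" "))     # ed a0 81 ed b0 80
```

    For text restricted to the BMP the two encodings give identical bytes.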

    Still, CESU-8 has ill-formed sequences of its own:
    - those that use encoded sequences of more than 3 bytes for code points
    outside the BMP
    - those that encode unpaired surrogate code points.
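
    The two cases above can be sketched as a small checker. This is only an
    illustration: check_cesu8 is a made-up name, and it relies on Python's
    "surrogatepass" error handler so that the UTF-8 decoder accepts 3-byte
    encoded surrogates. Bytes that are malformed in other ways still raise
    UnicodeDecodeError.

```python
def check_cesu8(data: bytes) -> list:
    """Report the two kinds of CESU-8 ill-formedness (illustrative sketch)."""
    # "surrogatepass" keeps each 3-byte encoded surrogate as its own code point.
    chars = data.decode("utf-8", "surrogatepass")
    problems = []
    i = 0
    while i < len(chars):
        cp = ord(chars[i])
        if cp > 0xFFFF:
            # Only a 4-byte sequence can produce a supplementary code point.
            problems.append(f"4-byte sequence for U+{cp:04X}")
        elif 0xD800 <= cp <= 0xDBFF:
            nxt = ord(chars[i + 1]) if i + 1 < len(chars) else 0
            if 0xDC00 <= nxt <= 0xDFFF:
                i += 1  # well-formed pair: skip the low surrogate too
            else:
                problems.append(f"unpaired high surrogate U+{cp:04X}")
        elif 0xDC00 <= cp <= 0xDFFF:
            problems.append(f"unpaired low surrogate U+{cp:04X}")
        i += 1
    return problems


print(check_cesu8(bytes.fromhex("eda081edb080")))  # paired surrogates
print(check_cesu8(bytes.fromhex("f0909080")))      # 4-byte UTF-8 form
print(check_cesu8(bytes.fromhex("eda081")))        # lone high surrogate
```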

    This second form, however, is quite common in legacy applications that allow
    unpaired surrogate code points and handle them as if they encoded individual
    characters. This is allowed for internal string handling (typical in Java,
    and in most C/C++ applications that map wchar_t to a 16-bit integer), but
    text that contains unpaired surrogate code points should not be interchanged
    as CESU-8.
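
    As a rough illustration of that last point, an application that keeps
    16-bit code units internally could run a check like the one below before
    writing text out. The function name and the use of a plain list of
    integers to stand in for a 16-bit string are assumptions made for this
    sketch.

```python
def has_unpaired_surrogate(units) -> bool:
    """Return True if a sequence of 16-bit code units has a lone surrogate."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed directly by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no high surrogate before it.
            return True
        i += 1
    return False


internal = [0x0041, 0xD801]          # fine in memory, not for interchange
print(has_unpaired_surrogate(internal))
print(has_unpaired_surrogate([0x0041, 0xD801, 0xDC00]))
```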


