Re: Unicode conformant character encodings and us-ascii

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat May 17 2003 - 07:51:29 EDT

    From: "Stefan Persson" <alsjebegrijptwatikbedoel@yahoo.se>
    > Are not BE and LE regarded as different encoding forms, making five
    > encoding forms (UTF-8, UTF-16BE, UTF-16LE, UTF-32BE & UTF-32LE)?

    This was true up to Unicode 3.x, but not since Unicode 4.0.0, which distinguishes encoding forms (serializations of code points into ordered sequences of fixed-width code units) from encoding schemes (serializations of those code units into ordered streams of bytes, taking byte order into account).
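
    To make the two layers concrete, here is a minimal Python sketch for UTF-16. The helper names utf16_code_units and serialize are made up for this illustration and do not come from the standard; the first step is the encoding form (code points to 16-bit code units), the second is the encoding scheme (code units to bytes in a chosen byte order).

```python
import struct

def utf16_code_units(text):
    """Encoding form: map each code point to one or two 16-bit code units."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            units.append(cp)
        else:
            # Supplementary code point: split into a surrogate pair.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
    return units

def serialize(units, big_endian):
    """Encoding scheme: write each 16-bit code unit as two bytes."""
    fmt = (">" if big_endian else "<") + "%dH" % len(units)
    return struct.pack(fmt, *units)

text = "A\U00010400"
units = utf16_code_units(text)      # one sequence of code units...
be = serialize(units, True)         # ...and two possible byte streams
le = serialize(units, False)
print([hex(u) for u in units], be.hex(), le.hex())
```

    The code unit sequence is the same in both cases; only the byte serialization differs, which is why BE and LE are schemes rather than forms.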

    So there are effectively three standardized encoding forms (UTF-8, UTF-16, UTF-32), which give rise to seven standardized encoding schemes (UTF-8, UTF-16BE, UTF-16LE, UTF-16, UTF-32BE, UTF-32LE, UTF-32).
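
    As a rough illustration, Python's built-in codecs cover all seven schemes. One caveat, which is an assumption about the Python codecs and not something the standard requires: the unmarked "utf-16" and "utf-32" codecs always write a byte order mark and use the machine's native byte order when encoding, whereas the Unicode schemes of the same name make the BOM optional and default to big-endian without it.

```python
text = "A\u00e9"  # U+0041, U+00E9

schemes = ["utf-8", "utf-16-be", "utf-16-le", "utf-16",
           "utf-32-be", "utf-32-le", "utf-32"]

# One string, seven byte streams.
encoded = {name: text.encode(name) for name in schemes}

for name, data in encoded.items():
    print(f"{name:10} {data.hex()}")
```

    The BE and LE variants carry no byte order mark, so the receiver must already know the byte order from the scheme label.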

    There are also other non-standardized encoding forms described in Unicode Technical Reports (BOCU-1 and SCSU, which use 8-bit code units like UTF-8 and so serialize trivially to bytes), plus an additional encoding scheme (CESU-8, which generates streams of bytes from the UTF-16 encoding form).
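
    Python ships no codec for CESU-8 (the CESU scheme mentioned above), so the following is only a sketch under that assumption, with a made-up helper name cesu8_encode. It takes the UTF-16 code units and encodes each one separately with the UTF-8 bit layout, so a supplementary character becomes six bytes (two encoded surrogates) instead of the four bytes UTF-8 would use.

```python
import struct

def cesu8_encode(text):
    """Sketch of CESU-8: UTF-8-style bytes for each UTF-16 code unit."""
    raw = text.encode("utf-16-be")
    units = struct.unpack(">%dH" % (len(raw) // 2), raw)
    # "surrogatepass" lets the UTF-8 codec emit bytes for lone surrogates.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)

bmp = "A\u00e9"
supplementary = "\U00010400"

print(cesu8_encode(bmp).hex())                # same bytes as UTF-8
print(cesu8_encode(supplementary).hex())      # six bytes
print(supplementary.encode("utf-8").hex())    # four bytes
```

    For text limited to the BMP the output is identical to UTF-8; the two only diverge for supplementary characters.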

    UTF-7 is also specified outside the standard itself (in RFC 2152) and is an encoding form with 7-bit code units. Its encoding scheme to bytes is nearly trivial, but not quite: for streamed transmission the high bit of each byte may carry a parity bit, which can safely be changed or ignored, so several byte values are equivalent.
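
    A small sketch of that equivalence, using Python's "utf-7" codec. The helpers with_even_parity and strip_high_bit are hypothetical names for this example, and even parity is just one possible convention a transmission channel might use.

```python
def with_even_parity(data):
    """Set the high bit of each 7-bit unit so every byte has even parity."""
    out = bytearray()
    for b in data:
        parity = bin(b).count("1") & 1
        out.append(b | (parity << 7))
    return bytes(out)

def strip_high_bit(data):
    """Discard the high bit, recovering the 7-bit code units."""
    return bytes(b & 0x7F for b in data)

text = "Hi \u00e9"
seven = text.encode("utf-7")         # every byte is below 0x80
framed = with_even_parity(seven)     # same code units, different bytes
recovered = strip_high_bit(framed).decode("utf-7")
print(seven, framed, recovered)
```

    Both byte streams carry the same sequence of 7-bit code units, which is the sense in which the bytes are equivalent.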

    ISO 2022, GB 18030 and the new JIS standard are mostly viewed as encoding forms with a trivial identity encoding scheme, because their code units are 8 bits wide.
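
    For comparison, Python's "gb18030" and "iso2022_jp" codecs (the latter being one ISO 2022 based encoding, picked here only as an example) produce byte streams directly, with no byte order question to settle. The sample characters are arbitrary choices for this sketch.

```python
samples = {"ascii": "A", "cjk": "\u4e2d", "supplementary": "\U00010400"}

# GB 18030 uses one, two or four bytes per character.
gb_lengths = {name: len(ch.encode("gb18030")) for name, ch in samples.items()}

# ISO-2022-JP switches character sets with escape sequences.
iso_bytes = "\u65e5\u672c".encode("iso2022_jp")

print(gb_lengths)
print(iso_bytes)
```

    In both cases the code units are already bytes, so the scheme adds nothing beyond the form.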


