Re: Unicode conformant character encodings and us-ascii

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat May 17 2003 - 07:51:29 EDT

Next message: Ben Dougall: "Re: character groupings in various languages"

Previous message: Andrew C. West: "RE: how to sort by stroke (not radical/stroke)"
In reply to: Stefan Persson: "Re: Unicode conformant character encodings and us-ascii"
Next in thread: Doug Ewell: "Re: Unicode conformant character encodings and us-ascii"
Reply: Doug Ewell: "Re: Unicode conformant character encodings and us-ascii"

From: "Stefan Persson" <alsjebegrijptwatikbedoel@yahoo.se>
> Are not BE and LE regarded as different encoding forms, making five
> encoding forms (UTF-8, UTF-16BE, UTF-16LE, UTF-32BE & UTF-32LE)?

This was true up to Unicode 3.x, but not since Unicode 4.0.0, which distinguishes encoding forms (mappings from code points to ordered sequences of fixed-width code units) from encoding schemes (serializations of those code units to ordered streams of bytes, taking byte order into account).
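To make the distinction concrete, here is a minimal Python sketch. The helper names `code_units` and `serialize_utf16` are invented for this illustration and are not taken from the standard; the sample character U+10400 is an arbitrary supplementary code point.

```python
import struct

def code_units(text, form):
    """Return the code units of `text` in the given encoding form (hypothetical helper)."""
    if form == "UTF-8":
        return list(text.encode("utf-8"))                       # 8-bit code units
    if form == "UTF-16":
        raw = text.encode("utf-16-be")
        return list(struct.unpack(f">{len(raw) // 2}H", raw))   # 16-bit code units
    if form == "UTF-32":
        raw = text.encode("utf-32-be")
        return list(struct.unpack(f">{len(raw) // 4}I", raw))   # 32-bit code units
    raise ValueError(f"unknown encoding form: {form}")

def serialize_utf16(units, byte_order):
    """Encoding-scheme step: turn 16-bit code units into bytes in a chosen byte order."""
    prefix = ">" if byte_order == "BE" else "<"
    return struct.pack(f"{prefix}{len(units)}H", *units)

sample = "\U00010400"                      # one code point outside the BMP
units = code_units(sample, "UTF-16")       # encoding form: two 16-bit code units
print([hex(u) for u in units])
print(serialize_utf16(units, "BE").hex())  # UTF-16BE encoding scheme
print(serialize_utf16(units, "LE").hex())  # UTF-16LE encoding scheme
```

The code units are the same in both cases; only the byte serialization differs, which is why BE and LE belong to the encoding scheme level rather than the encoding form level.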

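So there are effectively three standardized encoding forms (UTF-8, UTF-16, UTF-32), which yield seven standardized encoding schemes (UTF-8, UTF-16BE, UTF-16LE, UTF-16, UTF-32BE, UTF-32LE, UTF-32). The unmarked UTF-16 and UTF-32 schemes can carry a byte order mark to signal which byte order is in use.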
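As a rough illustration, Python's standard codecs happen to line up with these seven scheme names. The mapping table `SCHEMES` and the function `serialize` are assumptions made for this sketch. Note that Python's unmarked "utf-16" and "utf-32" encoders write a BOM followed by the platform's native byte order, which is one permitted behaviour, not the only one.

```python
import codecs

# Hypothetical mapping from the seven scheme names to Python codec names.
SCHEMES = {
    "UTF-8": "utf-8",
    "UTF-16BE": "utf-16-be",
    "UTF-16LE": "utf-16-le",
    "UTF-16": "utf-16",      # BOM + native byte order when encoding in Python
    "UTF-32BE": "utf-32-be",
    "UTF-32LE": "utf-32-le",
    "UTF-32": "utf-32",      # BOM + native byte order when encoding in Python
}

def serialize(text):
    """Serialize `text` under every scheme and return {scheme name: bytes}."""
    return {name: text.encode(codec) for name, codec in SCHEMES.items()}

encoded = serialize("A")
for name, data in encoded.items():
    print(f"{name:9} {data.hex()}")

# The unmarked schemes start with a byte order mark in this implementation.
print(encoded["UTF-16"][:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))
```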

There are also other, non-standardized encoding forms described in Unicode Technical Reports (BOCU and SCSU, which use 8-bit code units like UTF-8 and have a trivial serialization to bytes), plus an additional encoding scheme (CESU-8, which generates a stream of bytes from the UTF-16 encoding form).
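The CESU case can be sketched as follows, assuming the usual CESU-8 behaviour: each 16-bit code unit of the UTF-16 form, including each surrogate, is written out with the UTF-8 bit pattern for its value. The function name `cesu8_encode` is made up for this example, and the sketch is an encoder only, with no validation of ill-formed input.

```python
def cesu8_encode(text):
    """Sketch of a CESU-8 style encoder (hypothetical helper, not a library API)."""
    out = bytearray()
    for ch in text:
        if ord(ch) >= 0x10000:
            # Supplementary code point: take its two UTF-16 code units...
            raw = ch.encode("utf-16-be")
            for i in (0, 2):
                unit = int.from_bytes(raw[i:i + 2], "big")
                # ...and write each surrogate as a 3-byte UTF-8 style sequence.
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            # BMP code point: identical to plain UTF-8.
            out += ch.encode("utf-8")
    return bytes(out)

print(cesu8_encode("A").hex())            # same bytes as UTF-8
print(cesu8_encode("\U00010400").hex())   # six bytes instead of UTF-8's four
print("\U00010400".encode("utf-8").hex())
```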

UTF-7 is described in RFC 2152 and is an encoding form (to 7-bit code units) with an almost trivial encoding scheme to bytes. It is not entirely trivial, because the high bit of each byte may carry a parity bit for streamed transmission; that bit can safely be changed or ignored, so several byte values are equivalent.
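A small sketch of the parity point, using Python's built-in "utf-7" codec. The helpers `with_even_parity` and `strip_high_bit` are hypothetical and only simulate a 7-bit channel that uses the eighth bit for parity; they are not part of any UTF-7 specification.

```python
def with_even_parity(data):
    """Set the high bit of each byte so that its count of 1 bits is even."""
    out = bytearray()
    for b in data:
        parity = bin(b).count("1") & 1   # 1 if the 7-bit value has odd weight
        out.append(b | (parity << 7))
    return bytes(out)

def strip_high_bit(data):
    """Ignore the parity bit and recover the 7-bit code units."""
    return bytes(b & 0x7F for b in data)

text = "Hi Mom \u263a!"
seven_bit = text.encode("utf-7")
print(all(b < 0x80 for b in seven_bit))   # every code unit fits in 7 bits

on_the_wire = with_even_parity(seven_bit) # some bytes now have the high bit set
recovered = strip_high_bit(on_the_wire)
print(recovered.decode("utf-7") == text)
```

Both `seven_bit` and `on_the_wire` stand for the same text once the high bit is disregarded, which is the sense in which the byte values are equivalent.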

ISO 2022, GB 18030 and the new JIS standard are mostly viewed as encoding forms with a trivial identity encoding scheme, because their code units are 8 bits wide.

This archive was generated by hypermail 2.1.5 : Sat May 17 2003 - 08:33:29 EDT