Re: Unicode conformant character encodings and us-ascii

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat May 17 2003 - 13:53:25 EDT


    From: "Kenneth Whistler" <kenw@sybase.com>
    > Second, code points are not *serialized* into code units.
    > Serialization is an issue for encoding schemes, and is the
    > serialization of the code units into byte sequences. Again,
    > see Chapter 3 of The Unicode Standard, Version 4.0 for all
    > the details.

    You don't need to quote it; I have already read it fully. Whatever you think, code units are first defined for use in memory, but the concept of "memory" is quite vague in Unicode, and in all modern OSes it is also a storage format (on disk, because of VM swaps), so memory storage is really already a serialization (even if that is not immediately visible to the application code that accesses these memory cells in a "grouped" or "aligned" way).
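
    To make the distinction concrete, here is a minimal Python sketch (my own illustration, not from the standard): the same string yields 16-bit UTF-16 code units (the encoding form, as they would sit in memory) and two different byte serializations of those units (the UTF-16BE and UTF-16LE encoding schemes):

        import struct

        text = "\U00010400"   # U+10400, needs a surrogate pair in UTF-16

        # Encoding form: the 16-bit code units as they would sit in memory.
        be_bytes = text.encode("utf-16-be")
        code_units = struct.unpack(">%dH" % (len(be_bytes) // 2), be_bytes)
        print([hex(u) for u in code_units])   # ['0xd801', '0xdc00']

        # Encoding scheme: the same code units serialized to bytes; the byte
        # order is a property of the serialization, not of the code units.
        print(text.encode("utf-16-be"))       # b'\xd8\x01\xdc\x00'
        print(text.encode("utf-16-le"))       # b'\x01\xd8\x00\xdc'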

    Why would transmission be restricted to byte units? In fact we could just as well find further steps, because addressable 4-bit memory also exists in microcontrollers, and this requires another ordering specification for nibbles. There are also transmission interfaces that never work on byte units but only on unbreakable 16-bit or 32-bit entities.
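
    As an illustration only (my own example, not from any specification): once you address 4-bit units, an ordering convention is needed at that level too, just as byte order is needed at the byte level:

        unit = 0x00E9   # one 16-bit code unit (for U+00E9)

        # Splitting it into nibbles requires choosing an order, exactly as
        # splitting it into bytes does.
        nibbles_msb_first = [(unit >> s) & 0xF for s in (12, 8, 4, 0)]
        nibbles_lsb_first = list(reversed(nibbles_msb_first))
        print(nibbles_msb_first)   # [0, 0, 14, 9]
        print(nibbles_lsb_first)   # [9, 14, 0, 0]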

    The distinction between code units and bytes is quite artificial in the Unicode specification (it just corresponds to common usage in microcomputers, and forgets the case of microcontrollers, mainframes, or newer architectures that never handle data in byte units), so I think that the new distinction between encoding forms and encoding schemes is also artificial and assumes a microprocessor-only architecture.

    So I think it was an error to define concepts that do not exist in the ISO definition of encodings: Unicode builds its own classification of encodings, using distinctions that are not necessary in practice and that illegitimately assume a processing model. In my view, the idea of code units and encoding forms is just an internal device Unicode uses to define the steps necessary to produce the real, concrete encoding schemes.

    Even if you consider the final UTF-8 CES (encoding scheme), it may not be enough for transmission on networks or in other protocols, and mechanisms like ISO 2022 may apply further encoding steps to make it fit 7-bit environments (or even more restricted ones: just think about the IDNA encoding, which uses a very restricted set of only 37 symbols for compatibility with existing DNS specifications).
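
    For example (a sketch using Python's built-in codecs, with "bücher" as a hypothetical label): the IDNA/Punycode step reduces a Unicode label to the letters, digits and hyphen that DNS accepts, a further encoding layer applied on top of any Unicode encoding scheme:

        label = "bücher"                  # hypothetical domain label
        print(label.encode("idna"))       # b'xn--bcher-kva'  (ASCII-compatible form)
        print(label.encode("punycode"))   # b'bcher-kva'      (Punycode without the ACE prefix)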

    > > One could argue that all *precisely defined* legacy character
    > > encodings (this includes the new GB2312 encoding)
    >
    > As Doug pointed out, Philippe probably means GB 18030 here.
    > GB 2312 is an *old* character encoding standard. It was published
    > in 1981.

    Sorry for the reference: I could not immediately remember the exact number, so I used the term "new GB2312", because GB18030 is a natural update of the old GB2312 and of the informal, intermediate proprietary encoding formats that were added by Microsoft on top of GB2312 without real agreement (so there were some incompatibilities between vendors trying to implement these extensions, which GB18030 fixes in its standard by endorsing the most common practices).
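
    That lineage can be seen with Python's codecs (my own sketch): the euro sign is absent from plain GB2312, was mapped by the vendor GBK/CP936 extension, and has a standard mapping in GB18030:

        euro = "\u20ac"
        print(euro.encode("gbk"))        # b'\x80'      (vendor extension mapping)
        print(euro.encode("gb18030"))    # b'\xa2\xe3'  (standardized in GB18030)
        try:
            euro.encode("gb2312")
        except UnicodeEncodeError:
            print("U+20AC is not encodable in plain GB2312")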


