Re: Unicode conformant character encodings and us-ascii

From: Doug Ewell
Date: Sat May 17 2003 - 18:44:46 EDT

    Philippe Verdy <verdy_p at wanadoo dot fr> wrote:

    > Whatever you
    > think, code units are first defined for use in memory, but the
    > concept of "memory" is quite vague in Unicode, and in all modern
    > OSes it is also a storage format (on disk, because of VM swaps), so
    > memory storage is really already a serialization (even if this is
    > not immediately visible from the application code that uses these
    > memory cells in a "grouped" or "aligned" way).

    This doesn't even begin to be true. Virtual memory storage, which is
    ephemeral, doesn't have anything to do with the file formats in which
    data is written to disk.

    Memory images are highly OS- and platform-specific, and are not
    specified in Unicode. Serialization formats like UTF-8 are intended to
    be cross-platform, which is why they are specified.
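
    As a rough illustration of that difference, the Python sketch below
    compares a native in-memory array of 16-bit code units with the
    UTF-16BE serialization of the same text. The variable names are
    illustrative only, and the sketch assumes that the array type code
    "H" is two bytes wide, which holds on common platforms but is not
    guaranteed.

```python
import sys
from array import array

text = "A\u20ac"  # 'A' and the euro sign, both in the BMP

# In-memory view: two 16-bit code units stored in the platform's native
# byte order. Assumption: array type code "H" is 2 bytes wide here.
code_units = array("H", [0x0041, 0x20AC])
memory_image = code_units.tobytes()

# Serialized view: the UTF-16BE encoding scheme fixes the byte order,
# so the result is the same on every platform.
serialized = text.encode("utf-16-be")

print(sys.byteorder, memory_image.hex(), serialized.hex())
```

    The memory image follows sys.byteorder, so it can differ from one
    machine to the next, while the serialized bytes do not.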

    > Why would transmission be restricted to use byte units? In fact we
    > could as well find further steps, because addressable 4-bit memory
    > also exists in microcontrollers and this requires another ordering
    > specification for nibbles. There also exist transmission interfaces
    > that never work on byte units but only on unbreakable 16-bit or
    > 32-bit entities.

    Trying to redefine the basic data transmission model sounds like
    something out of Bytext. It definitely wouldn't promote the easy
    adoption of Unicode, regardless of whether it is technically possible.

    > The distinction between code units and bytes is quite artificial in
    > the Unicode specification (it just corresponds to common usage in
    > microcomputers, and forgets the case of microcontrollers and
    > mainframes or newer architectures that never handle data by byte
    > units), so I think that the new distinction between encoding forms
    > and encoding schemes is also artificial and assumes a
    > microprocessor-only architecture.

    Considering the prominent role of microprocessors and byte-oriented
    architectures in today's computing world, it hardly seems "artificial"
    to establish a formal relationship between code units and bytes.
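
    A small Python sketch of that relationship, using one supplementary
    character: the three encoding forms produce different numbers of
    code units, and each code unit maps onto one, two, or four bytes.
    The helper name code_units is hypothetical, and big-endian order is
    chosen here only to make the unpacking explicit.

```python
import struct

ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP


def code_units(text, codec, fmt, width):
    """Return the code units of text as integers (hypothetical helper).

    codec is a big-endian Python codec name, fmt the struct format
    character for one code unit, and width its size in bytes.
    """
    data = text.encode(codec)
    return list(struct.unpack(">%d%s" % (len(data) // width, fmt), data))


utf8_units = code_units(ch, "utf-8", "B", 1)       # 8-bit code units
utf16_units = code_units(ch, "utf-16-be", "H", 2)  # 16-bit code units
utf32_units = code_units(ch, "utf-32-be", "I", 4)  # 32-bit code units

for name, units in (("UTF-8", utf8_units),
                    ("UTF-16", utf16_units),
                    ("UTF-32", utf32_units)):
    print(name, [hex(u) for u in units])
```

    For UTF-8 the code units and the bytes coincide; for UTF-16 and
    UTF-32 a byte order has to be chosen before the code units become
    bytes, which is what the encoding schemes specify.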

    > So I think it was an error to define concepts that do not exist in the
    > ISO definition of encodings,

    The Unicode/ISO 10646 concepts *are* now the ISO definition.

    > and Unicode builds a classification of
    > encodings on its own, using distinctions that in practice are not
    > necessary or that illegitimately assume a processing model. So I
    > think that the idea of code units and encoding forms is just used as
    > an internal way for Unicode only to define the steps necessary to
    > produce the real/concrete encoding schemes.

    The distinctions make sense. They didn't make sense to me when I first
    read about them, because I didn't have the necessary insight and
    perspective. Your day will come too.

    > Even if you think about the final UTF-8 CES (encoding scheme), it may
    > not be enough for transmission on networks or in other protocols, and
    > mechanisms like ISO 2022 may further apply encoding steps to make it
    > fit with 7-bit environments

    ISO 2022 and Unicode live in separate worlds.

    > (or even lower: just think about the IDNA encoding which uses a very
    > restricted set of encoding values with only 37 symbols for
    > compatibility with existing DNS specifications).

    That's why the IDN working group didn't adopt UTF-8 as its encoding
    scheme, but instead adopted a TES called Punycode that uses only the
    permissible characters.
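
    For illustration, Python's standard "punycode" and "idna" codecs
    (the latter implements the 2003 IDNA rules) can show the effect. The
    label "bücher" is just a sample, and the set named allowed is an
    assumption about what "permissible" means here: letters, digits, and
    hyphen, 37 symbols in all.

```python
import string

label = "b\u00fccher"  # "bücher", a sample label with one non-ASCII letter

# Raw Punycode: the basic characters first, then the encoded insertion.
puny = label.encode("punycode")

# Full IDNA ToASCII: nameprep plus Punycode plus the "xn--" ACE prefix.
ace = label.encode("idna")

# The 37 symbols mentioned above: a-z, 0-9 and hyphen.
allowed = set(string.ascii_lowercase + string.digits + "-")

print(puny, ace, len(allowed))
print(all(chr(b) in allowed for b in ace))
```

    Decoding with ace.decode("idna") recovers the original label, so
    nothing is lost by staying inside the restricted repertoire.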

    It is true, as you say, that transfer encoding syntaxes are sometimes
    needed as a wrapper around character encoding schemes. That doesn't
    make it a conceptual error to distinguish between the two.
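
    A minimal sketch of the layering, with Base64 standing in as the
    transfer encoding syntax (chosen only as a familiar example): the
    character encoding scheme turns characters into bytes, and the TES
    then turns those bytes into a 7-bit-safe form.

```python
import base64

text = "\u20ac"  # the euro sign

# Step 1, character encoding scheme: characters to bytes.
ces_bytes = text.encode("utf-8")

# Step 2, transfer encoding syntax: arbitrary bytes to 7-bit-safe bytes.
wrapped = base64.b64encode(ces_bytes)

# Unwrapping reverses the two steps in the opposite order.
unwrapped = base64.b64decode(wrapped).decode("utf-8")

print(ces_bytes.hex(), wrapped, unwrapped == text)
```

    The two steps stay independent: the Base64 layer knows nothing about
    characters, and UTF-8 knows nothing about the 7-bit channel.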

    -Doug Ewell
     Fullerton, California
