From: Doug Ewell (firstname.lastname@example.org)
Date: Sat May 17 2003 - 18:44:46 EDT
Philippe Verdy <verdy_p at wanadoo dot fr> wrote:
> Whatever you
> think, code units are first defined for use in memory, but the
> concept of "memory" is quite vague in Unicode, and in all modern
> OS'es, it is also a storage format (on disk, because of VM swaps), so
> memory storage is really and already a serialization (even if it is
> not seen immediately from the application code that uses these memory
> cells in a "grouped" or "aligned" way).
This doesn't even begin to be true. Virtual memory storage, which is
ephemeral, doesn't have anything to do with the file formats in which
data is written to disk.
Memory images are highly OS- and platform-specific, and are not
specified in Unicode. Serialization formats like UTF-8 are intended to
be cross-platform, which is why they are specified.
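To illustrate the point (a Python sketch, not part of the original message): the same string yields identical UTF-8 bytes on every platform, while byte-serialized UTF-16 depends on byte order, which is exactly why the serialization formats need a cross-platform specification.

```python
# EURO SIGN, U+20AC: one code point, three serializations.
s = "\u20ac"

utf8 = s.encode("utf-8")          # same bytes on every platform
utf16_be = s.encode("utf-16-be")  # big-endian byte serialization
utf16_le = s.encode("utf-16-le")  # little-endian byte serialization

print(utf8.hex())      # e282ac
print(utf16_be.hex())  # 20ac
print(utf16_le.hex())  # ac20
```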
> Why would transmission be restricted to use byte units? In fact we
> could as well find further steps, because addressable 4-bit memory
> also exists in microcontrollers and this requires another ordering
> specification for nibbles. There also exist transmission interfaces
> that never work on byte units but only on unbreakable 16-bit or 32-bit
Trying to redefine the basic data transmission model sounds like
something out of Bytext. It definitely wouldn't promote the easy
adoption of Unicode, regardless of whether it is technically possible.
> The distinction between code units and bytes is quite artificial in
> the Unicode specification (it just corresponds to common usage in
> microcomputers, and forgets the case of microcontrollers and mainframes
> or newer architectures that never handle data by byte units), so I
> think that the new distinction between encoding forms and encoding
> schemes is also artificial and assumes a microprocessor-only
Considering the prominent role of microprocessors and byte-oriented
architectures in today's computing world, it hardly seems "artificial"
to establish a formal relationship between code units and bytes.
> So I think it was an error to define concepts that do not exist in the
> ISO definition of encodings,
The Unicode/ISO 10646 concepts *are* now the ISO definition.
> and Unicode builds a classification of
> encodings by its own using distinctions that in practice are not
> necessary or assumes illegitimately a processing model. So I think
> that the idea of code units and encoding forms is just used as an
> internal way for Unicode only to define the steps necessary to produce
> the real/concrete encoding schemes.
The distinctions make sense. They didn't make sense to me when I first
read about them, because I didn't have the necessary insight and
perspective. Your day will come too.
> Even if you think about the final UTF-8 CES (encoding scheme), it may
> not be enough for transmission on networks or in other protocols, and
> mechanisms like ISO2022 may further apply encoding steps to make it
> fit with 7-bit environments
ISO 2022 and Unicode live in separate worlds.
> (or even lower: just think about the IDNA encoding which uses a very
> restricted set of encoding values with only 37 symbols for
> compatibility with existing DNS specifications).
That's why the IDN working group didn't adopt UTF-8 as its encoding
scheme, but instead adopted a TES called Punycode that uses only those
37 symbols.
It is true, as you say, that transfer encoding syntaxes are sometimes
needed as a wrapper around character encoding schemes. That doesn't
make it a conceptual error to distinguish between the two.
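For instance (a Python sketch using the standard library's `punycode` and `idna` codecs, not anything cited in the original exchange): Punycode re-encodes a Unicode label into the letters-digits-hyphen repertoire that DNS already permits, wrapping rather than replacing the character encoding.

```python
label = "b\u00fccher"  # "bücher"

# Bare Punycode: basic code points, a delimiter, then the encoded rest.
puny = label.encode("punycode")
print(puny)  # b'bcher-kva'

# The IDNA form adds the ACE prefix used in actual DNS labels.
ace = label.encode("idna")
print(ace)   # b'xn--bcher-kva'
```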
This archive was generated by hypermail 2.1.5 : Sat May 17 2003 - 19:23:43 EDT