Re: Last Call: UTF-16, an encoding of ISO 10646 to Informational

From: Juliusz Chroboczek (jec@dcs.ed.ac.uk)
Date: Mon Aug 16 1999 - 13:15:09 EDT


Frank da Cruz <fdc@watsun.cc.columbia.edu>:

FdC> Granted it's not a big deal to swap bytes,

I think that the fact that it's easy to swap bytes is the strongest
reason why only one version of UTF-16 should be registered.

While, say, ISO 8859-1 and CP-1252 have the same field of application,
there is an argument to be made for CP-1252 to be registered, as users
of Windows machines probably want to send e-mail without losing
quotation marks and suchlike.

On the other hand, there is no reason whatsoever to register two
versions of UTF-16. As conversion is both lossless and trivial[1], it
is quite reasonable for machines to systematically convert into a
common order (call it Network Order if you wish) before sending data
on the wire.
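
A minimal sketch of what "convert into a common order before sending"
could look like. The function names are hypothetical, and the choice
of big-endian as the wire order is an assumption made purely for
illustration; the argument above works equally well with the opposite
choice. Because the bytes are produced with shifts rather than by
reinterpreting memory, the same code gives the same wire bytes on any
host.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write n UTF-16 code units to out in big-endian
 * byte order, whatever the host byte order is.
 * out must have room for 2 * n bytes. */
void utf16_to_wire(const uint16_t *units, size_t n, unsigned char *out)
{
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = (unsigned char)(units[i] >> 8);   /* high byte first */
        out[2 * i + 1] = (unsigned char)(units[i] & 0xFF); /* then low byte */
    }
}

/* Hypothetical inverse: read n big-endian code units from in
 * into host-order 16-bit values. */
void utf16_from_wire(const unsigned char *in, size_t n, uint16_t *units)
{
    for (size_t i = 0; i < n; i++)
        units[i] = (uint16_t)((in[2 * i] << 8) | in[2 * i + 1]);
}
```

A sender would call utf16_to_wire just before writing to the socket,
and a receiver would call utf16_from_wire just after reading; neither
side needs to know the other's native byte order.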

In other words, registering two formats for UTF-16 complicates matters
while bringing no benefits whatsoever.

(To clarify: I do not have any preference for one canonical order over
the other. Just pick one and stick to it consistently.)

As to the registration of a BOM-ful format of UTF-16... please, give
me a break.

Sincerely,

                                        J.

[1] Depending on the architecture, anywhere between one instruction
per halfword (exchange or rotate) and five instructions per halfword
(move, shift, shift, and, or). Assuming the data is already in the
cache, we're probably speaking of converting around 100 MB of data
per second on a half-recent machine.
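
For illustration, a sketch of the swap described in the footnote,
written with the portable shift/and/or sequence. The names swap16 and
utf16_swap_buffer are hypothetical. On architectures that have an
exchange or rotate instruction, a compiler may well reduce swap16 to
that single instruction, but this is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>

/* Swap the two bytes of one 16-bit code unit: and, shift, shift, or. */
static uint16_t swap16(uint16_t x)
{
    return (uint16_t)(((x & 0x00FFu) << 8) | (x >> 8));
}

/* Swap every code unit of a buffer in place.
 * Applying this twice restores the original data, which is why the
 * conversion is lossless. */
void utf16_swap_buffer(uint16_t *units, size_t n)
{
    for (size_t i = 0; i < n; i++)
        units[i] = swap16(units[i]);
}
```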


