Re: UTF-32 and Pfennig

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Fri Jul 23 1999 - 09:48:27 EDT


Torsten Mohrin wrote on 1999-07-23 11:19 UTC:
> Could you please clarify the difference between the terms "UCS-4" and
> "UTF-32".

http://www.unicode.org/unicode/reports/tr19/

In essence, UTF-32 is UCS-4 restricted to characters below U+110000.
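
As a minimal sketch (my own, not from TR #19), the restriction amounts
to nothing more than a range check:

  #include <stdint.h>

  /* A UCS-4 value is also a valid UTF-32 code unit iff it lies
   * below U+110000, the limit reachable through UTF-16 surrogate
   * pairs. Values from U+110000 up to U+7FFFFFFF are legal UCS-4
   * but not legal UTF-32. */
  int ucs4_is_utf32(uint32_t c)
  {
      return c < 0x110000UL;
  }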

I personally think the idea is a load of matter that millions of flies
will love:

  - UTF-32 is primarily a political correctness exercise to keep the UTF-16
    crowd happy, which still feels morally threatened by the idea that
    UCS is a 31-bit set and by the fact that the competition (i.e., UTF-8)
    covers it all. (Note that TR #19 explicitly declares UTF-16BE
    *and* UTF-16LE to be the preferred representation forms and mentions
    UTF-8 only for legacy applications, which clearly indicates which
    camp TR #19 comes from.)

  - UTF-32 makes the same mistake as UTF-16 in allowing both a bigendian
    and a littleendian form, with a BOM to signal which of the two is in
    use, as opposed to making life simple by decreeing one to be the
    chosen form (bigendian, of course).

  - If UTF-32 is really used as an encoding somewhere, then it wastes
    25% of all bits. One should instead specify a 24-bit format; the
    processing required to read it into 32-bit registers would nullify
    any difference between LE and BE, such that we can happily agree
    on BE (as we did in UTF-8, which is after all a bigendian format).
    A sketch of such a reader follows this list.
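
To make that concrete, here is a hypothetical reader for such a 24-bit
format (the format and the function name are my invention, purely for
illustration). The shift-and-or unpacking produces the same 32-bit
value on any host, so no BOM and no LE/BE variants would be needed:

  #include <stdint.h>

  /* Hypothetical "UCS-24BE": each character stored as three bytes,
   * most significant byte first. Unpacking with shifts is byte-order
   * independent by construction, which is exactly why the LE/BE
   * distinction evaporates for such a format. */
  uint32_t read_ucs24be(const unsigned char *p)
  {
      return ((uint32_t)p[0] << 16) |
             ((uint32_t)p[1] <<  8) |
              (uint32_t)p[2];
  }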

Note that bigendian encodings have for a long time now been the common
practice in all standards related to data exchange in heterogeneous
systems. I do not know of a single ISO or ITU standard that uses
littleendian anywhere. ASN.1/BER, JBIG, JPEG, MPEG, XDR, CORBA IIOP, and
many more all use bigendian. So far, littleendian standards have been
Microsoft product-related formats only. Unicode is the only standard
that I know of that tries to elevate littleendian streams of 16-bit
integer values to some higher form of legitimacy. I am not happy about
this. I think Microsoft could very easily have done a byte swap at load
time, such that UTF-16LE never ever showed up in files and network
packets. Note that the Pentium has a BSWAP machine instruction that can
do this as an additional pipeline step with zero overhead, so there is
certainly no performance justification for not using UTF-16BE everywhere
in external byte streams. Even without BSWAP, the cost of exchanging two
bytes is still negligible compared to the cost of getting these bytes
into the L1 cache in the first place.
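
For instance, a portable 16-bit swap is just two shifts and an or,
which a compiler can turn into a single rotate instruction on x86 (the
function name here is mine):

  #include <stdint.h>

  /* Swap the two bytes of a 16-bit code unit, e.g. to convert
   * UTF-16LE read from disk into bigendian order before it goes
   * back out onto the wire. Compilers typically compile this down
   * to a single ROL instruction on x86. */
  static inline uint16_t swap16(uint16_t x)
  {
      return (uint16_t)((x << 8) | (x >> 8));
  }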

> My father (who is 70) and my grandmother (who is 90) still know the
> Pfennig symbol very well.

Hm, sounds to me like a good candidate for Plane 1 next to the ancient
Egyptian scripts. In any case, I think the addition of the GERMAN
PFENNIG SYMBOL is a very good sign. It indicates that we are now
starting to run out of good ideas for useful characters that are still
missing in Unicode 3.0. :-)

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


