Re: Perception that Unicode is 16-bit (was: Re: Surrogate space i

From: Tom Lord (
Date: Thu Feb 22 2001 - 15:02:36 EST

The message being replied to wrote:
   "Unicode is a character set encoding standard which currently provides for
   its entire character repertoire to be represented using 8-bit, 16-bit or
   32-bit encodings."

Please say "encoding forms".

There are three distinct terms that sound similar and apparently cause
confusion. All of them use the word "encoding".

Unicode is a "character set encoding", a mapping between code points
and abstract characters. Code points range from 0 to 0x10FFFF (roughly
21 bits) and have nothing to do with 8, 16, or 32 bit values.
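To make the point concrete, here is a small sketch (assuming Python 3, whose strings work directly in code points) showing that a code point is just a number in that range, independent of any encoding:

```python
# A code point is just an integer in [0, 0x10FFFF]; no particular
# code-unit width is implied.
cp = ord("\U0001F600")          # U+1F600, GRINNING FACE
print(hex(cp))                  # 0x1f600
print(cp <= 0x10FFFF)           # True: within the code space
print((0x10FFFF).bit_length())  # 21: the largest code point needs 21 bits
```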

The Unicode Standard defines several "encoding forms", which are rules
for representing code points as sequences of fixed-size integers. UTF-8
uses sequences of 8-bit integers, UTF-16 of 16-bit integers, and UTF-32
of 32-bit integers. These are used mostly in string representations,
file contents, and serializations for transmission.
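A sketch of the three forms side by side (assuming Python 3; the `-be` codec variants are used only to get a fixed byte order so the code-unit sequences are easy to read):

```python
# One code point, U+1F600, under the three encoding forms.
s = "\U0001F600"
utf8  = s.encode("utf-8")      # four  8-bit code units
utf16 = s.encode("utf-16-be")  # two  16-bit code units (a surrogate pair)
utf32 = s.encode("utf-32-be")  # one  32-bit code unit
print(utf8.hex())   # f09f9880
print(utf16.hex())  # d83dde00  (high surrogate D83D, low surrogate DE00)
print(utf32.hex())  # 0001f600
```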

The Unicode Standard defines several "encoding schemes", which
are encoding forms, together with a choice of byte order, for those
encoding forms where byte order matters (UTF-16 and UTF-32, currently).
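The byte-order distinction can be seen directly (assuming Python 3, where "utf-16-be"/"utf-16-le" name explicit byte orders and plain "utf-16" prepends a byte order mark):

```python
# Same code point, same encoding form (UTF-16), different schemes.
s = "A"                            # U+0041
print(s.encode("utf-16-be").hex())  # 0041  big-endian
print(s.encode("utf-16-le").hex())  # 4100  little-endian
bom = s.encode("utf-16")[:2]        # BOM: fffe or feff, platform-dependent
print(bom.hex())
```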

In my opinion, it is worth being fussy about the distinctions between
the three terms. If, thinking mostly about UTF-16 strings, you say
"Unicode has a 16-bit encoding" to someone just learning about the
character set, they are likely to think they can use 2^16-entry
arrays, bitsets, and integer ranges to manipulate character data,
character sets, and characters -- without realizing the compromise
that decision entails. That was my initial reaction, anyway.
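A short sketch of why the 2^16 assumption breaks (assuming Python 3): a supplementary-plane character is one code point but two UTF-16 code units, and its value does not fit a 16-bit index.

```python
# One code point that a 16-bit mental model cannot index.
s = "\U0001F600"
print(len(s))                           # 1 code point
print(len(s.encode("utf-16-be")) // 2)  # 2 UTF-16 code units
print(ord(s) > 0xFFFF)                  # True: outside any 2^16-entry array
```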

        "Confusion has its cost" -- Crosby, Stills, Nash, and Young

Raised by a roving pack of wild, pedantic mathematicians,
Thomas Lord

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:19 EDT