Re: Unicode FAQ addendum

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Thu Jul 20 2000 - 05:09:46 EDT


There's no updating needed. The key is that The Unicode Standard, Version
3.0 recognizes UTF-16 as the default encoding. Therefore code values (or
units), which are defined as the 'minimal bit combination that can represent
a unit of encoded text', are 16-bit. In UTF-16, one sometimes needs two of
these (a surrogate pair), instead of just one, to represent a character.
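
To make the "one or two" point concrete, here is a minimal Python sketch. It
is only an illustration, not anything from the book, and the helper name
utf16_code_units is made up for this example. It counts how many 16-bit code
units UTF-16 needs for a single character.

```python
def utf16_code_units(ch: str) -> int:
    """Count the 16-bit code units UTF-16 needs for one character.

    Hypothetical helper, for illustration only.
    """
    # Each UTF-16 code unit occupies two bytes; "utf-16-be" adds no
    # byte-order mark, so the byte length divides evenly by two.
    return len(ch.encode("utf-16-be")) // 2


# A character inside the first 64K needs a single code unit.
print(utf16_code_units("A"))            # prints 1

# A character outside the first 64K needs two code units.
print(utf16_code_units("\U00010000"))   # prints 2
```

In both cases the unit being counted is 16 bits wide, which is all that C1
is talking about.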

>|
>| C1 says "A process shall interpret Unicode code values as 16-bit
>| quantities."
>
>This I find mightily confusing. Why say something like this when
>there are (well, will be) characters that cannot be represented with
>16 bits in any of the Unicode encodings?

Because the smallest unit of UTF-16 is 16-bit, even though UTF-16 can
represent characters outside the first 64K. See the full text of
definition D5 on page 41.

>| "Code unit" is defined in definition D5 as a synonym for "code
>| value". If this needs updating,

This is not part of the definition, but part of the explanatory text
following D5.

>Unless I've really misunderstood something it does need updating.

No, not really. What you are intuitively looking for is what Unicode calls
the Unicode Scalar Value (which ranges from 0000 to 10FFFF, to use the new
convention of zero padding only up to 4 hex digits). Its definition is
buried in D28, since it was a term first needed in the definition of
Surrogates (section 3.7 on page 45).
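
As a rough sketch of how a pair of 16-bit code values maps to a scalar
value, the Python below applies the usual surrogate arithmetic. It is an
illustration only: the function name scalar_value and the error handling
are assumptions of this example, not wording from the book.

```python
def scalar_value(high: int, low: int) -> int:
    """Combine a high and a low surrogate into one Unicode scalar value.

    Hypothetical helper, for illustration only.
    """
    # High surrogates lie in D800..DBFF, low surrogates in DC00..DFFF.
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("expected a high surrogate followed by a low surrogate")
    # Each surrogate contributes 10 bits; the result starts at 10000 (hex).
    return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000


print(hex(scalar_value(0xD800, 0xDC00)))  # prints 0x10000
print(hex(scalar_value(0xDBFF, 0xDFFF)))  # prints 0x10ffff
```

The two calls show the lowest and highest values reachable through a
surrogate pair, which is why the scalar value range ends at 10FFFF.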

BTW: the editors decided *not* to renumber the definitions as new ones were
added, so there are a few that don't come in the order that you might have
expected. Presumably the benefit is that, when we read old mail in the
future, we can still trace the discussion to the correct definition,
without having to have all the old versions of the book on hand.

For additional information, please have a look at
http://www.unicode.org/unicode/reports/tr17 Character Encoding Model
and
http://www.unicode.org/unicode/reports/tr19 UTF-32

A./

PS: Some of the confusion seems to come from the fact that people quote
partial definitions and others comment on them without reading the context
in the book. Now that Unicode 3.0 has been out for almost half a year, I'm
still surprised how many seriously involved people don't seem to have their
own copy.


