Re: UTF-8 and UTF-16 issues

From: Edward Cherlin (edward.cherlin.sy.67@aya.yale.edu)
Date: Sun Jun 25 2000 - 21:03:39 EDT


At 2:48 PM -0800 6/19/00, Markus Scherer wrote:
>"OLeary, Sean (NJ)" wrote:
> > UTF-16 is the 16-bit encoding of Unicode that includes the use of
> > surrogates. This is essentially a fixed width encoding.
>
>certainly not. utf-16, of course, is variable-width: 1 or 2 16-bit
>units per character. certainly the iuc discussion did not spread
>this under "utf-16" but possibly as "ucs-2".
[snip]

The essential distinction that Sean refers to is not that all
characters are encoded in the same length, but that all code units
are of the same length. This is in contrast not with ISO 10646, but
with "double-byte" encodings of CJK text, where escape sequences are
used to switch between runs of 8-bit and 16-bit codes.
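
As a rough illustration of the escape-sequence style of encoding, here
is a small Python sketch. It assumes ISO-2022-JP as the example
encoding (my choice, not one named in the thread) and uses Python's
standard "iso2022_jp" codec. The names ESCAPE, payload and misread are
only for illustration. It shows that once the escape sequences are out
of view, the remaining bytes read as ordinary 8-bit ASCII.

```python
import re

# Matches the ISO-2022-JP escape sequences that switch character sets,
# e.g. ESC $ B (to JIS X 0208) and ESC ( B (back to ASCII).
ESCAPE = re.compile(rb"\x1b[$(][@-Z]")

# Two kanji, written as escapes so the source file stays ASCII.
text = "\u65e5\u672c"
encoded = text.encode("iso2022_jp")

# Drop the escape sequences, as if we had landed mid-stream
# without having seen them.
payload = ESCAPE.sub(b"", encoded)

# Without the shift state, the same bytes are valid 8-bit ASCII codes.
misread = payload.decode("ascii")

print(encoded)
print(misread)
```

The bytes that encode the two kanji are indistinguishable from four
ASCII characters unless the reader knows which escape sequence came
before them.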

The point of the distinction is that in such double-byte encodings
the only way to tell the length of the current character is by
parsing from the beginning of the file, or at least from a point
where the shift state is known. In UTF-16, the current 16-bit value
is explicitly a 16-bit character code (assigned, unassigned, or
Private Use), a high surrogate, a low surrogate, or not a character
code, without reference to what has gone before in the file.
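
To make the classification concrete, here is a minimal Python sketch.
The function names classify_code_unit and char_start are hypothetical,
and the ranges are the standard surrogate ranges: high (upper)
surrogates D800..DBFF and low (lower) surrogates DC00..DFFF. I treat
FFFE, FFFF and FDD0..FDEF as "not a character code"; the last range is
a later addition to the standard, so take that part as an assumption
about which version you care about.

```python
def classify_code_unit(unit: int) -> str:
    """Classify one 16-bit code unit without looking at its neighbours."""
    if not 0 <= unit <= 0xFFFF:
        raise ValueError("not a 16-bit value")
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    if unit in (0xFFFE, 0xFFFF) or 0xFDD0 <= unit <= 0xFDEF:
        return "not a character code"
    # Assigned, unassigned, or Private Use: all are single-unit codes.
    return "character code"


def char_start(units: list[int], i: int) -> int:
    """Return the index where the character containing units[i] begins.

    Only the unit at i and, at most, the one before it are examined.
    """
    if (
        i > 0
        and classify_code_unit(units[i]) == "low surrogate"
        and classify_code_unit(units[i - 1]) == "high surrogate"
    ):
        return i - 1
    return i


# 'A', a surrogate pair, then a CJK ideograph.
units = [0x0041, 0xD800, 0xDC00, 0x4E2D]
print([classify_code_unit(u) for u in units])
print(char_start(units, 2))
```

Dropping into the sequence at index 2 lands on a low surrogate, and
stepping back one unit finds the start of the character. No scan from
the beginning of the file is needed.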

Edward Cherlin
Generalist
"A knot!" exclaimed Alice. "Oh, do let me help to undo it."
Alice in Wonderland


