Re: Last Call: UTF-16

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Aug 17 1999 - 19:45:47 EDT


Frank asked:

>
> > Your example of "doesn't handle BIDI" comes down to a question of
> > *if* your implementation interprets characters in the main Hebrew,
> > Arabic, Syriac, or Thaana blocks of the standard, and *if* it does
> > any display at all (as opposed to backend processing with no
> > display component), then it *must* conform to Unicode bidirectional
> > behavior, since that is part of the specified normative behavior
> > of characters from those blocks.
> >
> This is a topic of a separate thread -- should a terminal emulator
> that handles a Unicode data stream implement the BIDI algorithm? In cases
> where the companion host application needs precise control of the terminal
> screen, perhaps it should not.

This is partly a matter of where the "application" is, when it consists
of cooperating parts distributed across a network.

Technically speaking, a terminal emulation driven by a host data stream
that incorporates Unicode data in its textual portion is not an
implementation of Unicode plain text. It is a higher-level emulation
protocol that sits on top of lots of little rendered bits and pieces
of Unicode plain text. And the understanding of how those bits and pieces
of text interact for display is all on the host side, which assembles
the data stream that drives the terminal. So the "smarts" for Unicode
bidirectional behavior can all be implemented on the host side, which
does all the (virtual) layout and then packages it up into lines
and cursor positions that go out to the (dumb) terminal, which merely
lines up glyph codes with font indices for display. The fact that
most of the glyph codes could be identical to their Unicode values is
just a shortcut in the rendering, possible if you constrain the list
of scripts and characters you are willing to display.

So if the host is dealing with Unicode Arabic data, then *it* must
be cognizant of and conformant to the Unicode bidirectional algorithm.
But by the time you are pushing through the terminal driver, what
you have is not really "text" at all (though it may look like it, if
you constrain your domain), but rather a stream of glyph codes and
positioning controls.
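To make the division of labor concrete, here is a minimal sketch in C
(purely illustrative; glyph_cmd, bidi_reorder, and layout_line are
made-up names, and the bidi step is a crude stub that just reverses
the line, standing in for the full algorithm). The host resolves
ordering and layout; the terminal receives nothing but positioned
glyph codes:

    #include <stddef.h>
    #include <stdint.h>

    /* What actually crosses the wire to the (dumb) terminal:
       a glyph code and an absolute cell position, nothing more. */
    struct glyph_cmd {
        uint16_t glyph;    /* glyph code, perhaps == Unicode value */
        uint8_t  row, col; /* absolute cell position on screen */
    };

    /* Stub standing in for the full bidirectional algorithm: it
       just reverses the line, which is only right for a line that
       is entirely right-to-left.  The point is where it runs. */
    static void bidi_reorder(const uint16_t *in, uint16_t *out, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = in[n - 1 - i];
    }

    /* Host side: logical-order Unicode text in, positioned
       visual-order glyph commands out. */
    size_t layout_line(const uint16_t *logical, size_t n,
                       uint8_t row, struct glyph_cmd *out)
    {
        uint16_t visual[80];
        if (n > 80) n = 80;            /* one 80-column line */
        bidi_reorder(logical, visual, n);
        for (size_t i = 0; i < n; i++) {
            out[i].glyph = visual[i];
            out[i].row   = row;
            out[i].col   = (uint8_t)i;
        }
        return n;
    }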

>
> > But UTF-16BE, UTF-16LE, and UTF-16 are already in use in
> > various vendor and other protocols, and it would be nice if we could
> > get the naming problem out of the way, ditch "UCS-2" for good, and agree
> > on our labels.
> >
> Why is it bad to say "UCS-2"? Somebody said this before and I didn't
> understand.
>
> Isn't the difference between UCS-2 and UTF-16 that the latter specifies
> a way to access the nonzero planes in 16 bits, whereas the former does
> not? So if an application only claims to handle the BMP, isn't it
> dealing with UCS-2?

UCS-2 and UCS-4 should be thought of as encoding forms. (ISO speak is
"coded representation form".) The number associated with an abstract
character in a character encoding is bound to a particular number of
bits (or octets). UCS-2 uses 16 bits (2 octets); UCS-4 uses 32 bits
(4 octets).

UTF-16 (and UTF-16BE and UTF-16LE) and UTF-8 are encoding schemes.
(ISO speak is "UCS transformation format"). They enable the mapping
of a particular character sequence to an explicit serialized sequence
of bytes. (Cf. the MIME charset concept.) The Unicode terminology is
that a UTF ("Unicode transformation format") "transforms each
Unicode scalar value into a unique sequence of code values."
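
As a rough illustration of the distinction (my own sketch, not
normative text from either standard): the encoding form fixes the
width of the code value in memory, while the encoding scheme fixes
how those code values are serialized as bytes.

    #include <stdint.h>

    /* Encoding form: the same character bound to different widths. */
    uint16_t ucs2_a = 0x0041;   /* 'A' as a UCS-2 (16-bit) code value */
    uint32_t ucs4_a = 0x0041;   /* 'A' as a UCS-4 (32-bit) code value */

    /* Encoding scheme: serializing a 16-bit code value to bytes. */
    void put_utf16be(uint16_t cu, uint8_t out[2])
    {
        out[0] = (uint8_t)(cu >> 8);    /* 'A' -> 00 41 */
        out[1] = (uint8_t)(cu & 0xFF);
    }

    void put_utf16le(uint16_t cu, uint8_t out[2])
    {
        out[0] = (uint8_t)(cu & 0xFF);  /* 'A' -> 41 00 */
        out[1] = (uint8_t)(cu >> 8);
    }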

Up until Unicode 1.1, before the introduction of UTF-16, Unicode and
UCS-2 were basically synonymous, since all Unicode characters were
16 bits. The unspoken missing piece from that time was an explicit
characterization of the encoding scheme for big-endian serialized
Unicode (i.e. UCS-2BE) and little-endian serialized Unicode
(i.e. UCS-2LE).

But with the introduction of UTF-16 in Amendment 1 (now Annex C),
and its parallel adoption in Unicode 2.0, the Unicode Standard is defined
to *be* UTF-16, since it incorporates the interpretation of surrogate
pairs. It doesn't matter that no standard characters have been
assigned using surrogate pairs yet; user-defined characters are
already accessible via surrogate pairs, and they are part of the
mechanism of the standard.
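
Since the mechanism comes up so often, here is a small sketch in C of
the arithmetic involved (my own illustration, not the standard's
reference code; the Plane 2 scalar value in main is an arbitrary
example):

    #include <stdio.h>
    #include <stdint.h>

    /* Encode one Unicode scalar value as UTF-16 code units; values
       beyond the BMP become a surrogate pair.  Returns the number
       of code units written. */
    int to_utf16(uint32_t scalar, uint16_t units[2])
    {
        if (scalar <= 0xFFFF) {        /* BMP: a single code unit */
            units[0] = (uint16_t)scalar;
            return 1;
        }
        scalar -= 0x10000;             /* 20 bits remain */
        units[0] = (uint16_t)(0xD800 + (scalar >> 10));   /* high surrogate */
        units[1] = (uint16_t)(0xDC00 + (scalar & 0x3FF)); /* low surrogate  */
        return 2;
    }

    int main(void)
    {
        uint16_t u[2];
        int n = to_utf16(0x2000B, u);  /* an arbitrary Plane 2 value */
        for (int i = 0; i < n; i++)
            printf("%04X ", u[i]);     /* prints: D840 DC0B */
        printf("\n");
        return 0;
    }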

Saying that my application only handles the BMP, so it is UCS-2, is
a little like saying that I implement Shift-JIS, but since I only
handle the ASCII part of the repertoire, I might as well call it
ASCII.

This all may seem to be a quibble over terms, but next month WG2 will
likely start the balloting for 10646-2, including a large number of
Chinese characters for Plane 2 -- some of which (the Hong Kong set)
will likely be implemented early for Asian support. We may as well
start getting the terms right now, since surrogate pair handling is
just around the bend, even for implementations that think they will
never deal with combining characters, bidi, or anything else exotic
or fancy from the Unicode repertoire.

--Ken

>
> - Frank
>


