L2/01-068

From: Peter_Constable@sil.org
Sent: Thursday, January 25, 2001 9:24 AM

Subject: Agenda Item B.7.4: TR 19 - COMMENTS


From section 1 of UTR 19 draft revision:

<quote>
Different encoding forms of Unicode are useful in different system
environments. For example, UTF-32 is somewhat simpler in usage than UTF-16,
but in almost all cases occupies twice the storage. A common strategy is to
have internal string storage use UTF-16 or UTF-8, but use UTF-32 for
individual character datatypes. Note that UTF-32 does not necessarily match
user-expectations for "characters", which are better matched by grapheme
boundaries, as explained in Chapter 5 of the Unicode Standard.
</quote>

The last sentence is confusing. It is hard to follow because the preceding
sentences are about distinctions between UTF-32 and the other encoding
forms. If I understand what the last sentence intends to say, the statement
is not limited to UTF-32: no matter which encoding form is used, identifying
the next Unicode character doesn't mean you've identified the next text unit
that is meaningful to the user.
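
As a rough illustration of that point (a sketch in Python, not anything taken
from the draft; the helper name utf32_code_units and the sample string are
made up for the example), a base letter followed by a combining accent is one
character to the user but two code units even in UTF-32:

```python
def utf32_code_units(text):
    """Count 32-bit code units in the UTF-32 encoding form of text."""
    # "utf-32-le" is used so that no byte order mark is prepended.
    return len(text.encode("utf-32-le")) // 4

# Hypothetical sample: "e" followed by U+0301 COMBINING ACUTE ACCENT,
# which a reader perceives as the single character "e with acute".
sample = "e\u0301"
units = utf32_code_units(sample)  # two code units, one user-perceived character
```

The same mismatch shows up in UTF-16 and UTF-8, which is why the remark does
not seem specific to UTF-32.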

Perhaps the point is that UTF-32 differs slightly from UTF-16 and UTF-8:
every 32-bit code unit corresponds to a Unicode character (i.e. Unicode
Scalar Value), whereas not every 16-bit code unit in UTF-16 corresponds to a
Unicode character (in the case of surrogates), and similarly for UTF-8. If
this is what needs to be said, then it should be stated clearly. At present,
I don't think the intended meaning, whatever it is, is clear at all.
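
To make the code unit distinction concrete, here is a minimal sketch in
Python (the function name code_unit_counts is my own, chosen for
illustration) that counts the code units a string occupies in each encoding
form:

```python
def code_unit_counts(text):
    """Return the number of code units text occupies in each encoding form."""
    # The "-le" codec variants are used so that no byte order mark is counted.
    return {
        "UTF-8": len(text.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "UTF-32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }

# U+0041 is a single code unit in every encoding form.
ascii_counts = code_unit_counts("\u0041")

# U+10400 lies outside the BMP: a surrogate pair in UTF-16, four bytes in
# UTF-8, and still exactly one code unit in UTF-32.
supplementary_counts = code_unit_counts("\U00010400")
```

For U+10400 this gives 4, 2 and 1 code units respectively, so only in UTF-32
does the code unit count equal the number of Unicode Scalar Values.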


- Peter


---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>