L2/01-068

From: Peter_Constable@sil.org
Sent: Thursday, January 25, 2001 9:24 AM
Subject: Agenda Item B.7.4: TR 19 - COMMENTS

From section 1 of UTR 19 draft revision:

    Different encoding forms of Unicode are useful in different system
    environments. For example, UTF-32 is somewhat simpler in usage than
    UTF-16, but in almost all cases occupies twice the storage. A common
    strategy is to have internal string storage use UTF-16 or UTF-8, but
    use UTF-32 for individual character datatypes. Note that UTF-32 does
    not necessarily match user-expectations for "characters", which are
    better matched by grapheme boundaries, as explained in Chapter 5 of
    the Unicode Standard.

The last sentence is confusing. To me, it is hard to understand because the preceding sentences are about distinctions between UTF-32 and the other encoding forms. If I understand what the last sentence intends to say, it is not a statement that is limited to UTF-32: no matter which encoding form is used, identifying the next Unicode character does not mean you have identified the next text unit that is meaningful to the user.

Perhaps the point is that the situation is slightly different from that of UTF-16 and UTF-8: every 32-bit code unit in UTF-32 corresponds to a Unicode character (i.e. a Unicode Scalar Value), whereas not every 16-bit code unit in UTF-16 corresponds to a Unicode character (in the case of surrogates), and similarly for the 8-bit code units of UTF-8. If this is what needs to be said, then it should be stated clearly. At present, I don't think the intended meaning, whatever it is, is clear at all.

- Peter

---------------------------------------------------------------------------
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail:
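
The distinction drawn in the comments above can be illustrated with a minimal sketch in Python. It is not part of UTR 19; the function name `code_units` is hypothetical and chosen only for this example. It assumes Python 3, where `len()` on a string counts code points, and it counts the code units that each encoding form needs for the same text.

```python
def code_units(text: str) -> dict:
    """Count code points and code units per encoding form (illustrative helper)."""
    return {
        # In Python 3, len() of a str counts code points.
        "code_points": len(text),
        # UTF-8 code units are 8 bits wide, so one byte each.
        "utf8": len(text.encode("utf-8")),
        # UTF-16 code units are 16 bits wide; "-le" avoids a byte order mark.
        "utf16": len(text.encode("utf-16-le")) // 2,
        # UTF-32 code units are 32 bits wide; "-le" avoids a byte order mark.
        "utf32": len(text.encode("utf-32-le")) // 4,
    }


# A supplementary-plane character: one code point, one UTF-32 code unit,
# but a surrogate pair in UTF-16 and four code units in UTF-8.
supplementary = code_units("\U00010400")

# "e" followed by U+0301 COMBINING ACUTE ACCENT: a user would likely see one
# "character", yet it is two code points in every encoding form, UTF-32 included.
combining = code_units("e\u0301")
```

In the first case only UTF-32 has a one-to-one match between code units and code points. In the second case the count is two in UTF-32 as well, which is the sense in which no encoding form by itself identifies the text unit a user would call a character.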