RE: FAQ

From: mark.davis@us.ibm.com
Date: Wed May 19 1999 - 14:06:12 EDT


Thanks, I'm glad you found it useful.

You bring up a good point about indexing. It sounds like the text was a bit
confusing--I'll try to restate it.

- If the storage is UTF-16, then UTF-16 indices are direct. To compute UCS-4
indices you parse from the start of the text.
- If the storage is UCS-4, then UCS-4 indices are direct. To compute UTF-16
indices you parse from the start of the text.
- Supporting surrogate pairs does not require using UCS-4 indices.
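
To make the first point concrete, here is a minimal sketch of computing a UTF-16 index from a UCS-4 index by parsing from the start of the text. The class and method names (IndexSketch, utf16IndexOf) are made up for illustration and are not part of any library; unpaired surrogates are assumed to count as one character each.

```java
class IndexSketch {
    // Hypothetical helper: returns the UTF-16 index of the character whose
    // UCS-4 index is index32, by walking the text from the start.
    static int utf16IndexOf(CharSequence source, int index32) {
        int index16 = 0;
        for (int n = 0; n < index32; ++n) {
            if (index16 >= source.length()) {
                throw new IndexOutOfBoundsException("index32 = " + index32);
            }
            char hi = source.charAt(index16);
            ++index16;
            // A high surrogate followed by a low surrogate is one character
            // occupying two UTF-16 code units, so step over the second unit.
            if (0xD800 <= hi && hi <= 0xDBFF && index16 < source.length()) {
                char lo = source.charAt(index16);
                if (0xDC00 <= lo && lo <= 0xDFFF) {
                    ++index16;
                }
            }
        }
        return index16;
    }
}
```

The cost is linear in index32, which is the price of asking for UCS-4 indices into UTF-16 storage; code that keeps UTF-16 indices throughout never pays it.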

Here is a simple example of a routine that accesses surrogate pairs with UTF-16
indices, and returns them as UCS-4 characters (here called UTF-32):

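    // Each surrogate carries 10 bits, so the high surrogate is shifted by 10.
    // OFFSET32 is chosen so that the pair D800 DC00 maps to 0x10000.
    static final int OFFSET32 = 0x10000 - (0xD800 << 10) - 0xDC00;

    // Returns the UCS-4 character starting at UTF-16 index index16.
    // source is assumed to be the UTF-16 text (for example, a String field).
    int getUTF32(int index16) {
        int ch32 = source.charAt(index16);
        // Is this a high surrogate with at least one code unit after it?
        if (0xD800 <= ch32 && ch32 <= 0xDBFF && index16+1 < source.length()) {
            char ch32Low = source.charAt(index16+1);
            if (0xDC00 <= ch32Low && ch32Low <= 0xDFFF) {
                ch32 = (ch32 << 10) + ch32Low + OFFSET32;
            }
        }
        return ch32; // an unpaired surrogate is returned unchanged
    }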

Here is a loop that uses it (there are a number of different ways to do this).

        for (int i = 0; i <= charIndex; ++i) {
            int ch32 = getUTF32(i);
            // ... do something with ch32 here ...
            if (ch32 > 0xFFFF) ++i; // adjust index for surrogates
        }
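
As one of those other ways, here is a self-contained sketch that walks a whole UTF-16 text and collects the UCS-4 values. The names (SurrogateLoopSketch, toUTF32) are hypothetical, and the pair is combined with the subtract-then-shift form of the arithmetic rather than a precomputed offset.

```java
class SurrogateLoopSketch {
    // Hypothetical helper: converts UTF-16 text to an array of UCS-4 values.
    static int[] toUTF32(CharSequence source) {
        int[] result = new int[source.length()]; // upper bound on the count
        int count = 0;
        for (int i = 0; i < source.length(); ++i) {
            int ch32 = source.charAt(i);
            if (0xD800 <= ch32 && ch32 <= 0xDBFF && i + 1 < source.length()) {
                char low = source.charAt(i + 1);
                if (0xDC00 <= low && low <= 0xDFFF) {
                    // 10 bits from each surrogate, plus the 0x10000 base.
                    ch32 = 0x10000 + ((ch32 - 0xD800) << 10) + (low - 0xDC00);
                    ++i; // skip the low surrogate just consumed
                }
            }
            result[count++] = ch32;
        }
        // Trim the array to the number of characters actually found.
        int[] trimmed = new int[count];
        System.arraycopy(result, 0, trimmed, 0, count);
        return trimmed;
    }
}
```

The loop index is still a UTF-16 index throughout; only the values handed to the caller are UCS-4.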

Mark
___
Mark Davis, IBM Center for Java Technology, Cupertino
(408) 777-5850 [fax: 5891], mark.davis@us.ibm.com, president@unicode.org
http://maps.yahoo.com/py/maps.py?Pyt=Tmap&addr=10275+N.+De+Anza&csz=95014

Yves Arrouye <Yves@centraal.com> on 05/18/99 11:24:04 PM

To: Mark Davis/Cupertino/IBM@IBMUS
cc:
Subject: RE: Unicode FAQ

> There is a draft Unicode FAQ on http://www.unicode.org/unicode/faq/.
> Although still preliminary, it should contain some useful information
> culled from a number of different sources.

It's great!

I do believe, though, that the part about UTF-16 and UCS-4, stating that
most implementations use UTF-16 and thus have efficient indexing, is
misleading. Most implementations do not handle surrogate pairs, and thus can
have efficient indexing since they assume that every character can be
represented with a single 16-bit word (UCS-2 (?)). If they were to handle
surrogate characters, they would have to do the same kind of string walking
to skip over surrogate pairs that is shown in the sample code, which uses an
index based on a UCS-4 string view to access a string of characters stored
in UTF-16, not UCS-2.

Or am I just confused?
Yves.


