Re: FAQ

From: Mark Davis (mark@macchiato.com)
Date: Thu May 20 1999 - 09:59:32 EDT


Yes, that is what I said.

"- If the storage is UTF-16, then UTF-16 indices are direct. To compute UCS-4 indices you parse from the start of the text."

Your example is UTF-16 text, so the UCS-4 indices are *not* direct--accessing a random UCS-4 index requires scanning from the start of the text. Here are the direct UTF-16 indices, plus the UCS-4 indices computed by parsing from the start.

text:   s  o  m  e  <s1> <s2> t  e  x  t  <s1> <s2>
UTF-16: 0  1  2  3  4    5    6  7  8  9  10   11   12
UCS-4:  0  1  2  3  4         5  6  7  8  9         10

So the 8th UCS-4 code value (UCS-4 index 7) is "x", while the 8th UTF-16 code value (UTF-16 index 7) is "e".
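
To make the two kinds of access concrete, here is a minimal sketch in Python. It assumes the text is held as a list of 16-bit code units. The function names are illustrative only, and since the example leaves <s1> <s2> unspecified, the pair D800 DC00 (U+10000) is used as a stand-in. This is not the routine from the earlier message, just one way to show the difference.

```python
# Surrogate ranges defined by UTF-16.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def is_pair_at(units, i):
    """True if a well-formed surrogate pair starts at UTF-16 index i."""
    return (
        HIGH_START <= units[i] <= HIGH_END
        and i + 1 < len(units)
        and LOW_START <= units[i + 1] <= LOW_END
    )


def code_point_at(units, i):
    """Direct access: the code point that starts at UTF-16 index i."""
    if is_pair_at(units, i):
        high, low = units[i], units[i + 1]
        return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)
    return units[i]


def utf16_index_of(units, n):
    """Scan from the start: UTF-16 index of the n-th code point (0-based)."""
    i = 0
    for _ in range(n):
        if i >= len(units):
            raise IndexError("code point index out of range")
        i += 2 if is_pair_at(units, i) else 1
    if i >= len(units):
        raise IndexError("code point index out of range")
    return i


# The example text, with U+10000 (D800 DC00) assumed for <s1> <s2>.
units = (
    [ord(c) for c in "some"] + [0xD800, 0xDC00]
    + [ord(c) for c in "text"] + [0xD800, 0xDC00]
)

eighth_unit = chr(units[7])                                        # direct
eighth_char = chr(code_point_at(units, utf16_index_of(units, 7)))  # scanned
```

Here eighth_unit is the 8th UTF-16 code value and eighth_char is the 8th UCS-4 code value. The first lookup costs the same wherever it lands in the text; the second has to walk every code unit before the target.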

Does that answer your question?

Mark

----- Original Message -----
From: Yves Arrouye <Yves@centraal.com>
To: Unicode List <unicode@unicode.org>
Sent: Wednesday, May 19, 1999 11:13 AM
Subject: RE: FAQ

>
> > - If the storage is UTF-16, then UTF-16 indices are direct. To compute
> > UCS-4 indices you parse from the start of the text.
> > - If the storage is UCS-4, then UCS-4 indices are direct. To compute
> > UTF-16 indices you parse from the start of the text.
> > - Supporting surrogate pairs does not require using UCS-4 indices.
> >
> > Here is a simple example of a routine that accesses surrogate pairs
> > with UTF-16 indices, and returns them as UCS-4 characters (here called
> > UTF-32):
>
> OK. It works in a loop, but you can't provide random access to the string,
> right? Suppose I have the following, stored as 16-bit words and accessible
> through an str variable:
>
> s o m e <s1> <s2> t e x t <s1> <s2>
>
> (<s1> <s2> is a surrogate pair). I do have 12 words of useful information,
> and only 10 characters. So when I say:
>
> str.getAt(7)
>
> and I mean the 8th character, not the 8th word of storage, I do need to walk
> the string in order to get 'x' and not 'e'. The indices don't seem direct
> then.
>
> Yves.
>


