RE: FAQ

From: Yves Arrouye (Yves@centraal.com)
Date: Fri May 21 1999 - 00:46:14 EDT

Next message: PAUL BAKER: "Unicode corpus tools/missing characters"
Previous message: Mark Davis: "Re: FAQ"
Maybe in reply to: mark.davis@us.ibm.com: "RE: FAQ"
Next in thread: Rick McGowan: "Re: RE: FAQ"
Maybe reply: Rick McGowan: "Re: RE: FAQ"

> Yes, that is what I said.
>
> "- If the storage is UTF-16, then UTF-16 indices are direct. To compute
> UCS-4 indices you parse from the start of the text."
>
> Your example is UTF-16 text, so the UCS-4 indices are *not*
> direct--accessing a random UCS-4 index requires scanning from the start
> of the text. Here are the direct UTF-16 indices, plus the UCS-4 indices
> computed by parsing from the start.
>
> text:   s o m e <s1> <s2> t e x t <s1> <s2>
> UTF-16: 0 1 2 3 4    5    6 7 8 9 10   11   12
> UCS-4:  0 1 2 3 4         5 6 7 8 9         10
>
> So the 8th UCS-4 code value is "x", while the 8th UTF-16 code value is
> "e".
>
> Does that answer your question?

It does. I took the indexes to be machine-word-based (2 bytes for UTF-16,
4 for UCS-4), not character-based. If they're character-based, then yes, the
access is direct, though the mapping from the index to the actual range of
bytes representing the character is not.
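
To make that last point concrete, here is a rough sketch in Python of the
scan from the start of the text. The function names are made up for
illustration, the code units are modeled as a plain sequence of 16-bit
integers, and U+10000 stands in for the <s1> <s2> pairs of the example; none
of this comes from an actual library.

```python
import struct

def is_high_surrogate(u):
    return 0xD800 <= u <= 0xDBFF

def is_low_surrogate(u):
    return 0xDC00 <= u <= 0xDFFF

def is_pair_at(units, offset):
    """True if a well-formed surrogate pair starts at this code-unit offset."""
    return (is_high_surrogate(units[offset])
            and offset + 1 < len(units)
            and is_low_surrogate(units[offset + 1]))

def utf16_offset_of_char(units, char_index):
    """Code-unit offset of the char_index-th character (0-based).

    Not direct: the text is parsed from the start, and each surrogate
    pair is counted as a single character.
    """
    offset = 0
    for _ in range(char_index):
        if offset >= len(units):
            raise IndexError("character index out of range")
        offset += 2 if is_pair_at(units, offset) else 1
    if offset >= len(units):
        raise IndexError("character index out of range")
    return offset

def byte_range_of_char(units, char_index):
    """Half-open byte range (start, end) of a character in the UTF-16 storage."""
    start = utf16_offset_of_char(units, char_index)
    width = 2 if is_pair_at(units, start) else 1
    return (2 * start, 2 * (start + width))

# The example text: "some", one supplementary character, "text", another one.
data = "some\U00010000text\U00010000".encode("utf-16-le")
units = struct.unpack("<%dH" % (len(data) // 2), data)

print(chr(units[7]))                            # 8th UTF-16 code value
print(chr(units[utf16_offset_of_char(units, 7)]))  # 8th character
print(byte_range_of_char(units, 4))             # bytes of the first pair
```

Indexing the code units is a plain array access, while every character index
costs a walk over everything before it.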

Yves.

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:46 EDT