From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Dec 05 2004 - 11:50:00 CST
From: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
> "Philippe Verdy" <verdy_p@wanadoo.fr> writes:
>
>>> The point is that indexing should better be O(1).
>>
>> SCSU is also O(1) in terms of indexing complexity...
>
> It is not. You can't extract the nth code point without scanning the
> previous n-1 code points.
The question is why you would need to extract the nth codepoint so blindly.
If you have such reasons, because you know the context in which this index
is valid and usable, then you can as well extract a sequence using an index
in the SCSU encoding itself using the same knowledge.
Linguistically, extracting a substring or characters at an arbitrary index
in a sequence of code points will only cause you problems. In general, you
are more likely to use an index as a way to mark a known position that you
have already parsed sequentially in the past.
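
To illustrate what I mean by an index that marks an already parsed
position, here is a minimal sketch in Python. It assumes UTF-8 storage and
space-separated words; the name word_starts is hypothetical and only serves
the example. The offsets are byte offsets found during one sequential scan,
and they can be reused later without counting code points.

```python
def word_starts(utf8):
    """Scan once and record the byte offset at which each word begins."""
    starts = []
    at_start = True
    for i, b in enumerate(utf8):
        if b == 0x20:              # a space never occurs inside a multi-byte sequence
            at_start = True
        elif at_start:
            starts.append(i)
            at_start = False
    return starts

text = "naïve café society".encode("utf-8")
marks = word_starts(text)          # positions remembered from the sequential parse
# Later: slice directly at the remembered byte offsets, no rescanning needed.
second = text[marks[1]:marks[2]].decode("utf-8").strip()
```

The marks are only meaningful because the scan established them; an
arbitrary integer would not be.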
However it is true that if you have determined a good index position to
allow future extraction of substrings, SCSU will be more complex because you
not only need to remember the index, but also the current state of the SCSU
decoder, to allow decoding characters encoded starting at that index. This
is not needed for UTFs and most legacy character encodings, or national
standards, or GB18030 which looks like a valid UTF, even though it is not
part of the Unicode standard itself.
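
A rough sketch of that extra bookkeeping, in Python. This is not a real
SCSU decoder: it assumes a toy subset (single-byte mode, SC0..SC7 window
selection tags, the default window offsets, no window redefinition), and
decode_from is a hypothetical helper. It only shows that a saved position
must carry the active window along with the byte offset.

```python
# Default dynamic-window offsets as listed in the SCSU specification (UTS #6).
DEFAULT_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]

def decode_from(data, offset=0, active=0):
    """Decode a toy SCSU subset from `offset`, assuming window `active`.

    Hypothetical helper: handles only SC0..SC7 tags, ASCII bytes and bytes
    relative to the active window. Windows are never redefined.
    """
    out = []
    for b in data[offset:]:
        if 0x10 <= b <= 0x17:          # SCn: switch the active window
            active = b - 0x10
        elif b >= 0x80:                # byte relative to the active window
            out.append(chr(DEFAULT_WINDOWS[active] + (b - 0x80)))
        elif b >= 0x20:                # plain ASCII
            out.append(chr(b))
        else:
            raise ValueError(f"tag 0x{b:02X} is outside this toy subset")
    return "".join(out)

data = bytes([0x12, 0x9F, 0xB0])       # SC2 (Cyrillic window), then two letters
whole = decode_from(data)              # U+041F U+0430
resumed = decode_from(data, 2, 2)      # offset plus saved state: U+0430
wrong = decode_from(data, 2)           # offset alone: U+00B0, a different character
```

With a UTF the saved position would be the offset alone; here the pair
(offset, active window) is the minimum, and a full SCSU decoder would have
more state to save than this subset shows.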
But remember the context in which this discussion was introduced: which UTF
would be the best to represent (and store) large sets of immutable strings.
The discussion about indexes in substrings is not relevant in that
context.