Re: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Dec 05 2004 - 11:50:00 CST

Next message: D. Starner: "Re: Unicode for words?"

Previous message: Philippe Verdy: "Re: Unicode for words?"
In reply to: Marcin 'Qrczak' Kowalczyk: "Re: Nicest UTF"
Next in thread: Marcin 'Qrczak' Kowalczyk: "Re: Nicest UTF"
Reply: Marcin 'Qrczak' Kowalczyk: "Re: Nicest UTF"

From: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
> "Philippe Verdy" <verdy_p@wanadoo.fr> writes:
>
>>> The point is that indexing should better be O(1).
>>
>> SCSU is also O(1) in terms of indexing complexity...
>
> It is not. You can't extract the nth code point without scanning the
> previous n-1 code points.

The question is why you would need to extract the nth code point so blindly.
If you have a reason to, because you know the context in which that index
is valid and usable, then the same knowledge lets you extract a sequence
using an index into the SCSU encoding itself.

Linguistically, extracting a substring or a character at an arbitrary index
in a sequence of code points will only cause you problems. In general, you
are more likely to use an index as a way to mark a known position that you
have already reached by parsing sequentially.
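
To illustrate the point about arbitrary indexes, here is a minimal Python
sketch. The function name safe_cut_points is hypothetical, and the test it
applies (canonical combining class only) is a deliberate simplification of
real grapheme cluster segmentation; it merely shows that some code point
indexes are poor places to cut, and that usable ones come out of a
sequential pass.

```python
import unicodedata

def safe_cut_points(text):
    """Return code point indexes where a cut does not orphan a combining mark.

    Simplification (an assumption of this sketch): only the canonical
    combining class is checked, not the full grapheme cluster rules.
    """
    points = [0]
    for i in range(1, len(text)):
        # A cut just before a combining mark would separate it from its base.
        if unicodedata.combining(text[i]) == 0:
            points.append(i)
    points.append(len(text))
    return points

# "resume" with combining acute accents after both e's (8 code points).
word = "re\u0301sume\u0301"

# Index 2 is a valid code point index, but cutting there strands the accent.
head, tail = word[:2], word[2:]
print(unicodedata.combining(tail[0]) != 0)

# Indexes found by scanning sequentially avoid that problem.
print(safe_cut_points(word))
```

An index taken from such a scan is a bookmark for a position already
examined, which is the kind of index described above.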

However, it is true that once you have determined a good index position for
future substring extraction, SCSU is more complex: you need to remember not
only the index but also the state of the SCSU decoder at that point, so that
characters encoded from that index onward can be decoded. This is not needed
for the UTFs, nor for most legacy character encodings and national
standards, nor for GB18030, which looks like a valid UTF even though it is
not part of the Unicode standard itself.
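
The difference can be sketched in Python. The encoding below is a toy
invented for this illustration and is NOT SCSU: it only borrows the idea of
a switchable 128-character window, and the names SWITCH, toy_encode and
toy_decode are hypothetical. It shows that a bookmark into a stateful
encoding has to carry the decoder state along with the byte offset, whereas
a byte offset alone is enough for UTF-8.

```python
SWITCH = 0x01  # toy control byte: the next byte selects the active window

def toy_encode(text):
    """Encode with a toy windowed scheme (not SCSU); code points < 0x8000 only."""
    out = bytearray()
    base = 0x80  # initial window covers U+0080..U+00FF
    for ch in text:
        cp = ord(ch)
        if cp >= 0x8000:
            raise ValueError("toy scheme only covers code points below U+8000")
        if cp < 0x80 and cp != SWITCH:
            out.append(cp)  # ASCII passes through unchanged
            continue
        if not (base <= cp < base + 0x80):
            base = cp & ~0x7F  # move the window to the block containing cp
            out += bytes([SWITCH, base >> 7])
        out.append(0x80 + (cp - base))
    return bytes(out)

def toy_decode(data, offset=0, base=0x80):
    """Decode from a byte offset, given the window base in effect there."""
    chars = []
    i = offset
    while i < len(data):
        b = data[i]
        if b == SWITCH:
            base = data[i + 1] << 7
            i += 2
        elif b < 0x80:
            chars.append(chr(b))
            i += 1
        else:
            chars.append(chr(base + (b - 0x80)))
            i += 1
    return "".join(chars)

text = "abc \u041f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440"
prefix, wanted = text[:11], text[11:]  # bookmark the last word

data = toy_encode(text)
offset = len(toy_encode(prefix))  # byte position reached by sequential encoding

# Offset plus saved state (window base 0x400) recovers the substring.
with_state = toy_decode(data, offset, base=0x400)
# Offset alone falls back to the initial window and yields wrong characters.
without_state = toy_decode(data, offset)

# UTF-8 is stateless: a byte offset on a code point boundary is sufficient.
utf8 = text.encode("utf-8")
from_utf8 = utf8[len(prefix.encode("utf-8")):].decode("utf-8")

print(with_state == wanted, without_state == wanted, from_utf8 == wanted)
```

In this sketch the saved state is a single integer; a real SCSU decoder
would have more to save, since its state is richer than one window base.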

But remember the context in which this discussion was introduced: which UTF
would be the best to represent (and store) large sets of immutable strings.
The discussion about indexes into substrings is not relevant in that
context.

