Re: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Dec 05 2004 - 11:50:00 CST

    From: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
    > "Philippe Verdy" <verdy_p@wanadoo.fr> writes:
    >
    >>> The point is that indexing should better be O(1).
    >>
    >> SCSU is also O(1) in terms of indexing complexity...
    >
    > It is not. You can't extract the nth code point without scanning the
    > previous n-1 code points.

    The question is why you would need to extract the nth code point so blindly.
    If you have such a reason, because you know the context in which this index
    is valid and usable, then you can just as well extract a sequence using an
    index into the SCSU encoding itself, relying on the same knowledge.

    Linguistically, extracting a substring or a character at an arbitrary index
    in a sequence of code points will only cause you problems. In general, you
    are more likely to use an index as a way to mark a known position that you
    have already parsed sequentially in the past.
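
    As a rough illustration of using an index as a bookmark rather than as a
    blind "nth code point" lookup, here is a minimal Python sketch. The
    function name and the sample text are made up for this example. It scans
    UTF-8 data once, records the byte offset of each word start, and later
    slices directly at a remembered offset without rescanning.

```python
def word_start_offsets(data):
    """Scan UTF-8 bytes once; return the byte offset of each word start."""
    offsets = []
    pos = 0                      # byte position of the current character
    prev_is_space = True
    for ch in data.decode("utf-8"):
        if not ch.isspace() and prev_is_space:
            offsets.append(pos)  # a word begins here: remember the offset
        prev_is_space = ch.isspace()
        pos += len(ch.encode("utf-8"))
    return offsets

# Hypothetical sample text mixing 1-, 2- and 3-byte characters.
text = "naïve café 日本語 text".encode("utf-8")
marks = word_start_offsets(text)

# Later: slice at remembered offsets; no need to count code points again.
third_word = text[marks[2]:marks[3]].decode("utf-8").strip()
```

    The offsets are only meaningful because they came from a sequential parse
    that knew where the boundaries were, which is the situation described
    above.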

    However, it is true that once you have determined a good index position to
    allow future extraction of substrings, SCSU is more complex: you need to
    remember not only the index but also the state of the SCSU decoder at that
    point, so that characters encoded from that index onward can be decoded.
    This is not needed for the UTFs, nor for most legacy character encodings
    and national standards, nor for GB18030, which looks like a valid UTF even
    though it is not part of the Unicode standard itself.
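
    To make the difference concrete, here is a small Python sketch using a toy
    stateful encoding. It is NOT real SCSU: the tag byte, the window rule and
    the default window below are simplifications invented for this example,
    and only mimic the idea that the meaning of a byte depends on a
    previously selected window. With such an encoding a bookmark has to be an
    (offset, state) pair, whereas with UTF-8 the offset alone is enough.

```python
SET_WINDOW = 0x01     # toy tag byte (assumption): next byte selects the window
DEFAULT_BASE = 0x80   # toy default window (assumption)

def decode_from(data, offset, window_base):
    """Decode toy-encoded bytes from `offset`, given the window active there."""
    out = []
    i = offset
    while i < len(data):
        b = data[i]
        if b == SET_WINDOW:
            window_base = data[i + 1] * 0x80   # switch window, emit nothing
            i += 2
        elif b >= 0x80:
            out.append(chr(window_base + (b - 0x80)))  # window-relative byte
            i += 1
        else:
            out.append(chr(b))                 # plain ASCII passes through
            i += 1
    return "".join(out)

# "ab", then select the window at U+0400 (8 * 0x80), then U+0434 U+0430.
data = bytes([0x61, 0x62, SET_WINDOW, 0x08, 0xB4, 0xB0])
whole = decode_from(data, 0, DEFAULT_BASE)

# A bookmark at the last byte must carry the decoder state with it.
bookmark = (5, 0x0400)
with_state = decode_from(data, *bookmark)
without_state = decode_from(data, 5, DEFAULT_BASE)   # wrong character

# UTF-8 for the same text: the byte offset alone is a sufficient bookmark.
utf8 = whole.encode("utf-8")
utf8_tail = utf8[4:].decode("utf-8")
```

    Here with_state and utf8_tail both yield the final Cyrillic letter, while
    without_state yields an unrelated character because the window
    information was lost.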

    But remember the context in which this discussion was introduced: which UTF
    would be the best to represent (and store) large sets of immutable strings.
    The discussion about indexes into substrings is not relevant in that
    context.


