Re: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Dec 05 2004 - 09:33:40 CST


    ----- Original Message -----
    From: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
    To: <unicode@unicode.org>
    Sent: Sunday, December 05, 2004 1:37 AM
    Subject: Re: Nicest UTF

    > "Philippe Verdy" <verdy_p@wanadoo.fr> writes:
    >
    >> There's nothing that requires the string storage to use the same
    >> "exposed" array,
    >
    > The point is that indexing had better be O(1).

    SCSU also gives O(1) indexing complexity, simply because it keeps an
    exact equivalence with code points and requires a *fixed* (and small)
    number of steps to decode each code point, but also because the decoder
    state uses a *fixed* (and small) number of variables for its internal
    context (unlike more powerful dictionary-based compression algorithms
    in the Lempel-Ziv family, such as deflate).
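    To make that concrete, here is a minimal sketch in Java of the complete
    SCSU decoder state (the field names are mine, not from UTS #6): a fixed,
    small set of variables, not a growing dictionary:

        final class ScsuDecoderState {
            // Single-byte mode vs. Unicode mode (UTS #6 defines both).
            boolean unicodeMode = false;
            // Index of the currently active dynamic window (0..7).
            int activeWindow = 0;
            // The 8 dynamic window offsets, shown with their initial
            // values from UTS #6 (Latin-1 ... Fullwidth ASCII); they may
            // be redefined during decoding, but never grow in number.
            final int[] dynamicOffset = {
                0x0080, 0x00C0, 0x0400, 0x0600,
                0x0900, 0x3040, 0x30A0, 0xFF00
            };
        }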

    > Not having a constant size per code point requires one of three things:
    > 1. Using opaque iterators instead of integer indices.
    > 2. Exposing a different unit in the API.
    > 3. Living with the fact that indexing is not O(1) in general; perhaps
    > with clever caching it's good enough in common cases.
    >
    > Although all three choices can work, I would prefer to avoid them.
    > If I had to, I would probably choose 1. But for now I've chosen a
    > representation based on code points.
    >
    >> Anyway, each time you use an index to access some component of a
    >> String, the returned value is not an immutable String, but a mutable
    >> character or code unit or code point, from which you can build
    >> *other* immutable Strings
    >
    > No, individual characters are immutable in almost every language.
    But individual characters do not always carry any semantics. For
    languages, the relevant unit is almost always the grapheme cluster, not
    the character (and so not its code point...). As grapheme clusters have
    variable lengths, an algorithm that could only work with fixed-width
    units would not work internationally, or would cause serious problems
    for the correct analysis or transformation of natural-language text.
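    For instance, with the standard java.text.BreakIterator (whose
    "character" boundaries approximate grapheme clusters), "e" followed by
    U+0301 is handled as one unit even though it is two code points:

        import java.text.BreakIterator;

        public class GraphemeDemo {
            public static void main(String[] args) {
                String s = "e\u0301tude"; // "étude" with a combining accent
                BreakIterator bi = BreakIterator.getCharacterInstance();
                bi.setText(s);
                int start = bi.first();
                for (int end = bi.next(); end != BreakIterator.DONE;
                         start = end, end = bi.next()) {
                    // Each cluster may span several chars/code points.
                    System.out.println("cluster: " + s.substring(start, end));
                }
            }
        }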

    > Assignment to a character variable can be thought as changing the
    > reference to point to a different character object, even if it's
    > physically implemented by overwriting raw character code.
    >
    >> When you do that, the returned character or code unit or code point
    >> does not guarantee that you'll build valid Unicode strings. In fact,
    >> such a character-level interface is not enough to work with and
    >> transform Strings (for example it does not work for correct
    >> transformation of lettercase, or to manage grapheme clusters).
    >
    > This is a different issue. Indeed transformations like case mapping
    > work in terms of strings, but in order to implement them you must
    > split a string into some units of bounded size (code points, bytes,
    > etc.).

    Yes, but why must this intermediate unit be the code point? Such
    algorithms can be developed with any UTF, or even with compressed
    encoding schemes, through accessor or enumerator methods...
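    A sketch of what I mean (the interface name is mine, purely
    illustrative): the algorithm consumes code points through an enumerator
    and never needs to know whether the backing store is UTF-8, UTF-16,
    SCSU or anything else:

        interface CodePointEnumerator {
            boolean hasNext();
            int next(); // next code point, U+0000..U+10FFFF
        }

        class Algorithms {
            // Written once against the enumerator, and therefore
            // independent of the underlying storage encoding.
            static int countCodePoints(CodePointEnumerator e) {
                int n = 0;
                while (e.hasNext()) { e.next(); n++; }
                return n;
            }
        }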

    > All non-trivial string algorithms boil down to working on individual
    > units, because conditionals and dispatch tables must be driven by
    > finite sets. Any unit of a bounded size is technically workable, but
    > they are not equally convenient. Most algorithms are specified in
    > terms of code points, so I chose code points for the basic unit in
    > the API.

    "Most" is the right term here: this is not a requirement, and it's not
    because it is the simplest way to implement such algorithm that it will be
    the most efficient in terms of performance or resource allocations. Most
    experiences prove that the most efficient algorithms are also complex to
    implement.

    Code points are probably the easiest way to describe what a text
    algorithm is supposed to do, but they are not a requirement for
    applications (in fact, many libraries have been written that correctly
    implement the Unicode algorithms without ever dealing with code points,
    working only with in-memory code units of UTF-16, or even UTF-8 or
    GB18030, or directly with the serialization bytes of UTF-16LE, UTF-8,
    SCSU or other encoding schemes).
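    As a trivial illustration, here is an ASCII-only upcasing routine that
    works directly on serialized UTF-8 bytes, without ever decoding a code
    point:

        class Utf8Ascii {
            // Safe because every byte of a UTF-8 multi-byte sequence has
            // its high bit set (and is negative as a signed Java byte),
            // so only genuine ASCII letters can match the 'a'..'z' test.
            static void asciiUpperInPlace(byte[] utf8) {
                for (int i = 0; i < utf8.length; i++) {
                    byte b = utf8[i];
                    if (b >= 'a' && b <= 'z') {
                        utf8[i] = (byte) (b - 0x20);
                    }
                }
            }
        }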

    Which representation will be best is left to implementers, but I really
    think that compressed schemes are often introduced to increase
    application performance and to reduce the resources needed, both in
    memory and for I/O, but also in networking, where interoperability
    across systems and bandwidth optimization are also important design
    goals...


