Re: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Dec 04 2004 - 09:39:01 CST


    From: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
    > "Philippe Verdy" <verdy_p@wanadoo.fr> writes:
    >
    >> Random access by code point index means that you don't use strings
    >> as immutable objects,
    >
    > No. Look at Python, Java and C#: their strings are immutable (don't
    > change in-place) and are indexed by integers (not necessarily by code
    > points, but it doesn't change the point).

    Those strings are not indexed. They are merely accessible through methods or
    accessors that act *as if* they were arrays. Nothing requires the string
    storage to use the same "exposed" array, and in fact you can just as well
    work with immutable strings as if they were vectors of code points, vectors
    of code units, or sometimes vectors of bytes.

    Note for example the difference between the .length property of Java arrays
    and the .length() method of Java String instances...
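
    As a rough illustration of the "as if" point (in Python rather than Java,
    and with an invented class name, OpaqueText, that belongs to no real
    library), here is a sketch of an immutable text value whose private storage
    is UTF-8 while its accessors expose three different vector-like views:

```python
class OpaqueText:
    """Hypothetical immutable text value; the internal storage is UTF-8."""

    __slots__ = ("_data",)

    def __init__(self, text: str):
        # The storage is private: callers only ever see the accessors below.
        object.__setattr__(self, "_data", text.encode("utf-8"))

    def __setattr__(self, name, value):
        raise AttributeError("OpaqueText is immutable")

    def code_points(self):
        """View the text as a vector of code points."""
        return tuple(ord(c) for c in self._data.decode("utf-8"))

    def utf16_units(self):
        """View the text as a vector of UTF-16 code units."""
        raw = self._data.decode("utf-8").encode("utf-16-be")
        return tuple(int.from_bytes(raw[i:i + 2], "big")
                     for i in range(0, len(raw), 2))

    def utf8_bytes(self):
        """View the text as a vector of bytes."""
        return tuple(self._data)


t = OpaqueText("a\u20ac\U0001D11E")   # 'a', EURO SIGN, MUSICAL SYMBOL G CLEF
print(len(t.code_points()))           # 3 code points
print(len(t.utf16_units()))           # 4 code units (one surrogate pair)
print(len(t.utf8_bytes()))            # 8 bytes (1 + 3 + 4)
```

    None of the three "lengths" tells you how the object is actually stored.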

    Note also that the "conversion" of an array of bytes, code units or code
    points to a String requires distinct constructors, and that the storage is
    copied rather than simply referenced (the main reason being that indexed
    vectors or arrays have mutable content, whereas String instances do not,
    which is what makes them sharable).
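
    The copy-on-conversion behaviour is easy to observe. A small Python sketch
    (Python's bytearray and str stand in here for the Java byte[] and String):

```python
raw = bytearray("caf\u00e9".encode("utf-8"))  # mutable array of bytes
text = raw.decode("utf-8")                    # decoding builds a new, separate str

raw[0] = ord("C")                             # mutate the source array afterwards

print(text)                  # still "café": the string did not share the storage
print(raw.decode("utf-8"))   # "Café": only the array changed

alias = text                 # sharing an immutable string is always safe
```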

    Anyway, each time you use an index to access some component of a String,
    the returned value is not an immutable String, but a mutable character,
    code unit or code point, from which you can build *other* immutable Strings
    (using for example mutable StringBuffer or StringBuilder objects, or similar
    objects in other languages). When you do that, nothing guarantees that the
    characters, code units or code points you assemble will form valid Unicode
    strings. In fact, such a character-level interface is not enough to work
    with and transform Strings (for example, it cannot perform correct
    lettercase transformations, or manage grapheme clusters). The most
    powerful (and universal) transformations are those that don't use these
    interfaces directly, but that work on complete Strings and return complete
    Strings.
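
    Three small cases, sketched in Python 3 (whose str happens to be indexed by
    code point), show where the character-level interface falls short: case
    mapping that depends on context, a combining sequence split by naive
    reversal, and a lone surrogate that cannot be encoded:

```python
# 1. Lettercase: the Greek capital sigma lowercases differently at word end.
word = "\u039f\u0394\u039f\u03a3"                  # "ΟΔΟΣ"
per_char = "".join(ch.lower() for ch in word)      # character by character
whole = word.lower()                               # whole-string transformation
print(per_char == whole)                           # False
print(hex(ord(per_char[-1])), hex(ord(whole[-1]))) # 0x3c3 0x3c2

# 2. Grapheme clusters: reversing code points moves the accent to another letter.
cluster = "e\u0301x"                               # "e" + COMBINING ACUTE + "x"
reversed_cp = cluster[::-1]                        # "x" + COMBINING ACUTE + "e"

# 3. Validity: a single code point taken from a surrogate pair is not text.
lone = chr(0xD83D)
try:
    lone.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)                                   # False
```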

    The character-level APIs are a convenience for very basic legacy
    transformations, but on their own they do not solve most
    internationalization problems; otherwise they serve as a "protected"
    interface that allows building more powerful String-to-String
    transformations.

    Once you realize that, which UTF you use to handle immutable String objects
    is not important, because it becomes part of the "blackbox" implementation
    of String instances. If you then consider the UTF as a blackbox, the real
    arguments for one UTF or another depend on the set of String-to-String
    transformations you want to use (because it conditions the implementation
    of these transformations), but more importantly the choice affects the
    efficiency of the String storage allocation.

    For this reason, the blackbox can determine by itself which UTF or internal
    encoding is best for performing those transformations: the total volume of
    immutable string instances to handle in memory and the frequency of their
    instantiation determine which representation to use (because large String
    volumes put pressure on the memory manager and seriously impact overall
    application performance).
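
    A minimal sketch of the storage side of that decision, assuming the only
    criterion is byte count (the function name smallest_utf is invented for the
    illustration; a real blackbox would also weigh transformation speed and
    instantiation frequency):

```python
def smallest_utf(text: str) -> str:
    """Return the name of the UTF that stores `text` in the fewest bytes."""
    candidates = ("utf-8", "utf-16-le", "utf-32-le")
    sizes = {name: len(text.encode(name)) for name in candidates}
    # On a tie, min() keeps the first candidate in the order listed above.
    return min(sizes, key=sizes.get)


print(smallest_utf("hello"))                # utf-8     (5 bytes vs 10 vs 20)
print(smallest_utf("\u65e5\u672c\u8a9e"))   # utf-16-le (6 bytes vs 9 vs 12)
```

    Note that on size alone UTF-32 can never win against UTF-16, so any argument
    for it has to come from the transformations, not from the storage.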

    Using SCSU for such a String blackbox can be a good option if it effectively
    helps to store many strings in a representation that is compact (for global
    performance) but still very fast (for transformations).

    Unfortunately, the immutable String implementations in Java, C# or Python
    do not allow the application designer to decide which representation is
    best (they are implemented as concrete classes instead of virtual
    interfaces with multiple possible implementations, as they should be; the
    alternative to interfaces would have been class-level methods allowing the
    application to negotiate tuning parameters with the blackbox class
    implementation).
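
    What such a design could look like, as a hedged Python sketch: all the
    names below (Text, Utf8Text, Utf16Text, make_text, compact_ascii) are
    hypothetical and exist in no standard library. One abstract interface, two
    interchangeable storage strategies, and a factory through which the
    application states its preference:

```python
from abc import ABC, abstractmethod


class Text(ABC):
    """Hypothetical immutable text interface; the storage is hidden."""

    @abstractmethod
    def to_str(self) -> str:
        """Return the content as a built-in str."""

    @abstractmethod
    def storage_size(self) -> int:
        """Return the number of bytes used by the internal storage."""

    def upper(self) -> "Text":
        # A String-to-String transformation written once against the
        # interface; the result keeps the representation of the receiver.
        return type(self)(self.to_str().upper())


class Utf8Text(Text):
    def __init__(self, s: str):
        self._data = s.encode("utf-8")

    def to_str(self) -> str:
        return self._data.decode("utf-8")

    def storage_size(self) -> int:
        return len(self._data)


class Utf16Text(Text):
    def __init__(self, s: str):
        self._data = s.encode("utf-16-le")

    def to_str(self) -> str:
        return self._data.decode("utf-16-le")

    def storage_size(self) -> int:
        return len(self._data)


def make_text(s: str, compact_ascii: bool = True) -> Text:
    """Tuning knob: the application, not the library, picks the storage."""
    return Utf8Text(s) if compact_ascii else Utf16Text(s)


a = make_text("hello", compact_ascii=True)
b = make_text("hello", compact_ascii=False)
print(a.storage_size(), b.storage_size())   # 5 10
print(a.upper().to_str())                   # HELLO
```

    Code written against Text never needs to know which of the two it holds.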

    There are other classes or libraries in which such multiple representations
    are possible, and easily and transparently convertible from one to another.
    (Note that this discussion concerns the UTF used to represent code points,
    but today there is also a need to work on strings within grapheme cluster
    boundaries and in the various normalization forms. A few libraries do exist
    in which the normalization can be changed without affecting the "immutable"
    aspect of Strings, the complexity being that Strings do not always
    represent plain text...)
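
    For the normalization part, Python's standard unicodedata module gives a
    simple picture of what "changing the normalization without losing
    immutability" means: each call returns a new string and leaves its argument
    untouched.

```python
import unicodedata

s = "e\u0301"                              # "e" + COMBINING ACUTE ACCENT
nfc = unicodedata.normalize("NFC", s)      # composed form: U+00E9
nfd = unicodedata.normalize("NFD", nfc)    # decomposed again

print(len(s), len(nfc), len(nfd))          # 2 1 2
print(nfc == nfd)                          # False, although canonically equivalent
print(s == "e\u0301")                      # True: the original was never modified
```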


