Re: Nicest UTF

From: Marcin 'Qrczak' Kowalczyk (
Date: Sat Dec 04 2004 - 18:37:14 CST

  • Next message: John Hudson: "Re: No Invisible Character - NBSP at the start of a word"

    "Philippe Verdy" <> writes:

    > There's nothing that requires the string storage to use the same
    > "exposed" array,

    The point is that indexing should better be O(1).

    Not having a constant side per code point requires one of three things:
    1. Using opaque iterators instead of integer indices.
    2. Exposing a different unit in the API.
    3. Living with the fact that indexing is not O(1) in general; perhaps
       with clever caching it's good enough in common cases.

    Altough all three choices can work, I would prefer to avoid them.
    If I had to, I would probably choose 1. But for now I've chosen a
    representation based on code points.

    > Anyway, each time you use an index to access to some components of a
    > String, the returned value is not an immutable String, but a mutable
    > character or code unit or code point, from which you can build
    > *other* immatable Strings

    No, individual characters are immutable in almost every language.
    Assignment to a character variable can be thought as changing the
    reference to point to a different character object, even if it's
    physically implemented by overwriting raw character code.

    > When you do that, the returned character or code unit or code point
    > does not guarantee that you'll build valid Unicode strings. In fact,
    > such character-level interface is not enough to work with and
    > transform Strings (for example it does not work to perform correct
    > transformation of lettercase, or to manage grapheme clusters).

    This is a different issue. Indeed transformations like case mapping
    work in terms of strings, but in order to implement them you must
    split a string into some units of bounded size (code points, bytes,

    All non-trivial string algorithms boil down to working on individual
    units, because conditionals and dispatch tables must be driven by
    finite sets. Any unit of a bounded size is technically workable, but
    they are not equally convenient. Most algorithms are specified in
    terms of code points, so I chose code points for the basic unit in
    the API.

    In fact in my language there is no separate character type: a code
    point extracted from a string is represented by a string of length 1.
    It doesn't change the fact that indexing a string by code point index
    should run in constant time, and thus using UTF-8 internally would be
    a bad idea unless we implement one of the three points above.

    > Once you realize that, which UTF you use to handle immutable String
    > objects is not important, because it becomes part of the "blackbox"
    > implementation of String instances.

    The black box must provide enough tools to implement any algorithm
    specified in terms of characters, an algorithm which was not already
    provided as a primitive by the language.

    Algorithms generally scan strings sequentially, but in order to store
    positions to come back to them later you must use indices or some
    iterators. Indices are simpler (and in my case more efficient).

    > Using SCSU for such String blackbox can be a good option if this
    > effectively helps in store many strings in a compact (for global
    > performance) but still very fast (for transformations) representation.

    I disagree. SCSU can be a separate type to be used explicitly, but
    it's a bad idea for the default string representation. Most strings
    are short, and thus constant factors and simplicity matter more than
    the amount of storage. And you wouldn't save much storage anyway:
    as I said, in my representation strings which contain only characters
    U+0000..U+00FF are stored one byte per character. The majority of
    strings in average programs is ASCII.

    In general what I don't like in SCSU is that there is no obvious
    compression algorithm which makes good use of various features. Each
    compression algorithm is either not as powerful as it could, or is
    extremely slow (trying various choices), or is extremely complicated
    (trying only sensible paths).

    > Unfortunately, the immutable String implementations in Java or C#
    > or Python does not allow the application designer to decide which
    > representation will be the best (they are implemented as concrete
    > classes instead of virtual interfaces with possible multiple
    > implementations, as they should; the alternative to interfaces would
    > have been class-level methods allowing the application to trade with
    > the blackbox class implementation the tuning parameters).

    Some functions accept any sequence of characters. Other functions
    accept only standard strings. The question is how often to use each

    Choosing the first option increases flexibility but adds an overhead
    in the common case. For example case mapping of a string would have to
    either perform dispatching functions at each step, or be implemented
    twice. Currently it's implemented for strings only, in C, and thus
    avoids calling a generic indexing function and other overheads. At
    some time I will probably implement it again, to work for arbitrary
    sequences of characters, but it's more work for effects that I don't
    currently need, so it's not a priority.

    Sometimes the flexibility is only on the surface. If I allowed any
    sequence of characters to specify a filename, it would have to be
    converted to an array of characters in order to be passed to the OS
    anyway. And if a string is used as a filename many times, it would
    be converted many times.

    You are free to use other string representations in your code,
    especially in dynamically typed languages which constrain the type
    only in low-level functions like using a string as a filename. But
    there is always some primitive character or string type, to make a
    generic string API possible at all (what would your virtual interfaces
    use?), and generally some primitive string type to make common cases
    of string handling fast.

    Note that if the other representation does not allow O(1) indexing by
    code point *and* it's not enough to always scan characters sequentially
    from the beginning to end, *then* you must invent some iteration
    protocol not currently present in my language.

       __("<         Marcin Kowalczyk

    This archive was generated by hypermail 2.1.5 : Sat Dec 04 2004 - 18:42:53 CST