From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Dec 05 2004 - 09:33:40 CST
----- Original Message -----
From: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
To: <unicode@unicode.org>
Sent: Sunday, December 05, 2004 1:37 AM
Subject: Re: Nicest UTF
> "Philippe Verdy" <verdy_p@wanadoo.fr> writes:
>
>> There's nothing that requires the string storage to use the same
>> "exposed" array,
>
> The point is that indexing should better be O(1).
SCSU is also O(1) in terms of indexing complexity... simply because it keeps
the exact equivalence with codepoints, and requires a *fixed* (and small)
number of steps to decode it to code points, but also because the decoder
states uses a *fixed* (and small) number of variables for the internal
context (unlike more powerful compression algorithms like dictionnary-based,
Lempel-Ziv-Welsh-like, algorithms such as deflate).
> Not having a constant side per code point requires one of three things:
> 1. Using opaque iterators instead of integer indices.
> 2. Exposing a different unit in the API.
> 3. Living with the fact that indexing is not O(1) in general; perhaps
> with clever caching it's good enough in common cases.
>
> Altough all three choices can work, I would prefer to avoid them.
> If I had to, I would probably choose 1. But for now I've chosen a
> representation based on code points.
>
>> Anyway, each time you use an index to access to some components of a
>> String, the returned value is not an immutable String, but a mutable
>> character or code unit or code point, from which you can build
>> *other* immatable Strings
>
> No, individual characters are immutable in almost every language.
But individual characters do not always have any semantic. For languages,
the relevant unit is almost always the grapheme cluster, not the character
(so not its code point...). As grapheme clusters need to be represented on
variable lengths, an algorithm that could only work with fixed-width units
would not work internationaly or would cause serious problems for correct
analysis or transformation of true languages.
> Assignment to a character variable can be thought as changing the
> reference to point to a different character object, even if it's
> physically implemented by overwriting raw character code.
>
>> When you do that, the returned character or code unit or code point
>> does not guarantee that you'll build valid Unicode strings. In fact,
>> such character-level interface is not enough to work with and
>> transform Strings (for example it does not work to perform correct
>> transformation of lettercase, or to manage grapheme clusters).
>
> This is a different issue. Indeed transformations like case mapping
> work in terms of strings, but in order to implement them you must
> split a string into some units of bounded size (code points, bytes,
> etc.).
Yes, but why do you want that this intermediate unit be the code point? Such
algorithm can be developped with any UTF, or even with compressed encoding
schemes through accessor or enumerator methods...
> All non-trivial string algorithms boil down to working on individual
> units, because conditionals and dispatch tables must be driven by
> finite sets. Any unit of a bounded size is technically workable, but
> they are not equally convenient. Most algorithms are specified in
> terms of code points, so I chose code points for the basic unit in
> the API.
"Most" is the right term here: this is not a requirement, and it's not
because it is the simplest way to implement such algorithm that it will be
the most efficient in terms of performance or resource allocations. Most
experiences prove that the most efficient algorithms are also complex to
implement.
Code points are probably the easiest thing to describe what an text
algorithm is supposed to do, but this is not a requirement for applications
(in fact many libraries have been written that correctly implement the
Unicode algorithms, without even dealing with code points, but only with
in-memory code units of UTF-16 or even in UTF-8 or GB18030, or directly with
serialization bytes of UTF-16LE or UTF-8 or SCSU or ether encoding schemes).
Which represent will be the best is left to implementers, but I really think
that compressed schemes are often introduced to increase the application
performances and reduce the needed resources both in memory and for I/O, but
also in networking where interoperability across systems and bandwidth
optimization are also important design goals...
This archive was generated by hypermail 2.1.5 : Sun Dec 05 2004 - 09:41:35 CST