Re: Nicest UTF

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Dec 04 2004 - 09:39:01 CST

Next message: Rene Hache: "Re: latin equivalent to specific indian characters"

Previous message: Peter Constable: "RE: OpenType vs TrueType (was current version of unicode-font)"
Maybe in reply to: Theodore H. Smith: "Nicest UTF"
Next in thread: Marcin 'Qrczak' Kowalczyk: "Re: Nicest UTF"
Reply: Marcin 'Qrczak' Kowalczyk: "Re: Nicest UTF"

From: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
> "Philippe Verdy" <verdy_p@wanadoo.fr> writes:
>
>> Random access by code point index means that you don't use strings
>> as immutable objects,
>
> No. Look at Python, Java and C#: their strings are immutable (don't
> change in-place) and are indexed by integers (not necessarily by code
> points, but it doesn't change the point).

Those strings are not indexed. They are just accessible through methods or
accessors that act *as if* they were arrays. Nothing requires the string
storage to be the same as the "exposed" array, and in fact you can just as
well work on immutable strings as if they were vectors of code points,
vectors of code units, or sometimes vectors of bytes.
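
As a rough illustration, here is a Python sketch (the language choice and the
variable names are assumptions of this example, not something prescribed
above) in which one immutable string is viewed as a vector of code points, of
UTF-16 code units, and of UTF-8 bytes. Each view has its own length, and none
of them needs to match the internal storage:

```python
# One immutable string, three "vector" views of it.
s = "a\u00e9\U0001d11e"  # 'a', 'e' with acute, MUSICAL SYMBOL G CLEF (non-BMP)

# View as code points.
code_points = [ord(c) for c in s]

# View as UTF-16 code units (two bytes each, little-endian here).
utf16 = s.encode("utf-16-le")
code_units = [int.from_bytes(utf16[i:i + 2], "little")
              for i in range(0, len(utf16), 2)]

# View as UTF-8 bytes.
utf8_bytes = list(s.encode("utf-8"))

print(len(code_points), len(code_units), len(utf8_bytes))
```

The non-BMP character counts as one code point, two UTF-16 code units (a
surrogate pair) and four UTF-8 bytes, so the three "array" views disagree on
both length and index.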

Note for example the difference between the .length property of Java arrays,
and the .length() method of java String instances...

Note also that the "conversion" of an array of bytes, code units or code
points to a String requires distinct constructors, and that the storage is
copied rather than simply referenced (the main reason being that the indexed
content of vectors or arrays is mutable, whereas String instances are not,
which is what makes them sharable).
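
The same copy-on-conversion behaviour can be sketched in Python, used here
only as an analogue of the Java constructors mentioned above: decoding a
mutable `bytearray` builds a new immutable `str`, so later changes to the
array do not leak into the string.

```python
raw = bytearray("caf\u00e9".encode("utf-8"))  # mutable byte storage
text = raw.decode("utf-8")                    # builds a new, immutable str

raw[0] = ord("k")                             # mutate the original storage

print(text)                 # the string built earlier is unaffected
print(raw.decode("utf-8"))  # a fresh conversion sees the change
```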

Anyway, each time you use an index to access some component of a String,
the returned value is not an immutable String, but a mutable character or
code unit or code point, from which you can build *other* immutable Strings
(using for example mutable StringBuffer or StringBuilder objects, or similar
objects in other languages). When you do that, nothing guarantees that the
characters, code units or code points you reassemble will form valid Unicode
strings. In fact, such a character-level interface is not enough to work with
and transform Strings (for example, it cannot perform correct lettercase
transformations, or manage grapheme clusters). The most powerful (and
universal) transformations are those that don't use these interfaces
directly, but that work on complete Strings and return complete Strings.
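
A small Python sketch of why per-element access falls short (the helper name
`lower_per_code_point` is hypothetical, and the behaviour shown assumes a
CPython 3 `str.lower()` that applies the final-sigma rule):

```python
def lower_per_code_point(s):
    # Lowercase each code point in isolation, then reassemble the pieces.
    return "".join(c.lower() for c in s)

word = "\u039f\u0394\u039f\u03a3"   # Greek capitals: omicron delta omicron sigma

whole = word.lower()                     # whole-string transformation
piecewise = lower_per_code_point(word)   # character-level transformation

# Reversing by code point also breaks a grapheme cluster apart.
accented = "e\u0301x"                # 'e' + COMBINING ACUTE ACCENT, then 'x'
reversed_by_code_point = accented[::-1]

print(whole == piecewise)
print(ascii(reversed_by_code_point))
```

The whole-string call can see that the last sigma is word-final and choose
U+03C2, while the per-code-point loop always produces U+03C3. In the reversed
string the combining accent ends up attached to the 'x' instead of the 'e'.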

The character-level APIs are a convenience for very basic legacy
transformations, but alone they do not solve most internationalization
problems; otherwise they serve as a "protected" interface on top of which
more powerful String-to-String transformations can be built.

Once you realize that, which UTF you use to handle immutable String objects
is not important, because it becomes part of the "blackbox" implementation
of String instances. If you then consider the UTF as a blackbox, the real
arguments for one UTF or another depend on the set of String-to-String
transformations you want to use (because it conditions the implementation of
these transformations), but more importantly on how it affects the
efficiency of String storage allocation.

For this reason, the blackbox can itself determine which UTF or internal
encoding is best for performing those transformations: the total volume of
immutable string instances to handle in memory and the frequency of their
instantiation determine which representation to use (because large String
volumes put pressure on the memory manager, and can seriously impact the
overall application performance).
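
As one possible policy, sketched in Python: a blackbox could look at the
content of each string and pick the narrowest fixed-width encoding that can
hold it. The function name `pick_encoding` and the three-way choice are
assumptions of this example, and it assumes the input contains no lone
surrogates; a real implementation would also weigh volume and instantiation
frequency, as described above.

```python
def pick_encoding(s):
    """Return a fixed-width codec name wide enough for every code point in s."""
    top = max(map(ord, s), default=0)
    if top < 0x100:
        return "latin-1"     # 1 byte per code point
    if top < 0x10000:
        return "utf-16-le"   # 2 bytes per code point (BMP only, no surrogates)
    return "utf-32-le"       # 4 bytes per code point

def storage_size(s):
    # Number of bytes the chosen representation would occupy.
    return len(s.encode(pick_encoding(s)))

for sample in ("abc", "\u017c\u00f3\u0142w", "a\U0001d11e"):
    print(pick_encoding(sample), storage_size(sample))
```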

Using SCSU for such a String blackbox can be a good option if it effectively
helps to store many strings in a representation that is compact (for global
performance) but still very fast (for transformations).

Unfortunately, the immutable String implementations in Java, C# or Python
do not allow the application designer to decide which representation will
be the best (they are implemented as concrete classes instead of virtual
interfaces with several possible implementations, as they should be; the
alternative to interfaces would have been class-level methods allowing the
application to negotiate tuning parameters with the blackbox class
implementation).
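
To make the idea concrete, here is a minimal Python sketch of such an
interface with two interchangeable backends. The class names `Text`,
`Utf8Text` and `Utf32Text` and their methods are hypothetical; they are not
part of any of the libraries mentioned, and the sketch ignores performance
entirely.

```python
from abc import ABC, abstractmethod

class Text(ABC):
    """Hypothetical immutable string interface; the storage is hidden."""

    @abstractmethod
    def to_str(self):
        """Return the content as a native str."""

    @abstractmethod
    def storage_size(self):
        """Return the number of bytes used by the internal storage."""

    def upper(self):
        # String-to-string transformation; the result keeps the same backend.
        return type(self)(self.to_str().upper())

    def __eq__(self, other):
        # Equality depends on content, never on the internal encoding.
        return isinstance(other, Text) and self.to_str() == other.to_str()

    def __hash__(self):
        return hash(self.to_str())

class Utf8Text(Text):
    def __init__(self, s):
        self._data = s.encode("utf-8")      # compact for mostly-ASCII text

    def to_str(self):
        return self._data.decode("utf-8")

    def storage_size(self):
        return len(self._data)

class Utf32Text(Text):
    def __init__(self, s):
        self._data = s.encode("utf-32-le")  # fixed width, larger storage

    def to_str(self):
        return self._data.decode("utf-32-le")

    def storage_size(self):
        return len(self._data)

a, b = Utf8Text("stra\u00dfe"), Utf32Text("stra\u00dfe")
print(a == b, a.storage_size(), b.storage_size(), a.upper().to_str())
```

Callers only see `Text`; which backend holds the data is a tuning decision
that the application (or the library) can change without touching the code
that transforms strings.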

There are other classes or libraries within which such multiple
representations are possible, and are easily and transparently convertible
from one to the other. (Note that this discussion is about the UTF used to
represent code points, but today there is also a need to work on strings
within grapheme cluster boundaries, including the various normalization
forms, and a few libraries do exist in which the normalization can be
changed without changing the "immutable" aspect of Strings, the complexity
being that Strings do not always represent plain text...)


This archive was generated by hypermail 2.1.5 : Sat Dec 04 2004 - 09:42:28 CST