Matching Unicode strings and combining characters [was: basic question...]

From: Juliusz Chroboczek (jec@dcs.ed.ac.uk)
Date: Thu Sep 30 1999 - 08:45:31 EDT


Dear everyone,

Frank has been describing fairly clearly a problem due to the fact
that combining characters follow the base characters. As I understand
it, nobody is asking that Unicode be changed; it is just that the
problem needs to be understood before we can think about working
around it.

Consider a consumer of an ASCII stream (a terminal emulator, an
`expect' script, etc.). The consumer needs to recognise certain
conditions, and it has been known for a long time that only some
conditions are /observable/ (terminology due, I believe, to Abramsky,
although the same observation was made before with different
terminology).

For example, the consumer can wait for the appearance of an `a'
(algorithm: wait for a character, if it is an `a' return, else
repeat). Therefore, the set of all (possibly infinite) strings
containing an `a' is an observable. On the other hand, the set of
(possibly infinite) strings *not* containing an `a' is not an
observable (at any point in time, you cannot be sure that an `a' will
not appear later).
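
To make the asymmetry concrete, here is a minimal Python sketch of the
waiting algorithm just described. The name `wait_for_a' is mine, and a
Python iterator stands in for the stream; both are assumptions made
for illustration only.

```python
import itertools
from typing import Iterable


def wait_for_a(stream: Iterable[str]) -> bool:
    """Return True as soon as an 'a' arrives on the stream."""
    for ch in stream:
        if ch == "a":
            return True
    # Reached only if the stream ends. On an endless stream that never
    # contains 'a', the loop above simply never terminates, which is
    # why "no 'a' ever appears" cannot be observed.
    return False


# An endless stream: "xyz" followed by 'a' repeated forever.
endless = itertools.chain("xyz", itertools.repeat("a"))
print(wait_for_a(endless))  # prints True after reading four characters
```

The positive answer arrives after finitely many characters; there is
no point at which a negative answer could safely be given on a stream
that is still open.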

Note furthermore that the set of all strings in which the first
occurrence of `a' is not followed by `b' is not observable (after
receiving the `a', you might need to wait forever). On the other
hand, the set of strings in which the first `a' is followed by
something other than a `b' is.
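
The observable variant can be sketched the same way. Again the name
`first_a_then_not_b' is hypothetical, and an iterator is assumed to
model the stream.

```python
from typing import Iterable


def first_a_then_not_b(stream: Iterable[str]) -> bool:
    """Succeed once the first 'a' is followed by a character other than 'b'."""
    it = iter(stream)
    for ch in it:
        if ch == "a":
            # One more character must actually arrive before we can
            # answer; on a live stream this call may block forever.
            following = next(it, None)
            return following is not None and following != "b"
    return False


print(first_a_then_not_b("xxac"))  # True: 'a' is followed by 'c'
print(first_a_then_not_b("xxab"))  # False: 'a' is followed by 'b'
```

The weaker condition, "not followed by `b'", would also have to accept
a stream that goes silent after the `a', and that is exactly the case
in which the consumer can never commit to an answer.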

I'd like to stress that this is not an artefact of any particular
implementation; it is a theoretical result that applies to all
possible implementations (assuming no timeout is used). There's
nothing new about it; this is the bread and butter of domain theory.

(By the way, for the mathematically minded, the set of observations is
closed under arbitrary union and finite intersection; it is therefore
a topology. The notion of ``expect script'' is usually formalised as
a continuous map from this topological space into the Sierpinski
space.)

The notion of observation over ASCII is the one that we've learned to
live with. I am quite willing to accept that if I search for
``interaction'', I'll also match the beginning of ``interactions''.
But in Unicode, the problem is compounded by canonical equivalence and
by the combining characters. Thus, we end up with the situation that
it is impossible to match ``voila'' without also matching the
beginning of ``voilà'' (more formally, there is no observation
containing all the strings containing ``voila'' without also containing
at least some representations of ``voilà''). This is not what the
user expects. This problem does not occur if the combining characters
are placed before, rather than after, the base character.
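
The effect is easy to reproduce with Python's standard `unicodedata'
module. This is only an illustration of the prefix problem on complete
strings, not of any particular protocol; the variable names are mine.

```python
import unicodedata

plain = "voila"
precomposed = "voil\u00e0"  # a-grave as the single code point U+00E0
decomposed = unicodedata.normalize("NFD", precomposed)

# In decomposed form the grave accent (U+0300) follows the base 'a'.
print(decomposed == "voila\u0300")     # True

# So the unaccented word is a code-point prefix of the accented one...
print(decomposed.startswith(plain))    # True

# ...whereas the precomposed spelling does not have this problem.
print(precomposed.startswith(plain))   # False
```

A consumer that has just read the five code points of ``voila'' cannot
tell which of the two words it is looking at until one more code point
arrives.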

Contrary to what one contributor said, the Polish user[1] does indeed
expect a string such as ``Banac'' to match the beginning of ``Banach''
(although ``ch'' is a single letter in Polish grammar). (The actual
example used Czech.) On the other hand, nobody will convince me that
the Italian user expects ``casino'' to match the beginning of
``casinò'', or that the French user expects the verb ``a'' to match
the beginning of the preposition ``à''.

Contrary to what another contributor implied, this has nothing to do
with contextual substitution. The question of whether the Arabic
``mujallis'' (with a final sin) should match the beginning of
``mujallisIn'' (with a medial sin) is a completely orthogonal
question. One cannot just dismiss people stating this problem as not
having grasped the distinction between character and glyph.

As far as I remember, the solutions proposed by the list have been:

* convince your users that having `voila' match the beginning of
  `voilà' is the expected behaviour (I wish you happy convincing);

* use a precomposed normalisation form for all protocols (quite
  reasonable and easy to implement, but only solves the problem for
  encoded composites);

* explicitly terminate all output with a control character (say,
  U+0000); this would require changing all software involved
  (by the way, what about using ZWNJ for the purpose?);

* use an `inverted Unicode' encoding in the protocol, one that puts
  combining characters *before* the base characters.
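
As a rough illustration of the second and fourth options, here is a
Python sketch. The function `invert' is hypothetical and is not an
existing encoding or library call; it assumes that a combining mark is
any character whose canonical combining class is non-zero, which is a
simplification.

```python
import unicodedata


def invert(text: str) -> str:
    """Move each run of combining marks in front of its base character."""
    out = []
    base = None   # base character still waiting for its marks
    marks = []    # marks collected for that base
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):  # non-zero combining class
            marks.append(ch)
        else:
            # A new base begins: emit the previous marks, then its base.
            out.extend(marks)
            if base is not None:
                out.append(base)
            base, marks = ch, []
    out.extend(marks)
    if base is not None:
        out.append(base)
    return "".join(out)


print(invert("voil\u00e0") == "voil\u0300a")               # True
print(invert("voil\u00e0").startswith(invert("voila")))    # False

# Precomposed normalisation only helps where a composite is encoded:
# 'x' with a grave accent has none, so NFC leaves two code points.
print(unicodedata.normalize("NFC", "x\u0300") == "x\u0300")  # True
```

With the marks in front, the arrival of the base character closes the
combining sequence, so ``voila'' no longer matches the beginning of
``voilà''.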

If I have missed any other solutions, please yell at me.

Sincerely,

                                        J.

[1] At least one Polish user does.


