FW: Matching Unicode strings and combining characters [was: basic question...]

From: Marco.Cimarosti@icl.com
Date: Thu Sep 30 1999 - 11:12:09 EDT

J. Chroboczek wrote:

        Contrary to what one contributor said, the Polish user[1] does
        expect a string such as ``Banac'' to match the beginning of ``Banach''
        (although ``ch'' is a single letter in Polish grammar). (The actual
        example used Czech.) On the other hand, nobody will convince me
        the Italian user expects ``casino'' to match the beginning of
        ``casinò'', or that the French user expects the verb ``a'' to match
        the beginning of the preposition ``à''.

I am the Italian user, and what you say about me is wrong.

I do expect "casino" to be the beginning of "casinò", because Italian
accents are mostly seen as signs that *follow* the vowel.
In handwriting, they are always traced after the vowel and, often, far to
the right of it.

Also with typewriters or computers, accents (which are always on the last
letter of a word) are normally replaced by an ASCII apostrophe (or "single
quote"). You can check any Italian newsgroup to see that most people would
write citta' rather than città, casino' rather than casinò.

There is also an historical reason for this: most of the accents in modern
Italian spelling in older times used to be apostrophes. All those
apostrophes were used to show that the last syllable of a word had been
dropped and, thus, the stressed syllable (formerly the second-last) had
become the last one. This is true for almost all Italian words wearing an
accent (exceptions are the future tense form of verbs and a few borrowed
words): città used to be citta' and, even before cittade (cognate of Spanish

So, yes, when I search for città (precomposed) I would like my program to
also find città (combining mark), citta', citta`, and citta.
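
A minimal sketch of the loose matching I am asking for, in Python. The helper
name fold_italian and its folding rules are assumptions made up for this
illustration, not an established algorithm: it decomposes the word, drops
combining marks, and drops a trailing apostrophe or grave used as a stand-in
accent.

```python
import unicodedata

def fold_italian(word: str) -> str:
    """Reduce a word to a loose search key (hypothetical helper)."""
    # Decompose precomposed letters: "à" becomes "a" + U+0300.
    decomposed = unicodedata.normalize("NFD", word)
    # Drop combining marks (Unicode general category Mn).
    base = "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")
    # Drop a trailing ASCII apostrophe or grave used in place of an accent.
    return base.rstrip("'`").casefold()

variants = ["citt\u00e0", "citta\u0300", "citta'", "citta`", "citta"]
keys = {fold_italian(v) for v in variants}  # all five collapse to one key
```

Note that this also strips a genuine trailing apostrophe (an elision), so a
real search tool would probably want something more careful.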

- - - - -

But, apart from this, the "terminal" problem that is being discussed is
absolutely a false problem.

In fact, as has already been said, why should the case of "login:^" be any
different from the case of "login:Q"?

What is the NEW problem brought by Unicode or combining characters?

Somebody says: if my application is waiting for "login", it will not trigger
if it receives "logiñ" (where ñ is precomposed) but it would trigger with
"login~", (where ~ is a combining mark). That is true, so what!? It will
also trigger with "Skloginnat!" and not trigger with "Login". So what? It
will also not trigger with "login" (where o is a Cyrillic letter). So what?
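
The cases above can be checked with a small Python sketch. The triggers
function is a hypothetical stand-in for the application's matching, assumed
here to be a plain substring test on code points; the escapes spell out which
characters are meant.

```python
def triggers(received: str, expected: str = "login") -> bool:
    """Naive trigger: fire when the expected text occurs anywhere."""
    return expected in received

cases = {
    "precomposed": "logi\u00f1",   # ñ as a single code point
    "combining": "login\u0303",    # n followed by COMBINING TILDE
    "embedded": "Skloginnat!",
    "capitalised": "Login",
    "cyrillic": "l\u043egin",      # U+043E CYRILLIC SMALL LETTER O
}
results = {name: triggers(text) for name, text in cases.items()}
# Only "combining" and "embedded" fire; none of this is specific to
# combining characters.
```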

A terminal application is a program made to handle three things: a display,
a keyboard, and a communication line.

There cannot be any concept like "Unicode combining characters" in the
module that handles the *communications line*. Even better: there can be no
"character" concept at all in this module! Its task is just to receive and
transmit *data*, in chunks of 7, 8, 16, 32, 64, 999 bits and has no need
(nor rights) to know what these data represent.

Similarly, you cannot have any "dead keys" or "communications protocols" in
the *display* module. The task of this module is to take *meaningless*
sequences of bits ("characters") and transform them into graphics on the
screen that a human may recognise as "writing". The module does not know
whether it has been linked into a terminal application or into the Rogue
program: it just displays Unicode text for whatever caller.

The *keyboard* module, finally, will take care of keys (dead or alive) and
could also do quite complex things with dictionary-based input methods for
CJK scripts. But it is neither required nor allowed to know anything about
the properties of characters, or whether the line is busy or not.

If you want to add something to this basic display+keyboard+line scheme, you
have to write it properly and plug it in the right place.

If you decide to add a scripting facility to automate some tasks, you should
write a scripting facility, not something different... This module MUST NOT
know anything about characters (that's the task of the display module), nor
keys (that's the task of the keyboard module), nor protocols (that's the task
of the line module): it just has to know how to load, parse and execute
a script.
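
A rough sketch of such a scripting module, in Python. The two-word script
syntax ("expect ..." / "send ...") and the names parse_script and run_script
are invented for this illustration. The point is only that the module loads,
parses and executes, and hands every step to handlers owned by other modules.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (command, argument), e.g. ("expect", "login:")

def parse_script(source: str) -> List[Step]:
    """Parse lines of the form 'command argument' into steps."""
    steps = []
    for line in source.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        command, _, argument = line.partition(" ")
        steps.append((command, argument))
    return steps

def run_script(steps: List[Step],
               expect: Callable[[str], None],
               send: Callable[[str], None]) -> None:
    """Execute steps by delegating to handlers supplied by the caller."""
    handlers = {"expect": expect, "send": send}
    for command, argument in steps:
        handlers[command](argument)

# The handlers here just record calls; a real terminal would pass in
# functions belonging to its own modules.
log = []
run_script(parse_script("expect login:\nsend marco\n"),
           expect=lambda text: log.append(("expect", text)),
           send=lambda text: log.append(("send", text)))
```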

The problems are caused by the fact that the program's modules don't mind
their own business, and they try to poke their noses where they shouldn't.
Or, even worse, there are no modules at all, and the terminal programs we
are talking about amount to spaghetti code.

        Marco Cimarosti
