RE: A basic question on encoding Latin characters

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Sep 28 1999 - 19:35:36 EDT


Robert,

>
> Indeed. However, for the same reason the flicker happens, Unicode
> combining characters break with 0-lookahead string matching [1] like that
> in "expect". This obviously cannot be fixed.
>

User Commands EXPECT(1)

           If a pattern matches, then the corresponding body is
           executed. expect returns the result of the body (or
           the empty string if no pattern matched). In the event
           that multiple patterns match, the one appearing first
           is used to select a body.

           Each time new output arrives, it is compared to each
           pattern in the order they are listed. Thus, you may
           test for absence of a match by making the last pattern
           something guaranteed to appear, such as a prompt. In
           situations where there is no prompt, you must use
           timeout (just like you would if you were interacting
           manually).

> In UTF-8 over a TTY, you cannot tell the difference between
>
> a <wait-forever>
> and
> a <very-long-pause-indeed> <combining-ring-above>

If you are expecting a pattern match for an a-ring, and if you haven't
specified normalized data in your data stream, then the pattern
match should be set up so that:

    å <-- matches
    a <pause> <combining-ring-above> <-- matches

Yes, this precludes having a distinct pattern match on <a> itself.
But that is no different than if you are expecting a pattern match
for "ab":

    a <pause> b <--matches

That also precludes having a distinct pattern match on <a> itself.
Where is the difference? Not a letter, you say?

Then how about Czech "ch", which every Czech will tell you is a letter:

    c <pause> h <--matches, but precludes a match on <c> alone.
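
The prefix problem in the three cases above can be sketched in Python. This is a simplified model, not Expect itself: `first_match` is a hypothetical helper that tries each pattern in listed order every time a chunk of output arrives, with no lookahead and no timeout.

```python
def first_match(chunks, patterns):
    """Return the first pattern found as output arrives chunk by chunk.

    Hypothetical stand-in for a 0-lookahead matcher: after every chunk,
    the patterns are tried in the order listed, and the first hit wins.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        for pattern in patterns:
            if pattern in buffer:
                return pattern
    return None


# A bare "a" pattern fires as soon as "a" arrives, so "ab" never wins.
assert first_match(["a", "b"], ["ab", "a"]) == "a"

# The same thing happens with a + COMBINING RING ABOVE (U+030A).
assert first_match(["a", "\u030a"], ["a\u030a", "a"]) == "a"

# Without the bare pattern, the longer sequence matches across the pause.
assert first_match(["a", "b"], ["ab"]) == "ab"
assert first_match(["a", "\u030a"], ["a\u030a"]) == "a\u030a"
```

In this model the combining sequence behaves exactly like "ab" or "ch": the conflict comes from one pattern being a prefix of another, not from combining characters as such.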

If you *must* have a pattern match on both <a> and <å> in this kind
of environment, then specify Normalization Form C for the data stream.
That avoids having the two canonical equivalences both in the matching
patterns and allows you to use precomposed Latin characters for your
matches without screwing up matches on the base forms as well.
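
A minimal sketch of what Normalization Form C buys here, using Python's standard `unicodedata` module. It assumes the whole string is available before normalizing; the `commands` table is purely illustrative.

```python
import unicodedata

decomposed = "a\u030a"   # a + COMBINING RING ABOVE
precomposed = "\u00e5"   # LATIN SMALL LETTER A WITH RING ABOVE


def nfc(text):
    """Return text in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# Both canonical equivalents collapse to the single precomposed character.
assert nfc(decomposed) == precomposed
assert nfc(precomposed) == precomposed
assert len(nfc(decomposed)) == 1

# A plain "a" is untouched, so it stays distinct from a-ring.
assert nfc("a") == "a"

# Illustrative match table: one entry per letter is enough for NFC data.
commands = {"a": "command-for-a", precomposed: "command-for-a-ring"}
assert commands[nfc(decomposed)] == "command-for-a-ring"
assert commands[nfc("a")] == "command-for-a"
```

With the stream specified as NFC, the pattern table needs only the precomposed form, and a match on <a> no longer collides with the start of a decomposed a-ring.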

>
> unless you use a timeout-based system, which should be avoided for Very
> Good Reasons.

No argument there.

>
> This is especially bad for keypress-based apps whose users may wish to
> define different commands for "a" and "a-ring-above". This is an actual
> practical problem.

Pattern matching such as that described above for Expect is another
disguised instance of ASCII-think in computers: it depends on the
assumption character=glyph (or more correctly, character=grapheme in
this case), since it is expecting one user "keypress" for their
"character"--i.e. grapheme--to correspond to one character position
in a matching string table. This is amenable to the Latin-1, Latin-2
type hacks to deal with Swedish, German, French, etc., but gets
progressively more broken as the model is extended to other writing
systems. So the keypress-based app with precomposed letters works for
"a-ring-above" without disturbing a match against "a". What if my
language is Klikitat or something, and I have letters spelled
"t modifier-letter-theta combining-comma-above" that have to be
kept distinct from "t" and "t modifier-letter-theta"?? Is that an
argument to keep encoding hundreds more digraphs, trigraphs, letter-accent,
letter-double-accent, digraph-accent combinations, etc., so that the
broken keypress-based apps can work for (several thousand) Latin-based languages
without change? I don't think so.
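
A short Python check of the point that precomposition does not scale. The code points are an assumption made for illustration: "modifier-letter-theta" is taken to be U+1DBF MODIFIER LETTER SMALL THETA and "combining-comma-above" to be U+0313 COMBINING COMMA ABOVE; the message itself names them only informally.

```python
import unicodedata

# Assumed spelling of the three-part letter (code points are an assumption).
letter = "t\u1dbf\u0313"

# a-ring has a precomposed form, so NFC reduces it to one code point.
assert len(unicodedata.normalize("NFC", "a\u030a")) == 1

# No precomposed form exists for this sequence: NFC leaves it as three
# code points, so one-position-per-letter matching cannot represent it.
normalized = unicodedata.normalize("NFC", letter)
assert normalized == letter
assert len(normalized) == 3

# The shorter spellings are prefixes of the full letter, which is the
# same prefix conflict as <a> versus a + combining ring.
assert letter.startswith("t")
assert letter.startswith("t\u1dbf")
```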

>
> This cannot be fixed, but pressure to move to canonically-decomposed forms
> of text will make the problem more noticeable.
>

Which is why Mark Davis worked so hard on specifying Normalization Form C--
so that lots of legacy software can have a less complex path to move
forward on as it is adapted to Unicode. But the path is still
complex and is not without the issues of combining characters.

A good portion of the interoperable data world is going to standardize
on Unicode Normalization Form C, I expect. That isn't going to be the
10646 Implementation Level 1 that some still seem to be pining for
to simplify their lives. That doesn't stop people from just implementing
10646 Implementation Level 1, but Level 1 is just not sufficient for
worldwide use.

--Ken

> --
> Robert
>
>


