RE: Matching Unicode strings and combining characters [was: basic question...]

From: Reynolds, Gregg (greynolds@datalogics.com)
Date: Thu Sep 30 1999 - 10:02:16 EDT


> -----Original Message-----
> From: Juliusz Chroboczek [mailto:jec@dcs.ed.ac.uk]
> Sent: Thursday, September 30, 1999 7:47 AM
> To: Unicode List
> Subject: Matching Unicode strings and combining characters [was: basic
> question...]
>
>
> Dear everyone,
> [...]
>
> by the combining characters. Thus, we end up with the situation that
> it is impossible to match ``voila'' without also matching the
> beginning of ``voilà'' (more formally, there is no observation
> containing all the strings containing ``voila'' without also containing
> at least some representations of ``voilà''). This is not what the
> user expects. This problem does not occur if the combining characters
> are placed before, rather than after, the base character.

I'm afraid it does indeed occur; it just looks a little different. You've
made an unfounded assumption about rendering (or querying or whatever)
behavior, to wit, that in a grammar with trailing modifiers the base
character shall be rendered (or otherwise "observed" or dealt with) in the
absence of its modifier, whereas in a grammar with leading modifiers the
modifier shall not be rendered (interpreted, etc) in the absence of its
base character. How would your Frenchman feel about "voil`"? In other words, your
examples disregard the semantics of Unicode. So far as I know, nothing in
Unicode says that a process that receives a partially transmitted
<base+diacritic> may legitimately interpret it as <base>.
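
To make that concrete, here is a minimal sketch in Python (my choice of
language, nothing in this thread prescribes one) of a search that refuses
to treat <base> as a match when a combining mark follows it. The name
find_whole is hypothetical, and the test "general category starts with M"
is a simplifying assumption; a full implementation would follow the
Unicode text segmentation rules rather than this shortcut.

```python
import unicodedata


def is_mark(ch):
    # Assumption: any code point in general category Mn, Mc or Me
    # attaches to the preceding base character.
    return unicodedata.category(ch).startswith("M")


def find_whole(text, needle):
    """Return start indices where needle matches and the match does not
    end in the middle of a <base+diacritic> sequence."""
    hits = []
    start = text.find(needle)
    while start != -1:
        end = start + len(needle)
        # Reject the hit if the next code point is a combining mark:
        # that mark belongs to the last base character of the match.
        if end == len(text) or not is_mark(text[end]):
            hits.append(start)
        start = text.find(needle, start + 1)
    return hits


plain = "voila"
accented = "voila\u0300"  # 'a' followed by COMBINING GRAVE ACCENT

print(find_whole(plain, "voila"))                   # matches at index 0
print(find_whole(accented, "voila"))                # no match
print(find_whole(accented + " " + plain, "voila"))  # only the second word
```

With trailing modifiers the check looks one code point ahead; with leading
modifiers it would look one code point behind. Either way the process has
to consult the grammar before declaring a match.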

I don't mean to be nasty (there are other threads for that ;) ), but this
subject has come up several times and for the life of me I can't see what's
so difficult about it. Unicode has a grammar for characters; if you want to
support it, then gather all the bytes you need before proceeding. How hard
is this? It's no different than waiting for a full 8 bits (or 7) before
interpreting an ASCII character. If you receive 0110 01, do you assume the
following two bits are 00 and emit a 'd'? Obviously not; you wait for the
next two bits. No doubt somebody will object that bits are accumulated at a
lower level - precisely the point! I imagine somebody will get excited
about displaying characters "as soon as they are received" for the sake of
responsiveness, to which the answer is, yes, and not before they are fully
received.
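
For what it is worth, "gather what you need before proceeding" can be
sketched in a few lines of Python. This is only an illustration under
stated assumptions: complete_sequences is a hypothetical name, the input
is taken to be an iterable of already-decoded code points, and "combining"
is approximated as general category M, which is cruder than the real
segmentation rules.

```python
import unicodedata


def complete_sequences(code_points):
    """Yield each base character together with its trailing combining
    marks, and only once the sequence is known to be complete."""
    pending = ""
    for ch in code_points:
        is_mark = unicodedata.category(ch).startswith("M")
        # A non-mark code point closes whatever sequence was pending.
        if pending and not is_mark:
            yield pending
            pending = ""
        pending += ch
    # End of input also closes the last sequence.
    if pending:
        yield pending


# 'a' + COMBINING GRAVE ACCENT is handed over as one unit; the bare 'a'
# is never emitted on its own.
for unit in complete_sequences("voila\u0300!"):
    print(repr(unit))
```

The receiver lags by exactly one code point, which is the same kind of
wait as holding 0110 01 until the last two bits arrive.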

In short, there is no problem here, only a decision to be made as to whether
or not to support Unicode.

Sincerely,

Gregg


