Re: Matching Unicode strings and combining characters [was: basic

From: Paul Keinanen (
Date: Thu Sep 30 1999 - 12:10:59 EDT

On Thu, 30 Sep 1999 07:00:17 -0700 (PDT), "Reynolds, Gregg"
<> wrote:

>I don't mean to be nasty (there are other threads for that ;) ), but this
>subject has come up several times and for the life of me I can't see what's
>so difficult about it.

There is no problem as long as all the data is available at once, as is
the case with a disk file, or when there is a low-level protocol that
guarantees that a complete abstract character (base + all combining
marks) is delivered in one atomic unit, e.g. in a UDP frame.

The problem starts when the bytes are not delivered as atomic units,
e.g. on asynchronous serial lines or over TCP/IP.

> Unicode has a grammar for characters; if you want to
>support it, then gather all the bytes you need before proceeding. How hard
>is this?

How do I know that all bytes have arrived? How long do I have to wait
to know that no more combining marks will arrive after a base character?
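
To make the question concrete, here is a minimal Python sketch of a
receiver that gathers base + combining marks from a stream. The class
name ClusterAssembler and the rule "general category M means combining
mark" are my own simplifying assumptions, not anything defined by
Unicode as a protocol. Note that a cluster is only released when the
next base character shows up, so the last one received is always held
back.

```python
import unicodedata


class ClusterAssembler:
    """Collects code points into base + combining-mark clusters (illustrative)."""

    def __init__(self):
        self.pending = ""  # received, but not yet known to be complete

    def feed(self, text):
        """Accept newly arrived code points; return clusters known to be complete."""
        done = []
        for ch in text:
            # Assumption: any character in a Mark category (Mn, Mc, Me) combines.
            is_mark = unicodedata.category(ch).startswith("M")
            if self.pending and not is_mark:
                # Only the arrival of the NEXT base character proves that
                # the previous cluster has no more marks coming.
                done.append(self.pending)
                self.pending = ""
            self.pending += ch
        return done


asm = ClusterAssembler()
first = asm.feed("a")        # nothing released: 'a' might still get a mark
second = asm.feed("\u030a")  # COMBINING RING ABOVE joins the pending 'a'
third = asm.feed("b")        # only now is the a-ring cluster known to be complete
```

After the last call the 'b' is still sitting in asm.pending, and nothing
in the stream itself says how long it has to stay there.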

>It's no different than waiting for a full 8 bits (or 7) before
>interpreting an ascii character. If you receive 0110 01, do you assume the
>following two bits are 00 and emit a 'd'? Obviously not; you wait for the
>next two bits.

In an asynchronous link, the arrival time of the remaining bits is
extremely well defined, and based on this, some UARTs can be configured
to return the character when exactly the specified number of data bits
has been received. This would be analogous to transmitting each abstract
Unicode character as 8 bytes, i.e. the most common characters as 2
bytes and 6 filler bytes, or 2 base + 2 combining mark bytes followed
by 4 filler bytes, etc. In this case the receiver knows that after it
has received 8 bytes, a complete abstract character has been received.
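
As a rough sketch of that fixed-size analogy, assuming big-endian
UTF-16 code units and zero bytes as the filler (both are my choices
for illustration; the function names are hypothetical too):

```python
FRAME_SIZE = 8  # bytes per abstract character, as in the analogy above


def encode_fixed(cluster):
    """Encode one base + combining sequence into exactly FRAME_SIZE bytes."""
    data = cluster.encode("utf-16-be")
    if len(data) > FRAME_SIZE:
        raise ValueError("cluster does not fit in one frame")
    # Assumption: zero bytes serve as filler.
    return data + b"\x00" * (FRAME_SIZE - len(data))


def decode_fixed(frame):
    """Decode one complete frame; the receiver simply counts to FRAME_SIZE bytes."""
    if len(frame) != FRAME_SIZE:
        raise ValueError("incomplete frame")
    return frame.decode("utf-16-be").rstrip("\x00")


plain = encode_fixed("a")             # 2 data bytes + 6 filler bytes
ringed = encode_fixed("a\u030a")      # 2 base + 2 combining mark bytes + 4 filler
```

The receiver never has to guess: once 8 bytes are in, the character is
complete. The price is the wasted filler and a hard upper limit on the
number of combining marks per base character.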

A more common situation with asynchronous communication is that the
receiving UART, after receiving the specified number of data bits, waits
for one more well-defined bit time and checks that it really received
a '1' bit, which is the stop bit in asynchronous communication. In the
Unicode environment this would be analogous to transmitting any
number of bytes for the base and combining characters and finally
terminating each abstract character with U+0000 or ZWNJ, i.e. the stop 'bit'.
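
The stop 'bit' variant could look roughly like the sketch below. It
assumes U+0000 as the terminator and big-endian UTF-16 on the wire; the
class name TerminatedReader is hypothetical. The incremental decoder
from the standard codecs module takes care of a code unit that is split
across two deliveries.

```python
import codecs

TERMINATOR = "\u0000"  # assumed stop marker; ZWNJ would work the same way


class TerminatedReader:
    """Releases an abstract character only when its terminator has arrived."""

    def __init__(self):
        self._decoder = codecs.getincrementaldecoder("utf-16-be")()
        self._pending = ""

    def feed(self, data):
        """Accept raw bytes in arbitrary pieces; return the completed characters."""
        self._pending += self._decoder.decode(data)
        # Everything before a terminator is complete; the tail is still open.
        *complete, self._pending = self._pending.split(TERMINATOR)
        return complete


reader = TerminatedReader()
part1 = reader.feed(b"\x00a\x03")        # 'a' plus half of the combining mark
part2 = reader.feed(b"\x0a\x00\x00")     # rest of the mark, then the terminator
part3 = reader.feed(b"\x00a\x00\x00")    # plain 'a', terminated at once
```

Here the receiver can act on a plain 'a' as soon as its terminator
arrives, without waiting for whatever the user types next.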

>No doubt somebody will object that bits are accumulated at a
>lower level - precisely the point! I imagine somebody will get excited
>about displaying characters "as soon as they are received" for the sake of
>responsiveness, to which the answer is, yes, and not before they are fully

Take for example a Windows-style menu (e.g. the File menu, where you
can select Open or Print by simply pressing the 'o' or the 'p'
key on the keyboard, or whatever character is underlined in
the menu). Suppose you have a menu where one option starts with 'a'
and another with <a-ring>, and there is some "flexible" connection
(e.g. telnet or an asynchronous serial line) between the keyboard and the
device interpreting the selection. When can the application determine
whether the <a> or the <a-ring> key was pressed in order to activate the
function?
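
The ambiguity can be shown in a few lines of Python. This is only an
illustration under my own assumptions: the helper name candidates is
hypothetical, and the keyboard is assumed to send <a-ring> in
decomposed form (base 'a' followed by U+030A).

```python
import unicodedata


def candidates(received, menu_keys):
    """Return the menu keys that are still possible given the input so far."""
    prefix = unicodedata.normalize("NFD", received)
    return [
        key
        for key in menu_keys
        if unicodedata.normalize("NFD", key).startswith(prefix)
    ]


menu = ["a", "\u00e5", "o", "p"]             # 'a', <a-ring>, 'o', 'p'
after_base = candidates("a", menu)           # both 'a' and <a-ring> remain
after_ring = candidates("a\u030a", menu)     # only <a-ring> remains
after_o = candidates("o", menu)              # unambiguous immediately
```

After the base 'a' alone has arrived, two menu entries are still
possible, and the application has no way of telling from the data
received so far which one was meant.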

>In short, there is no problem here, only a decision to be made as to whether
>or not to support Unicode.

So Unicode is not supported in a real-time environment?


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:53 EDT