"ch" as in yecch

From: Tex Texin (texin@progress.com)
Date: Sat Oct 23 1999 - 15:32:16 EDT

Dear Uni-people,

I am of course a supporter and a beneficiary of Unicode
and its many improvements over legacy encodings. As an application
implementer, and not a linguist, typographer, or nationalist (i.e.,
not favoring one language or politics over another),
I look to Unicode to provide me with standardization so I can
provide world-wide plain-text support. I like that Unicode
defines algorithms for bidirectional support, character properties
and the like, and I am no longer in the business of researching
both code pages and the algorithms for using those code pages.
(Well, I do a lot less of it now anyway. ;-) )

As a software designer, I need to understand and rely on some
basic principles. For example, I have to have a GetNextCharacter
routine. This list can go on and on (and on...) about abstract
characters, versus letters, versus graphemes, but I need to
implement something close to what a user expects (or what I can
teach them to work with), and to have
an element, or basic unit, that I can manipulate and design to use
in my software.

I can work with 16-bit units as Unicode defines them, and I can
program to either provide users with behaviors based on these
units (e.g. cursor-right moves through each unit, i.e. through
each diacritic, tone mark, etc.) or I can
provide users with a more complete element that users traditionally
think of as a character (e.g. cursor-right moves to the next letter).

However, Unicode does not prescribe how I recognize or work
with elements like "ch". As an implementer, I thought GetNextCharacter
would look for a 16-bit abstract character (or multi-16-bit
surrogate) followed by some non-spacing elements. Apparently, this
is not sufficient.
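
As a rough sketch of the GetNextCharacter just described (one base
code point, possibly a surrogate pair, followed by any non-spacing
marks), something like the following could serve. The function names
and the use of Python's unicodedata categories Mn/Me as the test for
"non-spacing" are assumptions made for illustration, not anything the
standard prescribes.

```python
import unicodedata

def to_utf16_units(text):
    """Encode a string as a list of 16-bit code units (big-endian order)."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[k:k + 2], "big") for k in range(0, len(raw), 2)]

def decode_scalar(units, i):
    """Return (code_point, units_consumed) for the scalar starting at units[i]."""
    u = units[i]
    is_high = 0xD800 <= u <= 0xDBFF
    if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
        cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
        return cp, 2
    return u, 1  # BMP code point, or an unpaired surrogate passed through

def is_nonspacing(cp):
    # Assumption: "non-spacing" means general category Mn or Me.
    return unicodedata.category(chr(cp)) in ("Mn", "Me")

def get_next_character(units, i):
    """Return the index just past the element that starts at units[i]."""
    _, n = decode_scalar(units, i)
    j = i + n
    while j < len(units):
        cp, n = decode_scalar(units, j)
        if not is_nonspacing(cp):
            break
        j += n
    return j
```

With this sketch, "e" followed by U+0301 is consumed as one element of
two units, but "ch" still comes back as two separate elements, "c" and
then "h", which is exactly the gap described above.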

I also thought I could look up properties of "characters" but I
don't know where to look up the properties of "ch".
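
To make the property problem concrete, here is a small sketch using
Python's unicodedata module as a stand-in for a property table. The
helper names are hypothetical. A lookup works per code point, so "c"
and "h" each have an entry, but there is no single entry to consult
for "ch" as a unit.

```python
import unicodedata

def has_single_entry(element):
    """True if the element can be looked up as one code point."""
    try:
        unicodedata.category(element)
        return True
    except TypeError:
        # unicodedata accepts exactly one code point per lookup.
        return False

def code_point_properties(element):
    """Return (name, general category, combining class) per code point."""
    return [
        (unicodedata.name(c, "<unnamed>"),
         unicodedata.category(c),
         unicodedata.combining(c))
        for c in element
    ]
```

Here has_single_entry("c") is true and has_single_entry("ch") is
false; code_point_properties("ch") yields two rows, one for each
letter, and nothing for the pair.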

I imagine some of these "multi-non-spacing character" elements
will cause me to redesign my approach to bidi as well.

Although I recognize the many benefits of Unicode, if I cannot
understand how to reliably implement GetNextCharacter, and have
a property table for these "multi-non-spacing character" elements,
then many current designs for Unicode applications are now inadequate.
For developers, the greatest benefit of Unicode was that it provided
standardization for the basic character element.
I suddenly feel thrown back into the multi-codepage
quagmire of researching, researching, researching, and probably
continually revamping and re-generalizing my software to accommodate
new character types as I uncover them.

I believe I do understand the rationales offered for why "ch"
should not be a character. I would rather the onus be shifted to
input methods and legacy conversion programs to determine whether
"ch" should be encoded as a single element or not, rather than
having all remaining software be continually analyzing this.

I would not like for this note to kick off a repetition of everything
that has already been said many times. Perhaps it is inevitable.
I would really like a clear definition for knowing how many bytes
or 16-bit words to read to get to the end of the current "letter",
and how to look up the properties of that letter, its case, etc.
These are the basic elements I need to use in my software.
The algorithms for doing this should be able to support Slovak and
other languages in a uniform way.
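
One possible shape for such a uniform routine, offered only as a
sketch: the same scan for every language, driven by a per-language
table of multi-letter elements. The table TAILORED_ELEMENTS, its
contents, the "sk" language tag, and the function names are all
assumptions for illustration; nothing here is defined by Unicode.

```python
import unicodedata

# Hypothetical tailoring table: multi-letter sequences that a given
# language treats as one letter. The entries are illustrative only.
TAILORED_ELEMENTS = {
    "sk": ("ch", "dz", "d\u017e"),
}

def next_letter_end(text, i, lang=None):
    """Return the index just past the "letter" starting at text[i]."""
    base_len = 1
    # Try longer sequences first so the longest match wins.
    candidates = sorted(TAILORED_ELEMENTS.get(lang, ()), key=len, reverse=True)
    for seq in candidates:
        if text[i:i + len(seq)].lower() == seq:
            base_len = len(seq)
            break
    j = i + base_len
    # Absorb trailing non-spacing marks (assumed: categories Mn and Me).
    while j < len(text) and unicodedata.category(text[j]) in ("Mn", "Me"):
        j += 1
    return j

def letters(text, lang=None):
    """Split text into letters using next_letter_end."""
    out, i = [], 0
    while i < len(text):
        j = next_letter_end(text, i, lang)
        out.append(text[i:j])
        i = j
    return out
```

With the "sk" table, letters("chlieb", "sk") starts with "ch" as one
element; without a language, the same call splits it into "c" and
"h". The cost is that every caller must know the language of the
text, which plain text by itself does not carry.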


Progress Software: The #1 Embedded Database 
Tex Texin                      Director, International Products
Progress Software Corp.        Voice:         +1-781-280-4271
14 Oak Park                      Fax:         +1-781-280-4949
Bedford, MA 01730  USA             texin@bedford.progress.com

http://www.progress.com http://apptivity.progress.com