Re: "ch" as in yecch

From: peter_constable@sil.org
Date: Mon Oct 25 1999 - 00:04:23 EDT


       This is a valuable pair of contributions that has come out of a
       rambling and not very significant thread which has gone on
       mostly in an effort to help increase the general level of
       understanding of participants of this list as to how Unicode is
       used to represent in software the real-world elements of
       writing systems (not entirely successfully, I fear).

       Software developers, pay attention! (And thanks, Tex and Mark.)

       Peter

       ---------------------- Forwarded by Peter
       Constable/IntlAdmin/WCT on 10/24/99 09:58 PM
       ---------------------------

       From: <markdavis@ispchannel.com> AT Internet on 10/23/99 04:14
             PM CDT

       Received on: 10/23/99

       To: Peter Constable/IntlAdmin/WCT, unicode@unicode.org AT
             Internet@Ccmail
       cc:
       Subject: Re: "ch" as in yecch


       You raise a very good point.

       What we use in ICU and in Java is a BreakIterator, which gives
       you 'character' boundaries (you can also choose word, line, and
       sentence boundaries). Cursor movement is one area where
       character boundaries are useful; searching is another (if a
       search matches, you want to ensure that the boundaries of the
       match are character boundaries).

       You use getCharacterInstance(Locale) to get the iterator; it
       will then give you boundaries on text. The character properties
       of each such unit as a whole are extrapolated from its first
       code point.
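
       As a minimal sketch of the usage described above, the Java
       code below walks the character boundaries of a string and
       collects the text between them. The class and method names
       (CharacterBoundaries, split) are made up for illustration;
       only BreakIterator and its methods come from java.text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CharacterBoundaries {
    /** Splits text into the units delimited by 'character' boundaries. */
    static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(text);
        List<String> units = new ArrayList<>();
        int start = it.first();
        // next() returns each successive boundary, then DONE at the end.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            units.add(text.substring(start, end));
        }
        return units;
    }

    public static void main(String[] args) {
        // "e" followed by U+0301 COMBINING ACUTE ACCENT, then "x":
        // three 16-bit units, but two units between character boundaries.
        List<String> units = split("e\u0301x", Locale.ROOT);
        System.out.println(units.size()); // prints 2
    }
}
```

       Note that this groups a base letter with its combining marks;
       it does not, by itself, treat a sequence such as "ch" as one
       unit.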

       The Java interface is at:

       http://java.sun.com/products/jdk/1.2/docs/api/java/text/BreakIterator.html

       The ICU C++ interface is at:

       http://www10.software.ibm.com/developerworks/opensource/icu/project/html/BreakIterator.html

       It also has the overall documentation.

       The ICU C interface is in separate routines, all starting with
       "ubrk_". For example, to open, iterate, and close you would use
       the following:

       http://www10.software.ibm.com/developerworks/opensource/icu/project/html/ubrk_open.html
       http://www10.software.ibm.com/developerworks/opensource/icu/project/html/ubrk_next.html
       http://www10.software.ibm.com/developerworks/opensource/icu/project/html/ubrk_close.html

       For random access, you use:

       http://www10.software.ibm.com/developerworks/opensource/icu/project/html/ubrk_following.html, or
       http://www10.software.ibm.com/developerworks/opensource/icu/project/html/ubrk_preceding.html
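
       The Java BreakIterator has counterparts for random access:
       following(int), preceding(int), and isBoundary(int). The
       sketch below uses them for the two cases mentioned earlier,
       cursor movement from an arbitrary offset and checking that a
       search match falls on character boundaries. The class and
       helper names are hypothetical; it is an illustration under
       those assumptions, not how ICU or the JDK use the iterator
       internally.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class RandomAccessBoundaries {
    private static BreakIterator iteratorFor(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(text);
        return it;
    }

    /** True if the match [start, end) begins and ends on character boundaries. */
    static boolean isWholeCharacterMatch(String text, int start, int end) {
        BreakIterator it = iteratorFor(text);
        return it.isBoundary(start) && it.isBoundary(end);
    }

    /** Cursor-right: the first boundary after offset, or the end of the text. */
    static int cursorRight(String text, int offset) {
        int next = iteratorFor(text).following(offset);
        return next == BreakIterator.DONE ? text.length() : next;
    }

    /** Cursor-left: the last boundary before offset, or the start of the text. */
    static int cursorLeft(String text, int offset) {
        int prev = iteratorFor(text).preceding(offset);
        return prev == BreakIterator.DONE ? 0 : prev;
    }

    public static void main(String[] args) {
        String text = "e\u0301x"; // "e" + combining acute accent + "x"
        // A naive search for "e" matches [0, 1), which splits the accented letter.
        System.out.println(isWholeCharacterMatch(text, 0, 1)); // prints false
        System.out.println(cursorRight(text, 0));              // prints 2
        System.out.println(cursorLeft(text, 3));               // prints 2
    }
}
```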

       Mark

       Tex Texin wrote:

> Dear Uni-people,
>
> I am of course a supporter and a beneficiary of Unicode
> and its many improvements over legacy encodings. As an application
> implementer, and not a linguist, typographer, or nationalist (i.e.,
> not favoring one language or politics over another),
> I look to Unicode to provide me with standardization so I can
> provide world-wide plain-text support. I like that Unicode
> defines algorithms for bidirectional support, character properties
> and the like, and I am no longer in the business of researching
> both code pages and the algorithms for using those code pages.
> (Well, I do a lot less of it now anyway. ;-) )
>
> As a software designer, I need to understand and rely on some
> basic principles. For example, I have to have a GetNextCharacter
> routine. This list can go on and on (and on...) about abstract
> characters, versus letters, versus graphemes, but I need to
> implement something close to what a user expects (or what I can
> teach them to work with) and to have
> an element, or basic unit, that I can manipulate and design to use
> in my software.
>
> I can work with 16-bit units as Unicode defines them, and I can
> program to either provide users with behaviors based on these
> units (e.g. cursor-right moves through each unit, i.e. through
> each diacritic, tone mark, etc.) or I can
> provide users with a more complete element that users traditionally
> think of as a character (e.g. cursor-right moves to the next letter).
>
> However, Unicode does not prescribe how I recognize or work with
> elements like "ch". As an implementer, I thought GetNextCharacter
> would look for a 16-bit abstract character (or multi-16-bit
> surrogate) followed by some non-spacing elements. Apparently, this
> is not sufficient.
>
> I also thought I could look up properties of "characters" but I
> don't know where to look up the properties of "ch".
>
> I imagine some of these "multi-non-spacing character" elements
> will cause me to redesign my approach to bidi as well.
>
> Although I recognize the many benefits of Unicode, if I cannot
> understand how to reliably implement GetNextCharacter, and have
> a property table for these "multi-non-spacing character" elements,
> then many current designs for Unicode applications are now inadequate.
> For developers, the greatest benefit of Unicode was that it provided
> standardization for the basic character element.
> I suddenly feel thrown back into the multi-codepage
> quagmire of researching, researching, researching, and probably
> continually revamping and re-generalizing my software to accommodate
> new character types as I uncover them.
>
> I believe I do understand the rationales offered for why "ch"
> should not be a character. I would rather the onus be shifted to
> input methods and legacy conversion programs to determine whether
> "ch" should be encoded as a single element or not, rather than
> having all remaining software be continually analyzing this.
>
> I would not like for this note to kick off a repetition of everything
> that has already been said many times. Perhaps it is inevitable.
> I would really like a clear definition for knowing how many bytes
> or 16-bit words to read to get to the end of the current "letter",
> and how to look up the properties of that letter, its case, etc.
> These are the basic elements I need to use in my software.
> The algorithms for doing this should be able to support Slovak and
> other languages in a uniform way.
>
> tex
> --
> Progress Software: The #1 Embedded Database
> ----------------------------------------------------------------------
> Tex Texin                      Director, International Products
>
> Progress Software Corp.        Voice: +1-781-280-4271
> 14 Oak Park                    Fax:   +1-781-280-4949
> Bedford, MA 01730 USA          texin@bedford.progress.com
>
> http://www.progress.com        http://apptivity.progress.com
> ----------------------------------------------------------------------
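
For reference, the GetNextCharacter approach described in the quoted
message (one code point, possibly a surrogate pair, followed by any
non-spacing elements) might be sketched in Java as below. The class and
method names are hypothetical, and treating only NON_SPACING_MARK and
ENCLOSING_MARK as "non-spacing elements" is an assumption made for
this illustration. As the message itself observes, this approach does
not group a sequence such as "ch".

```java
public class NaiveNextCharacter {
    /**
     * Returns the index just past the "character" starting at start:
     * one code point plus any non-spacing or enclosing marks after it.
     */
    static int nextCharacterEnd(String text, int start) {
        if (start >= text.length()) {
            return text.length();
        }
        // charCount is 2 for a surrogate pair, 1 otherwise.
        int end = start + Character.charCount(text.codePointAt(start));
        while (end < text.length()) {
            int cp = text.codePointAt(end);
            int type = Character.getType(cp);
            if (type != Character.NON_SPACING_MARK && type != Character.ENCLOSING_MARK) {
                break;
            }
            end += Character.charCount(cp);
        }
        return end;
    }

    /** Properties of the whole unit are taken from its first code point. */
    static boolean isLowerCaseUnit(String text, int start) {
        return Character.isLowerCase(text.codePointAt(start));
    }

    public static void main(String[] args) {
        System.out.println(nextCharacterEnd("e\u0301x", 0)); // prints 2
        System.out.println(nextCharacterEnd("chata", 0));    // prints 1: "ch" is not grouped
    }
}
```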



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:54 EDT