Re: "ch" as in yecch

From: Mark E. Davis (markdavis@ispchannel.com)
Date: Sat Oct 23 1999 - 17:16:32 EDT


You raise a very good point.

What we use in ICU and in Java is a BreakIterator, which gives you 'character' boundaries (you can also
choose word, line, and sentence boundaries). Cursor movement is one area where character boundaries are
useful; searching is another (if a search matches, you want to ensure that the boundaries of the match are
character boundaries).

You use getCharacterInstance(Locale) to get the iterator; it then reports the boundaries in the text you
give it. The character properties of the whole unit are extrapolated from its first code point.
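
As a rough illustration of the Java side, here is a minimal sketch that walks the character boundaries of a
string. The class name CharacterBoundaries and the helper split() are made up for this example; only
java.text.BreakIterator and its methods come from the JDK. Whether a given locale treats a sequence such as
"ch" as one unit depends on the rules the implementation ships with, so the example sticks to a base letter
followed by a combining accent.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
public class CharacterBoundaries {

    // Splits text into the units delimited by the locale's character BreakIterator.
    static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(text);

        List<String> units = new ArrayList<>();
        int start = it.first();
        // next() returns BreakIterator.DONE once the end of the text has been passed.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            units.add(text.substring(start, end));
        }
        return units;
    }

    public static void main(String[] args) {
        // 'e' followed by U+0301 COMBINING ACUTE ACCENT, then 'x': three 16-bit units.
        String text = "e\u0301x";
        List<String> units = split(text, Locale.US);
        System.out.println(units.size()); // prints 2
    }
}
```

A cursor-right routine built this way moves from one reported boundary to the next, rather than stepping
through each 16-bit unit.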

The Java interface is at:

http://java.sun.com/products/jdk/1.2/docs/api/java/text/BreakIterator.html

The ICU C++ interface is at:

http://www10.software.ibm.com/developerworks/opensource/icu/project/html/BreakIterator.html

It also has the overall documentation.

The ICU C interface is a set of separate routines, all starting with "ubrk_". For example, to open,
iterate, and close you would use the following:

http://www10.software.ibm.com/developerworks/opensource/icu/project/html/ubrk_open.html
http://www10.software.ibm.com/developerworks/opensource/icu/project/html/ubrk_next.html
http://www10.software.ibm.com/developerworks/opensource/icu/project/html/ubrk_close.html

For random access, you use:
http://www10.software.ibm.com/developerworks/opensource/icu/project/html/ubrk_following.html, or
http://www10.software.ibm.com/developerworks/opensource/icu/project/html/ubrk_preceding.html
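
For comparison, here is a sketch of the same kind of random access in Java, where the corresponding
BreakIterator methods are following(int), preceding(int), and isBoundary(int). The class BoundaryLookup and
its three helpers are hypothetical names used only for this example; the last helper shows one way to do the
search check mentioned above, on the assumption that a match is acceptable only when both of its ends fall
on character boundaries.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
public class BoundaryLookup {

    private static BreakIterator iteratorFor(String text, Locale locale) {
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(text);
        return it;
    }

    // First character boundary strictly after offset, or BreakIterator.DONE if there is none.
    static int boundaryAfter(String text, int offset, Locale locale) {
        return iteratorFor(text, locale).following(offset);
    }

    // Last character boundary strictly before offset, or BreakIterator.DONE if there is none.
    static int boundaryBefore(String text, int offset, Locale locale) {
        return iteratorFor(text, locale).preceding(offset);
    }

    // True if both ends of the match [start, end) fall on character boundaries.
    static boolean matchIsOnBoundaries(String text, int start, int end, Locale locale) {
        BreakIterator it = iteratorFor(text, locale);
        return it.isBoundary(start) && it.isBoundary(end);
    }

    public static void main(String[] args) {
        // 'e' + U+0301 COMBINING ACUTE ACCENT + 'x'; offset 1 sits between the letter and its accent.
        String text = "e\u0301x";
        System.out.println(boundaryAfter(text, 1, Locale.US));
        System.out.println(boundaryBefore(text, 1, Locale.US));
        // A naive search for "e" would match [0, 1), which splits the accented letter.
        System.out.println(matchIsOnBoundaries(text, 0, 1, Locale.US));
    }
}
```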

Mark

Tex Texin wrote:

> Dear Uni-people,
>
> I am of course a supporter and a benefiter of Unicode
> and its many improvements over legacy encodings. As an application
> implementer, and not a linguist, typographer, or nationalist (IE
> not favoring one language or politics over another),
> I look to Unicode to provide me with standardization so I can
> provide world-wide plain-text support. I like that Unicode
> defines algorithms for bidirectional support, character properties
> and the like, and I am no longer in the business of researching
> both code pages and the algorithms for using those code pages.
> (Well, I do a lot less of it now anyway. ;-) )
>
> As a software designer, I need to understand and rely on some
> basic principles. For example, I have to have a GetNextCharacter
> routine. This list can go on and on (and on...) about abstract
> characters, versus letters, versus graphemes, but I need to
> implement something close to what a user expects, (or what I can
> teach them to work with) and to have
> an element, or basic unit, that I can manipulate and design to use
> in my software.
>
> I can work with 16-bit units as Unicode defines them, and I can
> program to either provide users with behaviors based on these
> units (e.g. cursor-right moves through each unit, i.e. through
> each diacritic, tone mark, etc.) or I can
> provide users with a more complete element that users traditionally
> think of as a character (e.g. cursor-right moves to the next letter).
>
> However, Unicode does not prescribe how I recognize or work
> with elements like "ch". As an implementer, I thought GetNextCharacter
> would look for a 16-bit abstract character (or multi-16-bit
> surrogate) followed by some non-spacing elements. Apparently, this
> is not sufficient.
>
> I also thought I could look up properties of "characters" but I
> don't know where to look up the properties of "ch".
>
> I imagine some of these "multi-non-spacing character" elements,
> will cause me to redesign my approach to bidi as well.
>
> Although I recognize the many benefits of Unicode, if I cannot
> understand how to reliably implement GetNextCharacter, and have
> a property table for these "multi-non-spacing character" elements,
> then many current designs for Unicode applications are now inadequate.
> For developers, the greatest benefit of Unicode was it provided
> standardization for the basic character element.
> I suddenly feel thrown back into the multi-codepage
> quagmire of researching, researching, researching, and probably
> continually revamping and re-generalizing my software to accommodate
> new character types as I uncover them.
>
> I believe I do understand the rationales offered for why "ch"
> should not be a character. I would rather the onus be shifted to
> input methods and legacy conversion programs to determine whether
> "ch" should be encoded as a single element or not, rather than
> having all remaining software be continually analyzing this.
>
> I would not like for this note to kick off a repetition of everything
> that has already been said many times. Perhaps it is inevitable.
> I would really like a clear definition for knowing how many bytes
> or 16-bit words to read to get to the end of the current "letter",
> and how to look up the properties of that letter, its case, etc.
> These are the basic elements I need to use in my software.
> The algorithms for doing this should be able to support Slovak and
> other languages in a uniform way.
>
> tex
> --
> Progress Software: The #1 Embedded Database
> -------------------------------------------------------------------------------------------------------
> Tex Texin Director, International Products
>
> Progress Software Corp. Voice: +1-781-280-4271
> 14 Oak Park Fax: +1-781-280-4949
> Bedford, MA 01730 USA texin@bedford.progress.com
>
> http://www.progress.com http://apptivity.progress.com
> -------------------------------------------------------------------------------------------------------


