Re: "ch" as in yecch

From: Tex Texin (texin@progress.com)
Date: Mon Oct 25 1999 - 03:32:57 EDT


Thanks again Mark.
When I started this thread I wasn't sure what I wanted either. It
was just clear to me that Unicode was missing something.
Your replies helped me focus on what is needed.
I did know that the default Unicode collation did not have contractions
(as they are called in tr10) such as ch.
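
For anyone following along, here is a minimal sketch of what such a
contraction looks like in a rule-based collator. It uses Java's
java.text.RuleBasedCollator rather than ICU, and the rule string is a
made-up, lowercase-only toy (not real Slovak or UCA data) that places
"ch" as one collation element between "h" and "i". The names
ContractionDemo, RULES, collator and sorted are mine, purely for
illustration.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;

final class ContractionDemo {
    // Hypothetical toy rules: lowercase a-z only, with the digraph "ch"
    // tailored as a single collation element after "h" and before "i".
    static final String RULES =
        "< a < b < c < d < e < f < g < h < ch < i < j < k < l < m"
        + " < n < o < p < q < r < s < t < u < v < w < x < y < z";

    // Builds the collator, converting the checked ParseException so that
    // callers do not have to declare it.
    static RuleBasedCollator collator() {
        try {
            return new RuleBasedCollator(RULES);
        } catch (ParseException e) {
            throw new IllegalStateException("bad collation rules", e);
        }
    }

    // Returns a sorted copy of the given words using the toy collator.
    static String[] sorted(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, collator());
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(sorted("chata", "hora", "cesta", "idea")));
    }
}
```

With these rules "chata" should sort after "hora" and before "idea",
which a plain code point comparison would not do. The contraction lives
entirely in the collation data; nothing in the encoded text marks "ch"
as a unit.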

I am glad to look at the ICU locales. Maybe I can extract the list I
am looking for from there.

tex

"Mark E. Davis" wrote:
>
> I am really not trying to avoid your question; this is the first time it's clear what you want.
>
> The standard doesn't provide locale-specific data. The information can be derived from a UCA-conformant
> implementation using the collation sequence for the language in question, as described in the (draft) versions of
> TR18 and TR10.
>
> For example, look at the collation rules for Slovak or Croatian for ICU. Other systems may or may not make the
> information accessible.
>
> http://www10.software.ibm.com/developerworks/opensource/icu/localeexplorer/en/?_=sk_SK&
> http://www10.software.ibm.com/developerworks/opensource/icu/localeexplorer/en/?_=hr
>
> Tex Texin wrote:
>
> > Mark,
> > Thanks Mark, I am familiar with these. It finesses the
> > question though. Where does the standard define (for each locale
> > if needed), what is or isn't a character/letter/grapheme?
> >
> > If my application uses ICU, and some other application doesn't,
> > if we interchange data (client/server) will we get the same results
> > with the same queries/operations?
> >
> > Also, I noticed in a few places I wrote "multi-non-spacing characters"
> > where I meant "multi-spacing characters". ah well.
> >
> > tex
> >
> > "Mark E. Davis" wrote:
> > >
> > > You raise a very good point.
> > >
> > > What we use in ICU and in Java is a BreakIterator, which gives you 'character' boundaries (you can also
> > > choose word, line and sentence boundaries). Cursor movement is one area where character boundaries are
> > > useful; searching is another (if a search matches, you want to ensure that the boundaries of the match are
> > > character boundaries).
> > >
> > > You use getCharacterInstance(Locale) to get the iterator; it will then give you boundaries on text. The
> > > character properties of the whole are extrapolated from the first code point.
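
A minimal sketch of that usage with the Java API, assuming nothing
beyond java.text.BreakIterator. The class CharacterWalk and its methods
split and startsWithLetter are illustrative names, not part of any
library. It walks the character boundaries of a string and takes a
property for each piece from its first code point, as described above.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class CharacterWalk {
    // Splits text at the 'character' boundaries reported for the locale.
    static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(text);
        List<String> pieces = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            pieces.add(text.substring(start, end));
        }
        return pieces;
    }

    // Property of the whole piece, extrapolated from its first code point.
    static boolean startsWithLetter(String piece) {
        return Character.isLetter(piece.codePointAt(0));
    }

    public static void main(String[] args) {
        // "e" + combining acute (U+0301), followed by "chata"
        for (String piece : split("e\u0301chata", Locale.forLanguageTag("sk-SK"))) {
            System.out.println(piece + " letter=" + startsWithLetter(piece));
        }
    }
}
```

The base letter and its combining accent come back as one piece. As far
as I can tell the JDK iterator has no Slovak tailoring, so "c" and "h"
should still come back as separate pieces even with a Slovak locale.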
> > >
> > > The Java interface is at:
> > >
> > > http://java.sun.com/products/jdk/1.2/docs/api/java/text/BreakIterator.html
> > >
> > > The ICU C++ interface is at
> > >
> > > http://www10.software.ibm.com/developerworks/opensource/icu/project/html/BreakIterator.html
> > >
> > > It also has the overall documentation.
> > >
> > > The ICU C interface is in separate routines all starting with "ubrk_", e.g. to open, iterate, and close you
> > > would use the following:
> > >
> > > http://www10.software.ibm.com/developerworks/opensource/icu/project/html/ubrk_open.html
> > > http://www10.software.ibm.com/developerworks/opensource/icu/project/html/ubrk_next.html
> > > http://www10.software.ibm.com/developerworks/opensource/icu/project/html/ubrk_close.html
> > >
> > > For random access, you use:
> > > http://www10.software.ibm.com/developerworks/opensource/icu/project/html/ubrk_following.html, or
> > > http://www10.software.ibm.com/developerworks/opensource/icu/project/html/ubrk_preceding.html
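
The Java BreakIterator has the same random-access calls (following,
preceding, isBoundary). A rough sketch of using them to check a search
match, with hypothetical helper names (MatchBoundaries,
isWholeCharacterMatch, nextBoundary, previousBoundary) and the default
locale assumed:

```java
import java.text.BreakIterator;

final class MatchBoundaries {
    // Creates a character-boundary iterator over the given text.
    private static BreakIterator over(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        return it;
    }

    // True if both ends of the match [start, end) fall on character boundaries.
    static boolean isWholeCharacterMatch(String text, int start, int end) {
        BreakIterator it = over(text);
        return it.isBoundary(start) && it.isBoundary(end);
    }

    // First boundary strictly after offset, or BreakIterator.DONE at the end.
    static int nextBoundary(String text, int offset) {
        return over(text).following(offset);
    }

    // Last boundary strictly before offset, or BreakIterator.DONE at the start.
    static int previousBoundary(String text, int offset) {
        return over(text).preceding(offset);
    }

    public static void main(String[] args) {
        String text = "xe\u0301y";        // x, e + combining acute, y
        int hit = text.indexOf("e");      // naive code-unit search
        System.out.println(isWholeCharacterMatch(text, hit, hit + 1));
    }
}
```

Here a naive search for "e" lands inside the accented letter, so the
match ends between the base letter and its accent and the check rejects
it.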
> > >
> > > Mark
> > >
> > Tex Texin wrote:
> >
> > Dear Uni-people,
> >
> > I am of course a supporter and a beneficiary of Unicode
> > and its many improvements over legacy encodings. As an application
> > implementer, and not a linguist, typographer, or nationalist (i.e.,
> > not favoring one language or politics over another),
> > I look to Unicode to provide me with standardization so I can
> > provide world-wide plain-text support. I like that Unicode
> > defines algorithms for bidirectional support, character properties
> > and the like, and I am no longer in the business of researching
> > both code pages and the algorithms for using those code pages.
> > (Well, I do a lot less of it now anyway. ;-) )
> >
> > As a software designer, I need to understand and rely on some
> > basic principles. For example, I have to have a GetNextCharacter
> > routine. This list can go on and on (and on...) about abstract
> > characters, versus letters, versus graphemes, but I need to
> > implement something close to what a user expects, (or what I can
> > teach them to work with) and to have
> > an element, or basic unit, that I can manipulate and design to use
> > in my software.
> >
> > I can work with 16-bit units as Unicode defines them, and I can
> > program to either provide users with behaviors based on these
> > units (e.g. cursor-right moves through each unit, i.e. through
> > each diacritic, tone mark, etc.) or I can
> > provide users with a more complete element that users traditionally
> > think of as a character (e.g. cursor-right moves to the next letter).
> >
> > However, Unicode does not prescribe how I recognize or work
> > with elements like "ch". As an implementer, I thought GetNextCharacter
> > would look for a 16-bit abstract character (or multi-16-bit
> > surrogate) followed by some non-spacing elements. Apparently, this
> > is not sufficient.
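
To make the approach just described concrete, here is a rough sketch of
that GetNextCharacter: one code point (a single 16-bit unit or a
surrogate pair) followed by any non-spacing or enclosing marks. The
names NaiveNextCharacter and nextCharacterEnd are invented for this
illustration, and treating enclosing marks like non-spacing ones is my
own assumption.

```java
final class NaiveNextCharacter {
    // Returns the index just past the "character" that begins at start:
    // one code point plus any combining marks that directly follow it.
    static int nextCharacterEnd(CharSequence text, int start) {
        int first = Character.codePointAt(text, start);
        int end = start + Character.charCount(first);   // 1 unit, or 2 for a surrogate pair
        while (end < text.length()) {
            int next = Character.codePointAt(text, end);
            int type = Character.getType(next);
            if (type != Character.NON_SPACING_MARK && type != Character.ENCLOSING_MARK) {
                break;                                   // next base character begins here
            }
            end += Character.charCount(next);
        }
        return end;
    }

    public static void main(String[] args) {
        System.out.println(nextCharacterEnd("e\u0301x", 0));  // e + combining acute, then x
        System.out.println(nextCharacterEnd("chata", 0));     // stops after "c", not "ch"
    }
}
```

It copes with combining marks and surrogate pairs, but for "chata" it
stops after the "c". Nothing in the code points says "ch" is one letter
in some languages, which is exactly the gap in question.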
> >
> > I also thought I could look up properties of "characters" but I
> > don't know where to look up the properties of "ch".
> >
> > I imagine some of these "multi-spacing character" elements
> > will cause me to redesign my approach to bidi as well.
> >
> > Although I recognize the many benefits of Unicode, if I cannot
> > understand how to reliably implement GetNextCharacter, and have
> > a property table for these "multi-spacing character" elements,
> > then many current designs for Unicode applications are now inadequate.
> > For developers, the greatest benefit of Unicode was it provided
> > standardization for the basic character element.
> > I suddenly feel thrown back into the multi-codepage
> > quagmire of researching, researching, researching, and probably
> > continually revamping and re-generalizing my software to accommodate
> > new character types as I uncover them.
> >
> > I believe I do understand the rationales offered for why "ch"
> > should not be a character. I would rather the onus be shifted to
> > input methods and legacy conversion programs to determine whether
> > "ch" should be encoded as a single element or not, rather than
> > having all remaining software be continually analyzing this.
> >
> > I would not like for this note to kick off a repetition of everything
> > that has already been said many times. Perhaps it is inevitable.
> > I would really like a clear definition for knowing how many bytes
> > or 16-bit words to read to get to the end of the current "letter",
> > and how to look up the properties of that letter, its case, etc.
> > These are the basic elements I need to use in my software.
> > The algorithms for doing this should be able to support Slovak and
> > other languages in a uniform way.
> >
> > tex
> > Progress Software: The #1 Embedded Database
> > -------------------------------------------------------------------------------------------------------
> > Tex Texin Director, International Products
> >
> > Progress Software Corp. Voice: +1-781-280-4271
> > 14 Oak Park Fax: +1-781-280-4949
> > Bedford, MA 01730 USA texin@bedford.progress.com
> >
> > http://www.progress.com http://apptivity.progress.com
> > -------------------------------------------------------------------------------------------------------
