Re: "ch" as in yecch

From: Tex Texin (texin@progress.com)
Date: Sun Oct 24 1999 - 19:33:17 EDT


Mark,
Thanks Mark, I am familiar with the section and look forward
to the discussion on regular expressions.

What I am requesting is a list of all the graphemes that consist
of multiple spacing characters, together with the locales (or
languages) they are used in and their properties.

For example, "ch" in Slovak and "ll" in Spanish. (These are European
and involve no combining characters, no? I know the modern Spanish
rules have changed.)

I want to be able to create a function that returns to me the
properties of a grapheme, and the grapheme can be any of:
a base character, a base plus non-spacing characters,
or base plus other bases as in "ch".
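
As a rough sketch of the first half of such a function (finding the
grapheme, before looking up its properties), the Java below takes
boundaries from java.text.BreakIterator, which keeps a base together
with its non-spacing marks, and then merges neighboring clusters that
a per-language table lists as a single grapheme. The class name
GraphemeSplitter and the MULTI_LETTER table are hypothetical. The
table holds only the two examples from this message as sample data;
it is not taken from the standard, and it only merges pairs.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Sketch: split text into graphemes, including multi-letter ones such as "ch". */
final class GraphemeSplitter {
    // HYPOTHETICAL sample data keyed by language code, stored in lower case.
    // A real table would have to come from the kind of list requested here.
    private static final Map<String, Set<String>> MULTI_LETTER = Map.of(
            "sk", Set.of("ch"),
            "es", Set.of("ll"));

    static List<String> split(String text, Locale locale) {
        // Step 1: base character plus any non-spacing marks.
        BreakIterator it = BreakIterator.getCharacterInstance(locale);
        it.setText(text);
        List<String> clusters = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            clusters.add(text.substring(start, end));
        }

        // Step 2: merge adjacent clusters that the language treats as one grapheme.
        Set<String> multi = MULTI_LETTER.getOrDefault(locale.getLanguage(), Set.of());
        List<String> graphemes = new ArrayList<>();
        for (int i = 0; i < clusters.size(); i++) {
            String current = clusters.get(i);
            if (i + 1 < clusters.size()) {
                String pair = current + clusters.get(i + 1);
                if (multi.contains(pair.toLowerCase(locale))) {
                    graphemes.add(pair);
                    i++; // the next cluster has been consumed
                    continue;
                }
            }
            graphemes.add(current);
        }
        return graphemes;
    }
}
```

With this sketch, splitting "chata" for Locale.forLanguageTag("sk")
treats "ch" as one unit, while the same call for Locale.ENGLISH
yields "c" and "h" separately.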

The reason I want this is that where my software uses the
properties to validate text, or to decide behavior, it shouldn't
care how the grapheme is composed. For example, I might be checking
whether the grapheme is upper or lower case.
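
To illustrate the case check, here is a minimal Java sketch that
classifies a grapheme the same way whether it is one code point, a
base plus non-spacing marks, or several bases. The class GraphemeCase,
its Case enum, and the classification rules (including calling "Ch"
TITLE) are my own assumptions, not something the standard defines.

```java
/** Sketch: derive a case property for a whole grapheme, however it is composed. */
final class GraphemeCase {
    enum Case { UPPER, LOWER, TITLE, MIXED, UNCASED }

    static Case caseOf(String grapheme) {
        int cased = 0, upper = 0, lower = 0;
        boolean firstCasedIsUpper = false;
        boolean sawTitlecaseCodePoint = false;

        for (int i = 0; i < grapheme.length(); ) {
            int cp = grapheme.codePointAt(i);
            i += Character.charCount(cp); // step over surrogate pairs correctly

            if (Character.isTitleCase(cp)) { // e.g. a precomposed titlecase digraph
                sawTitlecaseCodePoint = true;
                continue;
            }
            boolean isUpper = Character.isUpperCase(cp);
            boolean isLower = Character.isLowerCase(cp);
            if (!isUpper && !isLower) {
                continue; // combining marks, punctuation, digits: no case
            }
            if (cased == 0) {
                firstCasedIsUpper = isUpper;
            }
            cased++;
            if (isUpper) upper++; else lower++;
        }

        if (sawTitlecaseCodePoint) return upper == 0 ? Case.TITLE : Case.MIXED;
        if (cased == 0) return Case.UNCASED;
        if (upper == cased) return Case.UPPER;
        if (lower == cased) return Case.LOWER;
        // One leading capital followed by lower case, as in "Ch".
        if (firstCasedIsUpper && upper == 1) return Case.TITLE;
        return Case.MIXED;
    }
}
```

Under these assumptions the caller asks one question of "E" followed
by a combining acute, of "CH", and of "..." and does not need to know
how each was built.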

I guess I also have an assumption that I should state: not every
multi-spacing-character grapheme is a letter. I am not a linguist,
so I wouldn't be the one to prove or disprove this assumption.
One example might be the ellipsis "...".

I think if we had such a list, it might serve to reduce the requests
for multi-spacing character graphemes to be represented in Unicode
as single spacing characters, because the standard would then be
specifically addressing each one.

Even if the list were outside the standard but available as an
informative document, it would be a big help.

Michael Everson seems to be starting down this road with his
enumeration of the examples for !Xóõ and Hindi.

Tex

"Mark E. Davis" wrote:
>
> The standard characterizes it in 5.13, Locating Text Element Boundaries. That section is not normative, but if
> people follow it, they will interoperate. If your repertoire is limited to European characters, this means a base
> plus any number of combining marks--what people expect.
>
> Note: Unicode 3.0 clarifies this section further, and it is also discussed in TR18. (An updated version of TR18
> will be discussed at the UTC next week; it is included in the documents collected by Arnold.)
>
> Mark
>
> Tex Texin wrote:
>
> > Mark,
> > Thanks Mark, I am familiar with these. They finesse the
> > question though. Where does the standard define (for each locale
> > if needed), what is or isn't a character/letter/grapheme?
> >
> > If my application uses ICU, and some other application doesn't,
> > if we interchange data (client/server) will we get the same results
> > with the same queries/operations?
> >
> > Also, I noticed in a few places I wrote "multi-non-spacing characters"
> > where I meant "multi-spacing characters". ah well.
> >
> > tex
> >
> > "Mark E. Davis" wrote:
> > >
> > > You raise a very good point.
> > >
> > > What we use in ICU and in Java is a BreakIterator, that gives you 'character' boundaries (you can also
> > > choose word, line and sentence boundaries). Cursor movement is one area where character boundaries are
> > > useful; searching is another (if a search matches, you want to ensure that the boundaries of the match are
> > > character boundaries).
> > >
> > > You use getCharacterInstance(Locale) to get the iterator; it will then give you boundaries on text. The
> > > character properties of the whole are extrapolated from the first code point.
> > >
> > > The Java interface is at:
> > >
> > > http://java.sun.com/products/jdk/1.2/docs/api/java/text/BreakIterator.html
> > >
> > > The ICU C++ interface is at
> > >
> > > http://www10.software.ibm.com/developerworks/opensource/icu/project/html/BreakIterator.html
> > >
> > > It also has the overall documentation.
> > >
> > > The ICU C interface is in separate routines all starting with "ubrk_", e.g. to open, iterate, and close you
> > > would use the following:
> > >
> > > http://www10.software.ibm.com/developerworks/opensource/icu/project/html/ubrk_open.html
> > > http://www10.software.ibm.com/developerworks/opensource/icu/project/html/ubrk_next.html
> > > http://www10.software.ibm.com/developerworks/opensource/icu/project/html/ubrk_close.html
> > >
> > > For random access, you use:
> > > http://www10.software.ibm.com/developerworks/opensource/icu/project/html/ubrk_following.html, or
> > > http://www10.software.ibm.com/developerworks/opensource/icu/project/html/ubrk_preceding.html
> > >
> > > Mark
> > >
> > Tex Texin wrote:
> >
> > Dear Uni-people,
> >
> > I am of course a supporter and a beneficiary of Unicode
> > and its many improvements over legacy encodings. As an application
> > implementer, and not a linguist, typographer, or nationalist (i.e.,
> > not favoring one language or politics over another),
> > I look to Unicode to provide me with standardization so I can
> > provide world-wide plain-text support. I like that Unicode
> > defines algorithms for bidirectional support, character properties
> > and the like, and I am no longer in the business of researching
> > both code pages and the algorithms for using those code pages.
> > (Well, I do a lot less of it now anyway. ;-) )
> >
> > As a software designer, I need to understand and rely on some
> > basic principles. For example, I have to have a GetNextCharacter
> > routine. This list can go on and on (and on...) about abstract
> > characters, versus letters, versus graphemes, but I need to
> > implement something close to what a user expects, (or what I can
> > teach them to work with) and to have
> > an element, or basic unit, that I can manipulate and design to use
> > in my software.
> >
> > I can work with 16-bit units as Unicode defines them, and I can
> > program to either provide users with behaviors based on these
> > units (e.g. cursor-right moves through each unit, i.e. through
> > each diacritic, tone mark, etc.) or I can
> > provide users with a more complete element that users traditionally
> > think of as a character (e.g. cursor-right moves to the next letter).
> >
> > However, Unicode does not prescribe how I recognize or work
> > with elements like "ch". As an implementer, I thought GetNextCharacter
> > would look for a 16-bit abstract character (or multi-16-bit
> > surrogate) followed by some non-spacing elements. Apparently, this
> > is not sufficient.
> >
> > I also thought I could look up properties of "characters" but I
> > don't know where to look up the properties of "ch".
> >
> > I imagine some of these "multi-spacing character" elements,
> > will cause me to redesign my approach to bidi as well.
> >
> > Although I recognize the many benefits of Unicode, if I cannot
> > understand how to reliably implement GetNextCharacter, and have
> > a property table for these "multi-spacing character" elements,
> > then many current designs for Unicode applications are now inadequate.
> > For developers, the greatest benefit of Unicode was it provided
> > standardization for the basic character element.
> > I suddenly feel thrown back into the multi-codepage
> > quagmire of researching, researching, researching, and probably
> > continually revamping and re-generalizing my software to accommodate
> > new character types as I uncover them.
> >
> > I believe I do understand the rationales offered for why "ch"
> > should not be a character. I would rather the onus be shifted to
> > input methods and legacy conversion programs to determine whether
> > "ch" should be encoded as a single element or not, rather than
> > having all remaining software be continually analyzing this.
> >
> > I would not like for this note to kick off a repetition of everything
> > that has already been said many times. Perhaps it is inevitable.
> > I would really like a clear definition for knowing how many bytes
> > or 16-bit words to read to get to the end of the current "letter",
> > and how to look up the properties of that letter, its case, etc.
> > These are the basic elements I need to use in my software.
> > The algorithms for doing this should be able to support Slovak and
> > other languages in a uniform way.
> >
> > tex
> > Progress Software: The #1 Embedded Database
> > -------------------------------------------------------------------------------------------------------
> > Tex Texin Director, International Products
> >
> > Progress Software Corp. Voice: +1-781-280-4271
> > 14 Oak Park Fax: +1-781-280-4949
> > Bedford, MA 01730 USA texin@bedford.progress.com
> >
> > http://www.progress.com http://apptivity.progress.com
> > -------------------------------------------------------------------------------------------------------
