Re: indexing of various languages

From: Gary Grosso (gpg@arbortext.com)
Date: Fri Jul 25 1997 - 13:10:35 EDT

Next message: John Cowan: "Re: SGML entities for Unicode characters"
Previous message: Martin J. Duerst: "Re: indexing of various langauges"
Maybe in reply to: Gary Grosso: "indexing of various langauges"
Next in thread: Mark Davis: "Re: indexing of various langauges"

Hi all,

Thanks for your responses so far on this thread.

It is true that to handle an index consisting of an arbitrary mixture of
languages, we would certainly need more than 255 primary sort characters.
It's debatable who would want such an index, but as the world becomes more
internationalized, people may want such a thing as opposed to having many
indexes, one for each language.

I think that it is possible to index Japanese and Chinese within the 255
character limitation, since they are generally indexed by hiragana and
a romanization (such as pinyin), respectively. On the other hand, when the
user forgets, for example, to supply the hiragana equivalent for some Kanji
that they are indexing, it must be handled gracefully, and one solution is
just to index it directly as the Kanji, even though no one wants this
result in actual typographic practice.
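
As a minimal sketch of that fallback (not our actual code), the sort key can prefer a supplied reading and drop back to the entry text when none was given. The `IndexEntry` type and `sort_key` function are hypothetical names, and plain code point comparison stands in here for real collation weights:

```python
from typing import NamedTuple, Optional


class IndexEntry(NamedTuple):
    term: str                      # text as printed in the index
    reading: Optional[str] = None  # hiragana reading, if the author supplied one


def sort_key(entry: IndexEntry) -> str:
    # Prefer the supplied reading; otherwise fall back to the term itself,
    # so an entry with a forgotten reading still gets a defined position.
    return entry.reading if entry.reading else entry.term


entries = [
    IndexEntry("東京", "とうきょう"),
    IndexEntry("大阪"),              # reading forgotten by the author
    IndexEntry("漢字", "かんじ"),
]

# Code point order is only a stand-in for a real collation table.
ordered = sorted(entries, key=sort_key)
```

With code point order, the entry that lacks a reading sorts after all the kana-keyed entries, because the CJK ideographs sit above the hiragana block.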

Anyway, my colleague who is doing the actual coding on this area of our
project has decided on an approach meant to give us the best mix of
efficiency and usability. From his response to Jim Agenbroad:

  According to our examples, and the Unicode Standard 2.0 (page 6-62),
  the "unit of collation" for Korean is the Hangul syllable block (the
  molecule). However, the jamo (the phonetic atoms) can be used for a
  binary sort, after the syllables are decomposed. This allows the
  number of sorting weights to be far less than 255.

  The headings seem to be the leading consonants (kiyok, niun, tigut,
  etc.).
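
The decomposition described above can be sketched as follows. This is an illustration rather than the code under discussion: it uses the arithmetic layout of the precomposed Hangul syllables starting at U+AC00, the function names are hypothetical, and it handles Hangul syllables only. Whether doubled consonants get their own headings is left open here.

```python
S_BASE = 0xAC00                      # first precomposed Hangul syllable
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT          # 588 syllables per leading consonant
L_BASE = 0x1100                      # first leading-consonant jamo


def decompose(syllable):
    """Return (leading, vowel, trailing) indexes; trailing 0 means none."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < L_COUNT * N_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    return (index // N_COUNT, (index % N_COUNT) // T_COUNT, index % T_COUNT)


def sort_key(word):
    """Flatten each syllable into three small weights (all below 28)."""
    weights = []
    for ch in word:
        weights.extend(decompose(ch))
    return weights


def heading(word):
    """Pick the index heading from the leading consonant of the first syllable."""
    return chr(L_BASE + decompose(word[0])[0])


words = sorted(["다리", "나무", "가방"], key=sort_key)
```

Every weight produced this way is at most 27, which is why the jamo approach fits comfortably inside a one-byte weight.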

  I am not familiar with Amharic script, and I find no mention of it in
  the Unicode Standard. My almanac tells me Amharic is spoken in
  Ethiopia. I have modified our algorithm to switch automatically
  between one-byte and two-byte weights, so we will be able to
  accommodate whatever Amharic requires. So far, we don't seem to have
  much demand for it.
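
One way such a switch could work is sketched below; this is an assumption about the general idea, not a description of the modified algorithm. The width is chosen once for the whole set of keys, so that every key uses the same unit size and the encoded keys can still be compared as plain byte strings. All names are hypothetical.

```python
def choose_width(keys):
    """Return 1 if every weight fits in one byte, otherwise 2."""
    largest = max((w for key in keys for w in key), default=0)
    if largest > 0xFFFF:
        raise ValueError("weight does not fit in two bytes")
    return 1 if largest <= 0xFF else 2


def encode(key, width):
    """Encode weights as fixed-width big-endian units.

    Fixed-width big-endian units keep byte-wise comparison consistent
    with comparing the weight sequences themselves.
    """
    return b"".join(w.to_bytes(width, "big") for w in key)


def build_sort_keys(keys):
    """Pick one width for the whole index, then encode every key with it."""
    width = choose_width(keys)
    return width, [encode(key, width) for key in keys]


narrow = build_sort_keys([[1, 2], [3]])
wide = build_sort_keys([[1, 300], [2]])
```

Choosing the width per index rather than per key avoids comparing a one-byte key against a two-byte key, which would not sort correctly.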

  Responding for Gary Grosso,

  Paul Winder

  ArborText, Inc.
  Ann Arbor MI, USA
  pwinder@arbortext.com

I would like to add that we find this list very helpful, even though we
are mostly "lurking" on it.

Gary Grosso
ArborText, Inc.
Ann Arbor, MI, USA
gpg@arbortext.com

