Re: IPA and sorting

From: Mark Davis (
Date: Tue Sep 30 1997 - 21:23:23 EDT

The character to be used for indicating word separation is the ZWNBS,
not the ZWNJ.

The ZWNBS can be sorted similarly across these languages without


Martin J. Dürst wrote:

> On Tue, 23 Sep 1997, Kenneth Whistler wrote:
> > Michael Everson has suggested:
> > >
> > > In the Standard there are letters, used with the IPA like LATIN
> > > LETTER ALPHA which sorts with LATIN SMALL LETTER A -- but the
> > > current mappings to IPA also use GREEK SMALL LETTER BETA as a
> > > basic constituent of the IPA.
> > >
> > > This will cause havoc in sorting -- and one does sort IPA text, in
> > > glossaries etc. -- because two scripts are intermixed.
> > The problem with this, as for many other "clone a character to make
> > the processing for XXX easier" proposals, is that it has a
> > downside -- how to keep the two different characters straight once
> > they are cloned.
> > A preferable solution is to define IPA collation distinctly from
> > the default collation for either Latin or Greek. That would allow
> > it to be defined more correctly for IPA specifically. This is really
> > no different from the special collation overrides required to get
> > correct collation for French, Swedish, Japanese, or whatever.
> > The default collation rules are just that: default. They don't
> > have to be perfect for everything, and in fact cannot be.
> I think the problem may lie one layer higher. One may want to
> sort IPA with Latin, or as a separate block. This question usually
> doesn't arise for, e.g., French and Swedish; they are simply
> sorted together, by whatever rules the viewer wants.
> We then get the problem that some characters can be in
> more than one block. But I just met a case recently where
> I realized that we already might have that problem. As
> an example, ZWNJ is used in Thai and Khmer to indicate
> wordbreaks. For words and phrases in dictionaries, it is
> relevant and has to sort before the other letters. For
> Arabic, however, I guess it's irrelevant, because it only
> affects presentation.
> This means that sorting algorithms of a certain level
> of sophistication would have to base block decisions
> on strings of characters and not on individual codepoints.
> For ZWNJ and Thai/Arabic, that shouldn't be too difficult.
> For IPA and Latin, it may still be possible, although there
> may be cases where an easy distinction between an almost-
> Latin-looking IPA string and a Latin string with some
> "exotic" additions for a specific language may not be
> possible.
> So I think that we should rather think about broadening
> our sorting model than just duplicate more codepoints.
> That some of them are already duplicated may not be
> optimal. But in the IPA section, I only saw epsilon;
> gamma there looks different from the Greek gamma.
> I didn't find Latin alpha; it may be somewhere else,
> for a proper language and not (only) for IPA.
> Regards, Martin.
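
To make the "decide per string, not per codepoint" idea in the quoted message concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not anyone's actual collation algorithm: the function names (uses_word_break_script, sort_key) are hypothetical, Thai/Khmer detection is done crudely via Unicode character names, and the separator is a parameter that defaults to U+200C only because that is the character named in the quoted text. Pass a different character if you prefer another one for word separation.

```python
import unicodedata

# Default separator: the character named in the quoted message (ZWNJ).
# This is an assumption for illustration; pass another character via `sep`.
ZWNJ = "\u200c"


def uses_word_break_script(s):
    """Return True if any character in s is Thai or Khmer.

    Crude heuristic: looks at the Unicode character name prefix.
    """
    for ch in s:
        name = unicodedata.name(ch, "")
        if name.startswith(("THAI ", "KHMER ")):
            return True
    return False


def sort_key(s, sep=ZWNJ):
    """Build a sort key whose treatment of `sep` depends on the whole string.

    - Thai/Khmer strings: the separator is significant and sorts
      before every other character.
    - All other strings: the separator is ignored entirely.
    Remaining characters are ordered by plain code point, which is NOT
    a linguistically correct order; it just keeps the sketch short.
    """
    if uses_word_break_script(s):
        return [(0, "") if ch == sep else (1, ch) for ch in s]
    return [(1, ch) for ch in s if ch != sep]


# Thai: the entry with a separator sorts ahead of the entries without one.
thai = ["\u0e01\u0e02", "\u0e01\u0e01", "\u0e01\u200c\u0e02"]
thai_sorted = sorted(thai, key=sort_key)

# Arabic: the separator makes no difference to the key.
arabic_same = sort_key("\u0628\u200c\u0627") == sort_key("\u0628\u0627")
```

The point of the sketch is only that the decision about the separator is taken once per string, after looking at what else the string contains. The same shape would apply to IPA versus Latin, though as the quoted message notes, telling those two apart reliably is much harder than telling Thai from Arabic.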

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:37 EDT