Re: IPA and sorting

From: Mark Davis (mark_davis@taligent.com)
Date: Tue Sep 30 1997 - 21:23:23 EDT


The character to be used for indicating word separation is the ZWNBS,
not the ZWNJ.

The ZWNBS can be sorted similarly across these languages without
problem.

Mark

Martin J. Dürst wrote:

> On Tue, 23 Sep 1997, Kenneth Whistler wrote:
>
> > Michael Everson has suggested:
> > >
> > > In the Standard there are letters used with the IPA, like LATIN SMALL
> > > LETTER ALPHA, which sorts with LATIN SMALL LETTER A -- but the current
> > > mappings to IPA also use GREEK SMALL LETTER BETA as a basic constituent
> > > of the IPA.
> > >
> > > This will cause havoc in sorting -- and one does sort IPA text, in
> > > glossaries etc. -- because two scripts are intermixed.
>
> > The problem with this, as for many other "clone a character to make
> > the processing for XXX easier" proposals, is that it has a downside--
> > how to keep the two different characters straight once they are cloned.
>
> > A preferable solution is to define IPA collation distinctly from
> > the default collation for either Latin or Greek. That would allow
> > it to be defined more correctly for IPA specifically. This is really
> > no different from the special collation overrides required to get
> > correct collation for French, Swedish, Japanese, or whatever.
> > The default collation rules are just that: default. They don't
> > have to be perfect for everything, and in fact cannot be.
>
> I think the problem may lie one layer higher. One may want to
> sort IPA with Latin, or as a separate block. This usually
> doesn't appear e.g. for French and Swedish, i.e. they are
> sorted together, on whatever rules the viewer wants.
> We then get the problem that some characters can be in
> more than one block. But I just met a case recently where
> I realized that we already might have that problem. As
> an example, ZWNJ is used in Thai and Khmer to indicate
> wordbreaks. For words and phrases in dictionaries, it is
> relevant and has to sort before the other letters. For
> Arabic, however, I guess it's irrelevant, because it only
> affects presentation.
>
> This means that sorting algorithms of a certain level
> of sophistication would have to base block decisions
> on strings of characters and not on individual codepoints.
> For ZWNJ and Thai/Arabic, that shouldn't be too difficult.
> For IPA and Latin, it may still be possible, although there
> may be cases where an easy distinction between an almost-
> Latin-looking IPA string and a Latin string with some
> "exotic" additions for a specific language may not be
> possible.
>
> So I think that we should rather think about broadening
> our sorting model than just duplicating more codepoints.
> That some of them are already duplicated may not be
> optimal. But in the IPA section, I only saw epsilon;
> gamma there looks different from the Greek gamma.
> I didn't find Latin alpha; it may be somewhere else,
> for a proper language and not (only) for IPA.
>
> Regards, Martin.
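
To make the tailoring idea in the quoted discussion concrete, here is a
minimal Python sketch. It is only an illustration under stated assumptions,
not the Unicode collation rules and not anything proposed in the messages
above. The function make_sort_key and its parameters are hypothetical names,
the tiny alphabets are made up for the example, and WORD_SEP is set to U+FEFF
on the assumption that this is the character meant by "ZWNBS". The sketch
shows one sort routine driven by different per-use tables: an IPA table that
interleaves alpha and beta with the Latin letters, a dictionary table in
which the separator sorts before every letter, and a table in which the
separator is ignored.

```python
# Hypothetical illustration of per-use collation tailoring.
ALPHA = "\u0251"     # LATIN SMALL LETTER ALPHA
BETA = "\u03b2"      # GREEK SMALL LETTER BETA
WORD_SEP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE (assumed meaning of "ZWNBS")


def make_sort_key(alphabet, separators="", ignorables=""):
    """Build a sort key function from a small tailoring table.

    alphabet   -- characters listed in the desired order
    separators -- characters that sort before every letter
    ignorables -- characters dropped before comparison
    """
    weights = {ch: i + 1 for i, ch in enumerate(alphabet)}
    for ch in separators:
        weights[ch] = 0  # lower than any letter weight
    unknown = len(alphabet) + 1  # unlisted characters sort last

    def key(s):
        # Unlisted characters fall back to code point order among themselves.
        return [
            (weights.get(ch, unknown), ord(ch))
            for ch in s
            if ch not in ignorables
        ]

    return key


# IPA-style table: alpha follows a, beta follows b, regardless of script.
ipa_key = make_sort_key("a" + ALPHA + "b" + BETA + "c")
words = ["ca", BETA + "a", "ba", ALPHA + "b", "ab"]
ipa_sorted = sorted(words, key=ipa_key)

# Dictionary-style table: the separator is significant and sorts first.
dict_key = make_sort_key("abc", separators=WORD_SEP)

# Presentation-only table: the separator does not affect the order.
flat_key = make_sort_key("abc", ignorables=WORD_SEP)
```

With these tables, ipa_sorted comes out as ab, ɑb, ba, βa, ca, whereas plain
code point order would push the alpha and beta entries to the end. The same
string containing WORD_SEP sorts ahead of its unseparated form under dict_key
and compares equal to it under flat_key. Which table applies would have to be
decided per string or per list, not per code point.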




