Re: Merging combining classes, was: New contribution N2676

From: Peter Kirk (peterkirk@qaya.org)
Date: Mon Oct 27 2003 - 06:01:56 CST


On 26/10/2003 12:51, Jony Rosenne wrote:

>While the current combining classes may cause some difficulties for Biblical
>scholars (and this isn't cut and dry yet - it isn't certain whether these
>are Unicode problem, implementation problems, missing characters or
>mis-identified characters), I have yet to see a claimed problem with pointed
>Hebrew - I mean just the points, without cantillation marks, as used for
>non-Biblical texts. And I don't count Microsoft's strange implementation
>mentioned yesterday as a Unicode problem.
>
>Jony
>
>
>
My understanding is that most of the problems with Unicode Hebrew are in
fact with the points which are sometimes used with modern Hebrew, rather
than with the accents or cantillation marks. The combining classes for
the latter, apart from meteg, are mostly correctly assigned, and
although there are some small issues which can be resolved, and a couple
of badly misleading character names, there are no major problems.

The major combining class related problems with Unicode Hebrew are
concerned with:

1) Meteg - basically part of the accent system but it was encoded in the
"Points and Punctuation" sub-block, based on Israeli standards, because
it is sometimes used in modern Hebrew. In this marginal modern Hebrew
use of meteg it is always positioned to the left of any vowel. The
problem arises in biblical Hebrew because meteg is sometimes positioned
to the right of a vowel, and also because its order relative to accents
is sometimes significant.

2) Cases of two vowels with one base character, mostly but not only in
the defective form Yerushala(y)im. These are a problem in the biblical
text and also, as Mark just pointed out, in biblical extracts quoted in
modern Hebrew.

3) This is the issue which causes significant problems in pointed modern
Hebrew as well as in the biblical text: Hebrew consonants are commonly
combined with dagesh (a dot in the middle of the letter) and a vowel
point; the consonant shin is additionally combined, in pointed text,
with either sin dot or shin dot; and meteg may be added, though only
occasionally in modern Hebrew. Logically, and commonly for typing
purposes, the sin or shin dot combines most closely with the consonant
(cf. cedilla and the inseparable dots on many Arabic letters); then the
dagesh, which modifies the pronunciation of (almost) any consonant (cf.
Arabic shadda and the IPA length mark U+02D0 - all are commonly
transliterated by doubling the consonant); then the vowel, which is
pronounced separately after the consonant; then the meteg which
effectively modifies or disambiguates the vowel. So the logical order is
<shin, sin/shin dot, dagesh, vowel, meteg>. But the canonical order is
<shin, vowel, dagesh, meteg, sin/shin dot>; up to three (and in theory
more, at least in biblical Hebrew) other characters may appear between
the base letter and the dot which fundamentally modifies it.
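
The reordering described above can be checked with a few lines of Python. This is only a sketch using the standard unicodedata module; it assumes that the Unicode data bundled with the interpreter gives these Hebrew points the same canonical combining classes as discussed in this thread, and the variable names are purely illustrative.

```python
import unicodedata

SHIN = "\u05E9"      # HEBREW LETTER SHIN
SHIN_DOT = "\u05C1"  # HEBREW POINT SHIN DOT
DAGESH = "\u05BC"    # HEBREW POINT DAGESH OR MAPIQ
PATAH = "\u05B7"     # HEBREW POINT PATAH (stands in for "vowel")
METEG = "\u05BD"     # HEBREW POINT METEG

# Logical (typing) order: <shin, shin dot, dagesh, vowel, meteg>
logical = SHIN + SHIN_DOT + DAGESH + PATAH + METEG

# Canonical combining class of every character in the sequence
classes = [(unicodedata.name(c), unicodedata.combining(c)) for c in logical]

# Normalisation sorts the marks by combining class
canonical = unicodedata.normalize("NFD", logical)

for name, ccc in classes:
    print(f"{ccc:3d}  {name}")
print([unicodedata.name(c) for c in canonical])
```

The marks come out sorted by combining class, so the shin dot ends up last, after the vowel, the dagesh and the meteg.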

Jony, this is the problem which, as I claim and have claimed before,
affects pointed modern Hebrew just as much as the biblical text. But the
question is, is it really a problem? As Ken Whistler has written in
http://www.unicode.org/faq/normalization.html, "The Unicode Standard
does not guarantee that the canonical ordering of a combining character
sequence for any particular script is the 'correct' order from a
linguistic point of view".

For rendering, there is no problem as long as a rendering engine does
what the Unicode standard (4.0 p.127) tells it to do:

> Canonical equivalence must be taken into account in rendering multiple
> accents, so that any two canonically equivalent sequences display as
> the same.

- or at least the problem is reduced to one of efficiency. But it seems
that certain software companies have decided that modern Hebrew users
prefer to see normalised text rendered quickly but incorrectly rather
than slightly more slowly but correctly. I wonder if they have consulted
with people like you, Jony, before making that decision. Perhaps they
think that this is an issue for biblical Hebrew only, but it is not.

I have just tested whether the sequences (in canonical order) <shin,
patah, dagesh, shin dot> and <shin, patah, dagesh, meteg, shin dot> are
rendered correctly in Windows 2000 by Uniscribe (version 1.468.4015.0)
and a variety of fonts. SBL Hebrew (draft) and Guttmann David render the
former correctly, because there is no positioning adjustment required in
this case (and so even Times New Roman and Arial Unicode MS render
correctly), but Ezra SIL misplaces the dagesh, Code2000 misplaces the
shin dot, and Vusillus (draft) misplaces both. When the meteg is added,
none of these fonts makes the proper positioning adjustments, although
Ezra SIL, SBL Hebrew and Vusillus do give correct results for the
logical order <shin, shin dot, dagesh, patah, meteg>. The problem is
that the reordering which the rendering engine should be doing is being
left to the fonts, although it is a task which OpenType fonts cannot do,
at least not without very complex and inefficient code.
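
To illustrate the kind of reordering meant here, the following is a rough sketch in Python of a pre-rendering step that moves the marks of each combining sequence back into the logical order <sin/shin dot, dagesh, vowel, meteg>. It is not what Uniscribe or any other engine actually does; the function names and the ranking are assumptions made for this example, and accents and other marks are simply ranked together with the vowels, keeping their relative order.

```python
import unicodedata

SHIN_DOT, SIN_DOT = "\u05C1", "\u05C2"
DAGESH, METEG = "\u05BC", "\u05BD"

def logical_rank(mark):
    """Hypothetical ranking: dot first, then dagesh, then vowels, then meteg."""
    if mark in (SHIN_DOT, SIN_DOT):
        return 0
    if mark == DAGESH:
        return 1
    if mark == METEG:
        return 3
    return 2  # vowels, accents and anything else

def to_logical_order(text):
    """Reorder the combining marks after each base character."""
    out, marks = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # New base character: flush the marks of the previous one.
            out.extend(sorted(marks, key=logical_rank))  # stable sort
            marks = []
            out.append(ch)
        else:
            marks.append(ch)
    out.extend(sorted(marks, key=logical_rank))
    return "".join(out)

# Canonical order <shin, patah, dagesh, meteg, shin dot>
canonical = "\u05E9\u05B7\u05BC\u05BD\u05C1"
reordered = to_logical_order(canonical)
print([unicodedata.name(c) for c in reordered])
```

Because only marks with non-zero combining classes are moved, and none of these four share a class, the reordered string is still canonically equivalent to the input; normalising it again gives back the canonical order.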

And then the issue is not just one of rendering. There are also issues
of searching and sorting. If I want to search for the letter sin, i.e.
shin with sin dot, with the current canonical order that search needs to
be able to find a discontinuous string with three or more intervening
characters. That is certainly grossly inefficient, and I'm not even sure
that it will work at all. As for collation, as we have discussed before,
there need to be some seriously complex combinations in the collation
data, for default or tailored collation, so that shin/sin dot and dagesh
are collated either at a higher level than the logically following vowel
point or simply as preceding it.
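
The searching problem can be illustrated as follows. This is a minimal sketch in Python, with a hypothetical helper find_sin, showing that a plain substring search for <shin, sin dot> fails on normalised text and that the search instead has to look through the whole combining sequence after each shin.

```python
import unicodedata

SHIN, SIN_DOT = "\u05E9", "\u05C2"

def find_sin(text):
    """Return the index of every shin whose combining sequence has a sin dot."""
    hits = []
    i = 0
    while i < len(text):
        # Find the end of the combining sequence that starts at i.
        j = i + 1
        while j < len(text) and unicodedata.combining(text[j]) != 0:
            j += 1
        if text[i] == SHIN and SIN_DOT in text[i + 1:j]:
            hits.append(i)
        i = j
    return hits

# <shin, sin dot, dagesh, patah, meteg> as typed, then normalised
typed = SHIN + SIN_DOT + "\u05BC\u05B7\u05BD"
normalised = unicodedata.normalize("NFD", typed)

print((SHIN + SIN_DOT) in typed)       # contiguous in the typed text
print((SHIN + SIN_DOT) in normalised)  # no longer contiguous
print(find_sin(normalised))
```

The helper scans every combining sequence in full, which is the extra work referred to above; a real search implementation would presumably also need to handle matching of the other marks.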

The issue might have been simplified if U+FB2A to U+FB4A had not been
defined as composition exceptions. I mention this only because there is
a precedent for changing the composition exception table.
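
The effect of the composition exceptions can be seen with U+FB2A (shin with shin dot). The following Python sketch assumes only the standard unicodedata module; it shows that the precomposed character is decomposed by normalisation and never recomposed, so that even in NFC the dot is separated from the shin by a vowel.

```python
import unicodedata

precomposed = "\uFB2A"  # HEBREW LETTER SHIN WITH SHIN DOT
patah = "\u05B7"        # HEBREW POINT PATAH

nfd = unicodedata.normalize("NFD", precomposed)
nfc = unicodedata.normalize("NFC", precomposed)

# With a following vowel, the shin dot moves behind the vowel.
nfc_with_vowel = unicodedata.normalize("NFC", precomposed + patah)

for label, s in (("NFD", nfd), ("NFC", nfc), ("NFC + patah", nfc_with_vowel)):
    print(label, [f"U+{ord(c):04X}" for c in s])
```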

Jony, I hope you now realise that the problems do in principle affect
modern Hebrew. If they have not been noticed so far it is only because
people have not yet been normalising text very often. But as XML becomes
widespread and its normalisation recommendations are incorporated into
software, text will start being normalised unexpectedly, and Israeli
readers of pointed Hebrew on the Web etc. will quickly start to complain
that documents cannot be viewed or searched properly. The problem is
coming, and won't go away simply by being ignored.

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/

