From: Joan_Wardell@sil.org
Date: Tue Jul 29 2003 - 15:42:10 EDT
Ken,
I am trying to get a grasp on the problem. Thanks for your explanations. If
you continue typing slowly enough, perhaps it will eventually get through.
> And the fact that you and others arguing for the
> canonical ordering change don't seem to recognize the distinction
> is part of the reason why we appear to be talking past each other.
I agree.
But the implications of keeping the current canonical order are also
staggering. It seems there must be extra rules* for biblical Hebrew which
will have to be written into every keyboard, search engine, and conversion
table.
For example, if someone wants to search for "laim", the keyboard will have
to insert a character such as CGJ, between the two vowels before searching
the normalized data. If the keyboard doesn't know about the required CGJ,
then the search engine must insert it before searching. The search engine
returns the results with the CGJ and the font used to display must know how
to handle it. Also Uniscribe must know.
Ultimately, it seems that every process will have to recognize and maintain
only normalized data. Or am I off-base?
And how will the keyboard know when to insert a CGJ? The user is not
supposed to know about it. So will we program the keyboard to recognize all
forms of "Yerushalaim"? Or perhaps we will just always insert CGJ between
any two vowels? To me, the problem is expanding exponentially.
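
To make the "always insert CGJ between any two vowels" option concrete, here is a minimal Python sketch. The function name insert_cgj and the choice of U+05B0..U+05BB as the set of "vowels" are assumptions made for illustration, not anything specified in this thread. The insertion would have to run before the text is normalized, since normalization would already have reordered the marks.

```python
import re

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER

# Assumption: treat the Hebrew points U+05B0..U+05BB as "vowels".
VOWEL = "[\u05B0-\u05BB]"

# A vowel that is immediately followed by another vowel.
_VOWEL_BEFORE_VOWEL = re.compile(f"({VOWEL})(?={VOWEL})")

def insert_cgj(text):
    """Insert CGJ between any two adjacent vowel points (hypothetical helper)."""
    return _VOWEL_BEFORE_VOWEL.sub(lambda m: m.group(1) + CGJ, text)

# "laim" spelled as lamed, patah, hiriq, final mem (un-normalized input).
LAMED, PATAH, HIRIQ, FINAL_MEM = "\u05DC", "\u05B7", "\u05B4", "\u05DD"
typed = LAMED + PATAH + HIRIQ + FINAL_MEM
fixed = insert_cgj(typed)
```

Applied to both the stored text and the search string, a rule like this would at least make the two agree, though it does nothing for the font and rendering side of the problem.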
> > There are many other examples of problems with the current
> > canonical order.
> Many other examples that aren't merely more examples of the
> generic issue which can be addressed by CGJ insertion?
Short List of *Extra Rules, or Things I Need a Solution For:
right meteg
left meteg after a hataf vowel
Upper Punctum
Lower Punctum
Upper Double (thousands) dot, if 05C4 is the upper single (hundreds) dot
Reversed Nun
Any sequence of two vowels, including "laim" example
Any second vowel, such as for alternate pronunciation, which appears after
the final low cant - thus a vowel-cant-vowel sequence
Another example I believe is the Adonai vowel markings on the name
of God
The current mix of high-low, left and right is extraordinarily and
inordinately complex, as if it were intended to be impossible to program.
The top 6 can be handled by adding characters to the Unicode set for
Hebrew, if the canonical classes are set reasonably. In the meantime, we
are trying to substitute Latin marks in the 0300 series, but there seem to
be conflicts there. We've talked about inserting a control character and
perhaps that would work on the next two problems, although it is not
working at present.
I would really have to go back and re-think the entire project if I were to
accept canonical order as the required store order, rather than the sort
order it was designed to be.
Joan Wardell
NRSI-SIL
Kenneth Whistler <kenw@sybase.com>
07/28/2003 05:32 PM
Please respond to Kenneth Whistler

To: Joan_Wardell@sil.org
cc: unicode@unicode.org, kenw@sybase.com
Subject: Re: Yerushala(y)im - or Biblical Hebrew
Joan Wardell responded to:
> > Why can't we just fix the database? :)
>
KW:
> Because changing the canonical ordering classes (in ways not
> allowed by the stability policies) breaks the normalization
> *algorithm* and the expected test results it is tested against.
>
JW:
> If the "expected test results" are bad data, it shouldn't matter
> then if it is consistent.
O.k. Stop right there. The expected test results are, in fact,
*good* data. They accurately reflect the current statement of
the algorithm, which was the point.
> Are you saying that somewhere there are lots of people who have worked
> very hard to implement Hebrew as it is currently described in Unicode 3
> and they would have to "start over" if we changed the canonical order?
> And the biggest fear is that the data today won't be consistent with the
> data in the new order?
No, I am not. And the fact that you and others arguing for the
canonical ordering change don't seem to recognize the distinction
is part of the reason why we appear to be talking past each other.
The reason why the UTC defends the stabilization of the Unicode
normalization specification is generic: it is the stability of
the specification itself which is at issue and which impacts
implementations in libraries, databases, applications, protocols, ...
In the case of people reporting that one or another particular
fixed position class doesn't result in optimal text representation
or ordering distinctions in combining marks for Hebrew, or Arabic,
or Burmese, or ..., those considerations are utterly beside the
point for stability of normalization per se. *Any* such changes
to "correct" behavior would result in what would be considered
by many others to be a fatal flaw in normalization itself.
That is why I have been assiduously promoting an alternate approach
(insertion of CGJ) which does *not* impact normalization, but which
gives Biblical Hebrew a straightforward means of representing
all the distinctions it needs to maintain, even in normalized
text.
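
A small illustration of that point, as a sketch using Python's standard unicodedata module. It assumes the "laim" sequence is spelled lamed, patah, hiriq, final mem; the variable names are only for illustration.

```python
import unicodedata

LAMED, PATAH, HIRIQ, FINAL_MEM = "\u05DC", "\u05B7", "\u05B4", "\u05DD"
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0

plain = LAMED + PATAH + HIRIQ + FINAL_MEM
with_cgj = LAMED + PATAH + CGJ + HIRIQ + FINAL_MEM

# Without CGJ, canonical reordering sorts hiriq (class 14) before patah (class 17).
nfd_plain = unicodedata.normalize("NFD", plain)

# With CGJ between them, the two vowels are no longer in the same run of
# non-zero combining classes, so normalization leaves their order alone.
nfd_cgj = unicodedata.normalize("NFD", with_cgj)

print([f"U+{ord(c):04X}" for c in nfd_plain])
print([f"U+{ord(c):04X}" for c in nfd_cgj])
```

The first result has the vowels swapped relative to the input; the second is identical to its input.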
> My point is that there *is* no data today, because anyone who has
> attempted to produce biblical Hebrew data in the current canonical order
> would have stopped and said "Wait a minute! This won't work".
It "won't work" (by which is meant, it won't maintain all the distinctions
you want to maintain in plain text, under the assumption that plain
text will be normalized) under certain assumptions about how
Biblical Hebrew data should be "spelled". It *will* work under other
assumptions about spelling, which is what the CGJ proposal is all
about.
>
> That's what I'm saying. And I have no particular problem with the CGJ
> suggestion, but it doesn't go far enough. I don't think we can use it to
> fix meteg, a mark which occurs in three different positions around a low
> vowel, yet is canonically ordered before the shin/sin dots! Will we put
> one CGJ on the right to indicate a right meteg and one on the left to
> indicate a left meteg?
No. I have no objection to encoding one more meteg character,
as has been proposed, if it is reliably distinguished from
the existing meteg. John Hudson has already argued that
that is enough to enable dealing with the rest of the
rendering distinctions contextually.
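
For reference, the meteg ordering mentioned in the quoted question can be checked with a short sketch using Python's standard unicodedata module. The variable names are illustrative only.

```python
import unicodedata

SHIN, SHIN_DOT, SIN_DOT, METEG = "\u05E9", "\u05C1", "\u05C2", "\u05BD"

# Canonical combining classes as reported by the Unicode Character Database.
classes = {
    "meteg": unicodedata.combining(METEG),
    "shin dot": unicodedata.combining(SHIN_DOT),
    "sin dot": unicodedata.combining(SIN_DOT),
}
print(classes)

# Meteg has the lower class, so it is reordered ahead of the shin dot.
typed = SHIN + SHIN_DOT + METEG
normalized = unicodedata.normalize("NFD", typed)
```

Whatever order the dot and the meteg are typed in, normalized text always carries the meteg first, which is why position cannot be signalled by order alone.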
> There are many other examples of problems with the current
> canonical order.
Many other examples that aren't merely more examples of the
generic issue which can be addressed by CGJ insertion?
>
> The apparent simplest solution to all the problems is to correct the
> canonical order.
In this case the "apparent simplest" solution is actually the
worst, for the reasons I enumerated earlier in this thread.
> Yes, I am talking about the person writing a batch conversion from
> existing data into Unicode. That would be me. If you were only suggesting
> we insert one CGJ, I wouldn't complain.
O.k. Don't. ;-)
> But we are looking at re-writing the font, the keyboards, and the
> conversion so that we can work around the numerous problems with
> canonical order. I am selfishly preferring that you "normalizers"
> re-write your code. :)
I understand the impetus for this. It would be wonderful if
the UTC could wave a magic wand over this, and then at such-and-such
a date the problem would just go away.
But while, sure, I can locate the particular places in the
code for my own library implementation of normalization where the
canonical combining classes for hiriq and patah are defined, and
yes, it would be a simple matter for me to change two numbers
there, here is *my* point: that doesn't fix the problem. It
creates a new version of normalization incompatible with the
last version, and while I can control the two numbers in my
own source code, I can*not* control the worldwide deployment
of everybody's normalization code in infrastructure, applications,
and protocols. All I could do at that point would be to watch
(in either ignorance or horror) as incompatible versions of
normalization, rolled out asynchronously over time, started
creating interoperability problems.
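
To illustrate the kind of divergence meant here, a minimal Python sketch follows. It is not anyone's actual library code. The reorder function is a simplified stand-in for the canonical reordering step only, and the "changed" class values (hiriq and patah swapped) are purely hypothetical numbers chosen to show the effect.

```python
import unicodedata

LAMED, PATAH, HIRIQ, FINAL_MEM = "\u05DC", "\u05B7", "\u05B4", "\u05DD"

def reorder(text, ccc):
    """Stable-sort each run of non-starters by combining class (simplified)."""
    out, run = [], []
    for ch in text:
        if ccc(ch) == 0:
            out.extend(sorted(run, key=ccc))  # flush the pending run
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)

def current_ccc(ch):
    # Classes as currently published in the Unicode Character Database.
    return unicodedata.combining(ch)

def changed_ccc(ch):
    # HYPOTHETICAL: the "two numbers" changed by swapping hiriq and patah.
    return {HIRIQ: 17, PATAH: 14}.get(ch, unicodedata.combining(ch))

text = LAMED + PATAH + HIRIQ + FINAL_MEM
old_result = reorder(text, current_ccc)
new_result = reorder(text, changed_ccc)
print(old_result == new_result)
```

Two systems running these two versions would each consider the other's "normalized" Biblical Hebrew text to be un-normalized, and a binary comparison of the same word would fail.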
*You* should, in fact, be concerned about such a prospect, because
it is the Biblical Hebrew data which would be most impacted by
inconsistent, dueling versions of Unicode normalization, if it ever
came to that.
--Ken