From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 28 2003 - 18:32:46 EDT
Joan Wardell responded to:
> > Why can't we just fix the database? :)
>
KW:
> Because changing the canonical ordering classes (in ways not
> allowed by the stability policies) breaks the normalization
> *algorithm* and the expected test results it is tested against.
>
JW:
> If the "expected test results" are bad data, it shouldn't matter
> then if it is consistent.
O.k. Stop right there. The expected test results are, in fact,
*good* data. They accurately reflect the current statement of
the algorithm, which was the point.
> Are you saying that somewhere there are lots of people who have worked
> very hard to implement Hebrew as it is currently described in Unicode 3
> and they would have to "start over" if we changed the canonical order?
> And the biggest fear is that the data today won't be consistent with
> the data in the new order?
No, I am not. And the fact that you and others arguing for the
canonical ordering change don't seem to recognize the distinction
is part of the reason why we appear to be talking past each other.
The reason why the UTC defends the stabilization of the Unicode
normalization specification is generic: it is the stability of
the specification itself which is at issue and which impacts
implementations in libraries, databases, applications, protocols, ...
In the case of people reporting that one or another particular
fixed position class doesn't result in optimal text representation
or ordering distinctions in combining marks for Hebrew, or Arabic,
or Burmese, or ..., those considerations are utterly beside the
point for stability of normalization per se. *Any* such changes
to "correct" behavior would result in what would be considered
by many others to be a fatal flaw in normalization itself.
That is why I have been assiduously promoting an alternate approach
(insertion of CGJ) which does *not* impact normalization, but which
gives Biblical Hebrew a straightforward means of representing
all the distinctions it needs to maintain, even in normalized
text.
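To illustrate the CGJ approach concretely (a sketch using Python's
unicodedata module; the sample yod+patah+hiriq sequence is chosen only
for demonstration): canonical ordering sorts a patah (combining class 17)
before a following hiriq (combining class 14) would be swapped ahead of
it, but a CGJ (U+034F, combining class 0) between the two marks blocks
the reordering:

```python
import unicodedata

# Yod + patah + hiriq: canonical ordering sorts adjacent marks by
# combining class (hiriq=14 < patah=17), so normalization swaps the
# two vowels and the intended mark order is lost.
unprotected = '\u05D9\u05B7\u05B4'      # yod, patah, hiriq
print(unicodedata.normalize('NFC', unprotected) == unprotected)
# -> False: the marks were reordered

# Inserting CGJ (U+034F, combining class 0) between the marks blocks
# canonical reordering, so the sequence survives normalization intact.
protected = '\u05D9\u05B7\u034F\u05B4'  # yod, patah, CGJ, hiriq
print(unicodedata.normalize('NFC', protected) == protected)
# -> True: order preserved, even in normalized text
```

Because CGJ has combining class 0, the sequence is stable under all
four normalization forms; no change to the normalization algorithm or
to the combining-class data is required.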
> My point is that there *is* no data today, because anyone who has
> attempted to produce biblical Hebrew data in the current canonical
> order would have stopped and said "Wait a minute! This won't work".
It "won't work" (by which is meant, it won't maintain all the distinctions
you want to maintain in plain text, under the assumption that plain
text will be normalized) under certain assumptions about how
Biblical Hebrew data should be "spelled". It *will* work under other
assumptions about spelling, which is what the CGJ proposal is all
about.
>
> That's what I'm saying. And I have no particular problem with the CGJ
> suggestion, but it doesn't go far enough. I don't think we can use it
> to fix meteg, a mark which occurs in three different positions around
> a low vowel, yet is canonically ordered before the shin/sin dots! Will
> we put one CGJ on the right to indicate a right meteg and one on the
> left to indicate a left meteg?
No. I have no objection to encoding one more meteg character,
as has been proposed, if it is reliably distinguished from
the existing meteg. John Hudson has already argued that
that is enough to enable dealing with the rest of the
rendering distinctions contextually.
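For reference, the fixed position classes at issue here can be checked
directly against the character database (a quick illustration with
Python's unicodedata module, reflecting the values in UnicodeData.txt):

```python
import unicodedata

# Canonical combining classes of the marks under discussion.
# Meteg (22) always sorts before the shin/sin dots (24, 25) under
# canonical ordering, regardless of the order in which they were typed.
for name, ch in [('METEG',    '\u05BD'),
                 ('SHIN DOT', '\u05C1'),
                 ('SIN DOT',  '\u05C2')]:
    print(name, unicodedata.combining(ch))
# METEG 22
# SHIN DOT 24
# SIN DOT 25
```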
> There are many other examples of problems with the current
> canonical order.
Many other examples that aren't merely more examples of the
generic issue which can be addressed by CGJ insertion?
>
> The apparent simplest solution to all the problems is to correct the
> canonical order.
In this case the "apparent simplest" solution is actually the
worst, for the reasons I enumerated earlier in this thread.
> Yes, I am talking about the person writing a batch conversion from
> existing data into Unicode. That would be me. If you were only
> suggesting we insert one CGJ, I wouldn't complain.
O.k. Don't. ;-)
> But we are looking at re-writing the font, the keyboards, and the
> conversion so that we can work around the numerous problems with
> canonical order. I am selfishly preferring that you "normalizers"
> re-write your code. :)
I understand the impetus for this. It would be wonderful if
the UTC could wave a magic wand over this, and then at such-and-such
a date the problem would just go away.
But while, sure, I can locate the particular places in the
code for my own library implementation of normalization where the
canonical combining classes for hiriq and patah are defined, and
yes, it would be a simple matter for me to change two numbers
there, here is *my* point: that doesn't fix the problem. It
creates a new version of normalization incompatible with the
last version, and while I can control the two numbers in my
own source code, I can*not* control the worldwide deployment
of everybody's normalization code in infrastructure, applications,
and protocols. All I could do at that point would be to watch
(in either ignorance or horror) as incompatible versions of
normalization, rolled out asynchronously over time, started
creating interoperability problems.
*You* should, in fact, be concerned about such a prospect, because
it is the Biblical Hebrew data which would be most impacted by
inconsistent, dueling versions of Unicode normalization, if it ever
came to that.
--Ken
This archive was generated by hypermail 2.1.5 : Mon Jul 28 2003 - 19:01:30 EDT