Re: Yerushala(y)im - or Biblical Hebrew

From: Kenneth Whistler (
Date: Mon Jul 28 2003 - 18:32:46 EDT

    Joan Wardell responded to:

    > > Why can't we just fix the database? :)

    > Because changing the canonical ordering classes (in ways not
    > allowed by the stability policies) breaks the normalization
    > *algorithm* and the expected test results it is tested against.

    > If the "expected test results" are bad data, it shouldn't matter
    > then if it is consistent.

    O.k. Stop right there. The expected test results are, in fact,
    *good* data. They accurately reflect the current statement of
    the algorithm, which was the point.

    > Are you saying that somewhere there are lots of people who have worked
    > very hard to implement Hebrew as it is currently described in Unicode 3
    > and they would have to "start over" if we changed the canonical order?
    > And the biggest fear is that the data today won't be consistent with
    > the data in the new order?

    No, I am not. And the fact that you and others arguing for the
    canonical ordering change don't seem to recognize the distinction
    is part of the reason why we appear to be talking past each other.

    The reason why the UTC defends the stabilization of the Unicode
    normalization specification is generic: it is the stability of
    the specification itself which is at issue and which impacts
    implementations in libraries, databases, applications, protocols, ...
    In the case of people reporting that one or another particular
    fixed position class doesn't result in optimal text representation
    or ordering distinctions in combining marks for Hebrew, or Arabic,
    or Burmese, or ..., those considerations are utterly beside the
    point for stability of normalization per se. *Any* such changes
    to "correct" behavior would result in what would be considered
    by many others to be a fatal flaw in normalization itself.

    That is why I have been assiduously promoting an alternate approach
    (insertion of CGJ) which does *not* impact normalization, but which
    gives Biblical Hebrew a straightforward means of representing
    all the distinctions it needs to maintain, even in normalized text.

    > My point is that there *is* no data today, because anyone who has
    > attempted to produce biblical Hebrew data in the current canonical
    > order would have stopped and said "Wait a minute! This won't work".

    It "won't work" (by which is meant, it won't maintain all the distinctions
    you want to maintain in plain text, under the assumption that plain
    text will be normalized) under certain assumptions about how
    Biblical Hebrew data should be "spelled". It *will* work under other
    assumptions about spelling, which is what the CGJ proposal is all about.
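
The two "spellings" contrasted above can be checked with a short sketch. It uses Python's standard `unicodedata` module and assumes the lamed + patah + hiriq + final mem sequence from the end of Yerushala(y)im as the test case; the variable and helper names are illustrative only, not part of any proposal.

```python
import unicodedata

LAMED = "\u05DC"      # HEBREW LETTER LAMED
PATAH = "\u05B7"      # HEBREW POINT PATAH, canonical combining class 17
HIRIQ = "\u05B4"      # HEBREW POINT HIRIQ, canonical combining class 14
FINAL_MEM = "\u05DD"  # HEBREW LETTER FINAL MEM
CGJ = "\u034F"        # COMBINING GRAPHEME JOINER, canonical combining class 0


def nfc(text: str) -> str:
    """Normalize with the Unicode data shipped in this Python build."""
    return unicodedata.normalize("NFC", text)


def codepoints(text: str) -> str:
    """Render a string as U+XXXX values so the mark order is visible."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Spelling 1: patah then hiriq, written directly after the lamed.
plain = LAMED + PATAH + HIRIQ + FINAL_MEM

# Spelling 2: the same marks with a CGJ between them.
with_cgj = LAMED + PATAH + CGJ + HIRIQ + FINAL_MEM

# Hiriq (14) sorts ahead of patah (17), so the intended order is lost.
print(codepoints(nfc(plain)))     # U+05DC U+05B4 U+05B7 U+05DD

# CGJ has class 0, so it splits the run of marks and nothing is reordered.
print(codepoints(nfc(with_cgj)))  # U+05DC U+05B7 U+034F U+05B4 U+05DD
```

The second spelling survives normalization unchanged because canonical reordering never moves a mark across a character of combining class 0.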

    > That's what I'm saying. And I have no particular problem with the CGJ
    > suggestion, but it doesn't go far enough. I don't think we can use it
    > to fix meteg, a mark which occurs in three different positions around
    > a low vowel, yet is canonically ordered before the shin/sin dots! Will
    > we put one CGJ on the right to indicate a right meteg and one on the
    > left to indicate a left meteg?

    No. I have no objection to encoding one more meteg character,
    as has been proposed, if it is reliably distinguished from
    the existing meteg. John Hudson has already argued that
    that is enough to enable dealing with the rest of the
    rendering distinctions contextually.
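
For reference, the meteg behavior under discussion can be observed with the sketch below. It relies only on Python's standard `unicodedata` module; the choice of bet and shin as base letters is an assumption made for illustration.

```python
import unicodedata

BET = "\u05D1"       # HEBREW LETTER BET
SHIN = "\u05E9"      # HEBREW LETTER SHIN
PATAH = "\u05B7"     # HEBREW POINT PATAH, canonical combining class 17
METEG = "\u05BD"     # HEBREW POINT METEG, canonical combining class 22
SHIN_DOT = "\u05C1"  # HEBREW POINT SHIN DOT, canonical combining class 24


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


# Meteg typed before the vowel and meteg typed after it normalize to the
# same sequence, so mark order alone cannot carry a right/left distinction.
meteg_first = nfc(BET + METEG + PATAH)
vowel_first = nfc(BET + PATAH + METEG)
print(meteg_first == vowel_first)

# Meteg (22) sorts ahead of the shin dot (24), so it moves in front of it.
reordered = nfc(SHIN + SHIN_DOT + METEG)
print([f"U+{ord(ch):04X}" for ch in reordered])
```

Both spellings of bet with patah and meteg collapse to vowel-then-meteg, which is why a distinction beyond mark order is needed for the meteg positions.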

    > There are many other examples of problems with the current
    > canonical order.

    Many other examples that aren't merely more examples of the
    generic issue which can be addressed by CGJ insertion?

    > The apparent simplest solution to all the problems is to correct the
    > canonical order.

    In this case the "apparent simplest" solution is actually the
    worst, for the reasons I enumerated earlier in this thread.

    > Yes, I am talking about the person writing a batch conversion from
    > existing data into Unicode. That would be me. If you were only
    > suggesting we insert one CGJ, I wouldn't complain.

    O.k. Don't. ;-)

    > But we are looking at re-writing the font, the keyboards, and the
    > conversion so that we can work around the numerous problems with
    > canonical order. I am selfishly preferring that you "normalizers"
    > re-write your code. :)

    I understand the impetus for this. It would be wonderful if
    the UTC could wave a magic wand over this, and then at such-and-such
    a date the problem would just go away.

    But while, sure, I can locate the particular places in the
    code for my own library implementation of normalization where the
    canonical combining classes for hiriq and patah are defined, and
    yes, it would be a simple matter for me to change two numbers
    there, here is *my* point: that doesn't fix the problem. It
    creates a new version of normalization incompatible with the
    last version, and while I can control the two numbers in my
    own source code, I can*not* control the worldwide deployment
    of everybody's normalization code in infrastructure, applications,
    and protocols. All I could do at that point would be to watch
    (in either ignorance or horror) as incompatible versions of
    normalization, rolled out asynchronously over time, started
    creating interoperability problems.
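
A minimal sketch of that failure mode follows. `canonical_reorder` is a hypothetical helper that performs only the canonical reordering step, and `CHANGED_CLASSES` is an invented table that swaps the two numbers purely for illustration; it does not correspond to any actual proposal.

```python
import unicodedata

LAMED = "\u05DC"  # HEBREW LETTER LAMED
PATAH = "\u05B7"  # HEBREW POINT PATAH, published combining class 17
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ, published combining class 14

# Hypothetical "corrected" values: patah now sorts ahead of hiriq.
CHANGED_CLASSES = {PATAH: 14, HIRIQ: 17}


def canonical_reorder(text, overrides=None):
    """Stable-sort each run of nonzero-class marks by combining class.

    Only the reordering step is modeled here; decomposition and
    composition are left out. `overrides` stands in for an
    implementation whose class table differs from the published one.
    """
    overrides = overrides or {}

    def ccc(ch):
        return overrides.get(ch, unicodedata.combining(ch))

    out, run = [], []
    for ch in text:
        if ccc(ch) == 0:
            out.extend(sorted(run, key=ccc))  # flush the pending marks
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)


text = LAMED + PATAH + HIRIQ

old_result = canonical_reorder(text)                   # published classes
new_result = canonical_reorder(text, CHANGED_CLASSES)  # changed classes

# The same input now has two different "normalized" forms, so a system
# running one version cannot match text produced by the other.
print(old_result == new_result)  # False
```

Changing the numbers is trivial in one code base; the mismatch appears as soon as a second, unchanged implementation touches the same text.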

    *You* should, in fact, be concerned about such a prospect, because
    it is the Biblical Hebrew data which would be most impacted by
    inconsistent, dueling versions of Unicode normalization, if it ever
    came to that.

