Re: Yerushala(y)im - or Biblical Hebrew

Date: Tue Jul 29 2003 - 15:42:10 EDT


    I am trying to get a grasp on the problem. Thanks for your explanations. If
    you continue typing slowly enough, perhaps it will eventually get through.

    > And the fact that you and others arguing for the
    > canonical ordering change don't seem to recognize the distinction
    > is part of the reason why we appear to be talking past each other.

    I agree.

    But the implications of keeping the current canonical order are also
    staggering. It seems there must be extra rules* for biblical Hebrew which
    will have to be written into every keyboard, search engine, and conversion

    For example, if someone wants to search for "laim", the keyboard will have
    to insert a character such as CGJ between the two vowels before searching
    the normalized data. If the keyboard doesn't know about the required CGJ,
    then the search engine must insert it before searching. The search engine
    returns the results with the CGJ and the font used to display must know how
    to handle it. Also Uniscribe must know.
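    A minimal sketch of the search mismatch described above, using Python's
    standard unicodedata module. The spelling of "laim" as lamed, patah,
    hiriq, final mem is an assumption made for illustration, and the variable
    names are hypothetical.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER
LAMED, PATAH, HIRIQ, FINAL_MEM = "\u05DC", "\u05B7", "\u05B4", "\u05DD"

# Stored text: normalized, with a CGJ keeping patah ahead of hiriq.
stored = unicodedata.normalize("NFC", LAMED + PATAH + CGJ + HIRIQ + FINAL_MEM)

# Query as typed, with no CGJ, then normalized before searching.
typed = LAMED + PATAH + HIRIQ + FINAL_MEM
naive_query = unicodedata.normalize("NFC", typed)

# Query with the CGJ supplied by the keyboard or the search engine.
cgj_query = unicodedata.normalize("NFC", LAMED + PATAH + CGJ + HIRIQ + FINAL_MEM)

print(naive_query in stored)  # False: the vowels were reordered, and no CGJ
print(cgj_query in stored)    # True
```

    Whichever layer adds the CGJ, it has to do so before the query is
    normalized; afterwards the original vowel order is no longer recoverable.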

    Ultimately, it seems that every process will have to recognize and maintain
    only normalized data. Or am I off-base?

    And how will the keyboard know when to insert a CGJ? The user is not
    supposed to know about it. So will we program the keyboard to recognize all
    forms of "Yerushalaim"? Or perhaps we will just always insert CGJ between
    any two vowels? To me, the problem is expanding exponentially.
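    The blanket rule ("always insert CGJ between any two vowels") could be
    sketched roughly as below. The function name insert_cgj is hypothetical,
    and treating U+05B0..U+05BB as "the vowels" is an assumption made for the
    sketch, not something settled in this thread.

```python
import re
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER

# Assumption: the Hebrew vowel points are U+05B0 (sheva) through U+05BB (qubuts).
VOWEL = "[\u05B0-\u05BB]"

# A vowel immediately followed by another vowel. The lookahead leaves the
# second vowel unconsumed so runs of three or more are handled as well.
_TWO_VOWELS = re.compile(f"({VOWEL})(?={VOWEL})")

def insert_cgj(text):
    """Insert CGJ between every pair of adjacent vowel points."""
    return _TWO_VOWELS.sub(lambda m: m.group(1) + CGJ, text)

laim = "\u05DC\u05B7\u05B4\u05DD"  # lamed, patah, hiriq, final mem (assumed spelling)
fixed = insert_cgj(laim)

print(fixed == "\u05DC\u05B7\u034F\u05B4\u05DD")        # True
print(unicodedata.normalize("NFC", fixed) == fixed)     # True
```

    The rule is idempotent, so a keyboard and a converter can both apply it
    without doubling the CGJ. It does not address vowel-cant-vowel sequences.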

    > > There are many other examples of problems with the current
    > > canonical order.

    > Many other examples that aren't merely more examples of the
    > generic issue which can be addressed by CGJ insertion?

    Short List of *Extra Rules or Things I Need a Solution For

    right meteg
    left meteg after a hataf vowel
    Upper Punctum
    Lower Punctum
    Upper Double (thousands) dot, if 05C4 is the upper single (hundreds) dot
    Reversed Nun
    Any sequence of two vowels, including the "laim" example
    Any second vowel, such as for alternate pronunciation, which appears after
      the final low cant - thus a vowel-cant-vowel sequence
      (another example, I believe, is the Adonai vowel markings on the name
      of God)
    The current mix of high-low, left and right is extraordinarily and
      inordinately complex, as if it were intended to be impossible to program.

    The top 6 can be handled by adding characters to the Unicode set for
    Hebrew, if the canonical classes are set reasonably. In the meantime, we
    are trying to substitute Latin marks in the 0300 series, but there seem to
    be conflicts there. We've talked about inserting a control character and
    perhaps that would work on the next two problems, although it is not
    working at present.

    I would really have to go back and re-think the entire project if I were to
    accept canonical order as the required store order, rather than the sort
    order it was designed to be.

    Joan Wardell

    From: Kenneth Whistler
    Date: 07/28/2003 05:32
    Subject: Re: Yerushala(y)im - or Biblical Hebrew
    Joan Wardell responded to:

    > > Why can't we just fix the database? :)

    > Because changing the canonical ordering classes (in ways not
    > allowed by the stability policies) breaks the normalization
    > *algorithm* and the expected test results it is tested against.

    > If the "expected test results" are bad data, it shouldn't matter
    > then if it is consistent.

    O.k. Stop right there. The expected test results are, in fact,
    *good* data. They accurately reflect the current statement of
    the algorithm, which was the point.

    > Are you saying that somewhere there are lots of people who have worked
    > very hard to implement Hebrew as it is currently described in Unicode 3
    > and they would have to "start over" if we changed the canonical order?
    > And the biggest fear is that the data today won't be consistent with the
    > data in the new order?

    No, I am not. And the fact that you and others arguing for the
    canonical ordering change don't seem to recognize the distinction
    is part of the reason why we appear to be talking past each other.

    The reason why the UTC defends the stabilization of the Unicode
    normalization specification is generic: it is the stability of
    the specification itself which is at issue and which impacts
    implementations in libraries, databases, applications, protocols, ...
    In the case of people reporting that one or another particular
    fixed position class doesn't result in optimal text representation
    or ordering distinctions in combining marks for Hebrew, or Arabic,
    or Burmese, or ..., those considerations are utterly beside the
    point for stability of normalization per se. *Any* such changes
    to "correct" behavior would result in what would be considered
    by many others to be a fatal flaw in normalization itself.

    That is why I have been assiduously promoting an alternate approach
    (insertion of CGJ) which does *not* impact normalization, but which
    gives Biblical Hebrew a straightforward means of representing
    all the distinctions it needs to maintain, even in normalized

    > My point is that there *is* no data today, because anyone who has
    > attempted to produce biblical Hebrew data in the current canonical order
    > would have stopped and said "Wait a minute! This won't work".

    It "won't work" (by which is meant, it won't maintain all the distinctions
    you want to maintain in plain text, under the assumption that plain
    text will be normalized) under certain assumptions about how
    Biblical Hebrew data should be "spelled". It *will* work under other
    assumptions about spelling, which is what the CGJ proposal is all about.
    > That's what I'm saying. And I have no particular problem with the CGJ
    > suggestion, but it doesn't go far enough. I don't think we can use it to
    > fix meteg, a which occurs in three different positions around a low
    > vowel, yet is canonically ordered before the shin/sin dots! Will we put
    > one CGJ on the right to indicate a right meteg and one on the left to
    > indicate a left meteg?

    No. I have no objection to encoding one more meteg character,
    as has been proposed, if it is reliably distinguished from
    the existing meteg. John Hudson has already argued that
    that is enough to enable dealing with the rest of the
    rendering distinctions contextually.

    > There are many other examples of problems with the current
    > canonical order.

    Many other examples that aren't merely more examples of the
    generic issue which can be addressed by CGJ insertion?

    > The apparent simplest solution to all the problems is to correct the
    > canonical order.

    In this case the "apparent simplest" solution is actually the
    worst, for the reasons I enumerated earlier in this thread.

    > Yes, I am talking about the person writing a batch conversion from data
    > into Unicode. That would be me. If you were only suggesting we insert one
    > CGJ, I wouldn't complain.

    O.k. Don't. ;-)

    > But we are looking at re-writing the font, the keyboards, and the
    > conversion so that we can work around the numerous problems with
    > canonical order. I am selfishly preferring that you "normalizers"
    > re-write your code. :)

    I understand the impetus for this. It would be wonderful if
    the UTC could wave a magic wand over this, and then at such-and-such
    a date the problem would just go away.

    But while, sure, I can locate the particular places in the
    code for my own library implementation of normalization where the
    canonical combining classes for hiriq and patah are defined, and
    yes, it would be a simple matter for me to change two numbers
    there, here is *my* point: that doesn't fix the problem. It
    creates a new version of normalization incompatible with the
    last version, and while I can control the two numbers in my
    own source code, I can*not* control the worldwide deployment
    of everybody's normalization code in infrastructure, applications,
    and protocols. All I could do at that point would be to watch
    (in either ignorance or horror) as incompatible versions of
    normalization, rolled out asynchronously over time, started
    creating interoperability problems.
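    To make the incompatibility concrete, here is a toy sketch of the
    reordering step only (not a full normalizer, and not anyone's actual
    library code). The altered_ccc table, which simply swaps the classes of
    hiriq and patah, is a hypothetical "fix" invented for the example.

```python
import unicodedata

HIRIQ, PATAH = "\u05B4", "\u05B7"

def canonical_reorder(text, ccc=unicodedata.combining):
    """Toy canonical reordering: stable-sort each run of non-starters by class."""
    out, run = [], []
    for ch in text:
        if ccc(ch) == 0:
            out.extend(sorted(run, key=ccc))  # sorted() is stable
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)

def altered_ccc(ch):
    """Hypothetical table with the hiriq and patah classes swapped."""
    swapped = {HIRIQ: 17, PATAH: 14}
    return swapped.get(ch, unicodedata.combining(ch))

word = "\u05DC" + PATAH + HIRIQ + "\u05DD"    # lamed, patah, hiriq, final mem
old = canonical_reorder(word)                 # current classes
new = canonical_reorder(word, altered_ccc)    # hypothetical changed classes

print(old == new)                                 # False
print(canonical_reorder(old, altered_ccc) == old) # False
```

    Text that one version considers normalized gets rewritten by the other,
    so two systems exchanging the same word would never agree on its stored
    form.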

    *You* should, in fact, be concerned about such a prospect, because
    it is the Biblical Hebrew data which would be most impacted by
    inconsistent, dueling versions of Unicode normalization, if it ever
    came to that.


    This archive was generated by hypermail 2.1.5 : Tue Jul 29 2003 - 16:32:31 EDT