Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jun 26 2003 - 21:54:08 EDT

    John,

    > At 03:36 PM 6/26/2003, Kenneth Whistler wrote:
    >
    > >Why is making use of the existing behavior of existing characters
    > >a "groanable kludge", if it has the desired effect and makes
    > >the required distinctions in text? If there is not some
    > >rendering system or font lookup showstopper here, I'm inclined
    > >to think it's a rather elegant way out of the problem.
    >
    > I think assumptions about not breaking combining mark sequences may, in
    > fact, be a showstopper. If <base+mark+mark> becomes
    > <base+mark+CtrlChar+mark>, it is reasonable to think that this will not
    > only inhibit mark re-ordering but also mark combining and mark
    > interaction. Unfortunately, this seems to be the case with every control
    > character I have been able to test, using two different rendering engines
    > (Uniscribe and InDesign ME -- although the latter already has some problems
    > with double marks in Biblical Hebrew). Perhaps we should have a specific
    > COMBINING MARK SEQUENCE CONTROL character?

    Actually, in casting around for the solution to the problem of
    introduction of format controls creating defective combining
    character sequences, it finally occurred to me that:

    U+034F COMBINING GRAPHEME JOINER

    has the requisite properties.

    It is non-visible, does not affect the display of neighboring
    characters (except incidentally, if processes choose to recognize
    sequences containing it and process them distinctly), *AND*
    it is a *combining mark*, not a format control.

    Hence, the sequence:

        <lamed, patah, CGJ, hiriq>
           0     17     0    14
        
    is *not* a defective combining character sequence, by the
    definitions in the standard. The entire sequence of three
    combining marks would have to "apply" to the lamed, but
    the fact that CGJ has (cc=0) prevents the patah from reordering
    around the hiriq under normalization.
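
A minimal Python sketch of the behavior described above, using the standard
`unicodedata` module. It assumes lamed, patah and hiriq are U+05DC, U+05B7
and U+05B4; the names `plain`, `guarded` and `classes` are illustrative only.

```python
import unicodedata

LAMED = "\u05DC"  # HEBREW LETTER LAMED, cc=0
PATAH = "\u05B7"  # HEBREW POINT PATAH, cc=17
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ, cc=14
CGJ = "\u034F"    # COMBINING GRAPHEME JOINER, cc=0


def classes(s):
    """Return the canonical combining class of each character in s."""
    return [unicodedata.combining(ch) for ch in s]


plain = LAMED + PATAH + HIRIQ          # no CGJ: subject to reordering
guarded = LAMED + PATAH + CGJ + HIRIQ  # CGJ sits between the two points

plain_nfd = unicodedata.normalize("NFD", plain)
guarded_nfd = unicodedata.normalize("NFD", guarded)

print(classes(plain), "->", classes(plain_nfd))      # [0, 17, 14] -> [0, 14, 17]
print(classes(guarded), "->", classes(guarded_nfd))  # [0, 17, 0, 14] -> [0, 17, 0, 14]
```

Canonical reordering only sorts runs of characters with non-zero combining
class, so the cc=0 CGJ splits the two points into separate runs and each
stays where it was typed.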

    Could this finally be the missing "killer app" for the CGJ?

    >
    > All that said, I disagree with Ken that this is anything like an elegant
    > way out of the problem. Forcing awkward, textually illogical and easily
    > forgettable control character usage onto *users* in order to solve a problem
    > in the Unicode Standard is not elegant, and it is unlikely to do much for
    > the reputation of the standard.

    I don't understand this contention. There is no reason, in principle,
    why this has to be surfaced to end users of Biblical Hebrew, any
    more than the messy details of embedding and override controls have to
    be surfaced to end users in order to make an interface which will
    support end user control over direction in bidirectional text.

    If CGJ is the one, then the only *real* implementation requirement would
    be that CGJ be consistently inserted (for Biblical Hebrew) between
    any pair of points applied to the same consonant. Depending on the
    particular application, this could either be hidden behind the
    input method/keyboard and be actively managed by the software, or
    it could be applied as a filter to an export format, when exporting
    to contexts that might neutralize intended contrasts or result in
    the wrong display by the application of normalization.
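
One way such an export filter might look, as a rough Python sketch. The
function names `is_hebrew_point` and `insert_cgj` are hypothetical, and
treating "points" as any character in U+0591..U+05C7 with a non-zero
combining class is an assumption made for illustration.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, cc=0


def is_hebrew_point(ch):
    """Assumed test: a Hebrew-block mark with non-zero combining class."""
    return "\u0591" <= ch <= "\u05C7" and unicodedata.combining(ch) != 0


def insert_cgj(text):
    """Insert CGJ between every pair of adjacent Hebrew points."""
    out = []
    prev = ""
    for ch in text:
        if prev and is_hebrew_point(prev) and is_hebrew_point(ch):
            out.append(CGJ)
        out.append(ch)
        prev = ch
    return "".join(out)


# lamed + patah + hiriq, as in the example sequence
source = "\u05DC\u05B7\u05B4"
exported = insert_cgj(source)

# After the filter, normalization leaves the order of the points alone.
print(unicodedata.normalize("NFC", exported) == exported)  # True
```

Because a CGJ is never itself counted as a point, running the filter a
second time adds nothing, so it can be applied safely to text that has
already been processed.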

    >
    > Q: 'Why do I have to insert this control character between these points?'
    > A: 'To prevent them from being re-ordered.'
    > Q: 'But why would they be re-ordered anyway? Why wouldn't they just stay in
    > the order I put them in?'
    > A: 'Because Unicode normalisation will automatically re-order the points.'
    > Q: 'But why? Points shouldn't be re-ordered: it breaks the text.'
    > A: 'Yes, but the people who decided how normalisation should work for
    > Hebrew didn't know that.'
    > Q: 'Well can't they fix it?'
    > A: 'They have: they've told you that you have to insert this control
    > character...'

    And that whole dialogue should be limited to the *programmers* only,
    whose job it is then to hide the details of how they get the
    magic to work from people who would find those details just confusing.

    > Q: 'But *I* didn't make the mistake. Why should I have to be the one to
    > mess around with this annoying control character?'
    >
    > ... and so on.
    >
    > Much as the duplication of Hebrew mark encoding may be distasteful, and
    > even considering the work that will need to be done to update layout
    > engines, fonts and documents to work with the new mark characters, I agree
    > with Peter Constable that this is by far the best long term solution,
    > especially from a *user* perspective.

    I have to disagree. It should be largely irrelevant to the user perspective.
    In this case (as in others) the users are the experts about what their
    expected requirements are for text behavior, and in particular, what
    distinctions need to be maintained. But they should not be expected
    to define the technical means for fulfilling those requirements, nor
    lean over the shoulders of the engineers to tell them how to write
    the software to accomplish it.

    > Over the past two months I have been
    > over this problem in great detail with the Society of Biblical Literature
    > and their partners in the SBL Font Foundation. They understand the problems
    > with the current normalisation, and they understand that any solution is
    > going to require document and font revisions; they're resigned to this, and
    > they've worked hard to come up with combining class assignments that would
    > actually work for all consonant + mark(s) sequences encountered in Biblical
    > Hebrew. This work forms the basis of the proposal submitted by Peter
    > Constable. Encoding of new Biblical Hebrew mark characters provides a
    > relatively simple update path for both documents and fonts, since it
    > largely involves one-to-one mappings from old characters to new.

    The alternative I've suggested is equally simple for the documents,
    since it would work by inserting CGJ between any pair of the relevant
    points, without otherwise changing any encoding. (Problem of a
    right-side meteg aside.)

    *IF* the implementations like Uniscribe do what they are supposed
    to for CGJ -- which is *not* a format control, but a combining
    character -- then the upgrade of existing fonts to account for
    the presence of CGJ should be equally straightforward. (I don't
    know if it would be *easy*.)

    >
    > Conversely, insisting on using control characters to manage mark ordering
    > in texts will require analysis to identify those sequences that will be
    > subject to re-ordering during normalisation, and individual insertion of
    > control characters.

    Nope, just insert CGJ in *all* the sequences. That blocks all reordering
    of such sequences, and you're done.

    > The fact that these control characters are invisible
    > and not obvious to users transcribing text, puts an additional burden on
    > application and font support,

    True. But then invisible bidi controls that are not obvious to
    users transcribing text also put an additional burden on
    applications.

    While I understand the straw on the camel's back argument, an
    "insert this everywhere and hide the details from the end users"
    approach wouldn't seem to be an unnecessary degree of additional
    complexity.

    > and adds another level of complexity to using
    > what are already some of the most complicated fonts in existence (how many
    > fonts do you know that come with 18 page user manuals?).

    That, of course, I am in no position to be able to judge.

    > I think it is
    > unreasonable to expect Biblical scholars to understand Unicode canonical
    > ordering to such a deep level that they are able to know where to insert
    > control characters to prevent a re-ordering that shouldn't be happening in
    > the first place.

    The approach I am suggesting doesn't require them to know anything
    about it. They can go back to what they thought they were going to
    do in the first place and forget all about canonical reordering.

    --Ken


