Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jun 26 2003 - 21:54:08 EDT

Next message: Kenneth Whistler: "Re: Biblical Hebrew"

Previous message: Mark Davis: "Re: Biblical Hebrew"
Maybe in reply to: Kenneth Whistler: "Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)"
Next in thread: Peter_Constable@sil.org: "Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)"
Reply: Peter_Constable@sil.org: "Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)"
Reply: Philippe Verdy: "Re: Biblical Hebrew (U+034F Combining Grapheme Joiner works)"

John,

> At 03:36 PM 6/26/2003, Kenneth Whistler wrote:
>
> >Why is making use of the existing behavior of existing characters
> >a "groanable kludge", if it has the desired effect and makes
> >the required distinctions in text? If there is not some
> >rendering system or font lookup showstopper here, I'm inclined
> >to think it's a rather elegant way out of the problem.
>
> I think assumptions about not breaking combining mark sequences may, in
> fact, be a showstopper. If <base+mark+mark> becomes
> <base+mark+CtrlChar+mark>, it is reasonable to think that this will not
> only inhibit mark re-ordering but also mark combining and mark
> interaction. Unfortunately, this seems to be the case with every control
> character I have been able to test, using two different rendering engines
> (Uniscribe and InDesign ME -- although the latter already has some problems
> with double marks in Biblical Hebrew). Perhaps we should have a specific
> COMBINING MARK SEQUENCE CONTROL character?

Actually, in casting around for the solution to the problem of
introduction of format controls creating defective combining
character sequences, it finally occurred to me that:

U+034F COMBINING GRAPHEME JOINER

has the requisite properties.

It is non-visible, does not affect the display of neighboring
characters (except incidentally, if processes choose to recognize
sequences containing it and process them distinctly), *AND*
it is a *combining mark*, not a format control.

Hence, the sequence:

    <lamed, patah, CGJ, hiriq>
       0     17     0    14

is *not* a defective combining character sequence, by the
definitions in the standard. The entire sequence of three
combining marks would have to "apply" to the lamed, but
the fact that CGJ has (cc=0) prevents the patah from reordering
around the hiriq under normalization.
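
For anyone who wants to check this, here is a minimal sketch. It assumes Python and its standard unicodedata module, neither of which is part of the discussion above, and the results depend on the Unicode data that ships with the interpreter. It compares the bare sequence with the CGJ-guarded one under NFC.

```python
import unicodedata

LAMED = "\u05DC"  # HEBREW LETTER LAMED, cc=0
PATAH = "\u05B7"  # HEBREW POINT PATAH, cc=17
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ, cc=14
CGJ = "\u034F"    # COMBINING GRAPHEME JOINER, cc=0


def classes(s):
    """Return the canonical combining class of each character in s."""
    return [unicodedata.combining(ch) for ch in s]


plain = LAMED + PATAH + HIRIQ
guarded = LAMED + PATAH + CGJ + HIRIQ

plain_nfc = unicodedata.normalize("NFC", plain)
guarded_nfc = unicodedata.normalize("NFC", guarded)

# Without CGJ, canonical ordering sorts the points by class (14 before 17).
print(classes(plain), "->", classes(plain_nfc))
# With CGJ (cc=0) in between, the two points are never adjacent, so the
# order is left alone.
print(classes(guarded), "->", classes(guarded_nfc))
```

The first line should show the patah and hiriq swapped; the second should show the sequence unchanged.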

Could this finally be the missing "killer app" for the CGJ?

>
> All that said, I disagree with Ken that this is anything like an elegant
> way out of the problem. Forcing awkward, textually illogical and easily
> forgettable control character usage onto *users* in order to solve a problem
> in the Unicode Standard is not elegant, and it is unlikely to do much for
> the reputation of the standard.

I don't understand this contention. There is no reason, in principle,
why this has to be surfaced to end users of Biblical Hebrew, any
more than messy details of embedding override controls have to be surfaced
to end users in order to make an interface which will support end user
control over direction in bidirectional text.

If CGJ is the one, then the only *real* implementation requirement would
be that CGJ be consistently inserted (for Biblical Hebrew) between
any pair of points applied to the same consonant. Depending on the
particular application, this could either be hidden behind the
input method/keyboard and be actively managed by the software, or
it could be applied as a filter to an export format, when exporting
to contexts that might neutralize intended contrasts or result in
the wrong display by the application of normalization.
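
As a rough illustration of the export-filter variant, here is a sketch under stated assumptions: Python with the standard unicodedata module, a hypothetical helper named insert_cgj, and "point" taken to mean any character in U+0591..U+05C7 with a non-zero combining class. It is not a proposal for how any particular product should do it.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, cc=0


def is_hebrew_point(ch):
    """Assumption: a 'point' is a Hebrew-block mark with non-zero class."""
    return "\u0591" <= ch <= "\u05C7" and unicodedata.combining(ch) != 0


def insert_cgj(text):
    """Insert CGJ between every pair of adjacent Hebrew points.

    Hypothetical helper: no analysis of which pairs would actually be
    reordered; every adjacent pair gets a CGJ.
    """
    out = []
    prev_is_point = False
    for ch in text:
        point = is_hebrew_point(ch)
        if point and prev_is_point:
            out.append(CGJ)
        out.append(ch)
        prev_is_point = point
    return "".join(out)


# lamed + patah + hiriq, as typed by the user
sample = "\u05DC\u05B7\u05B4"
exported = insert_cgj(sample)

# After the filter, normalization leaves the sequence alone.
print(unicodedata.normalize("NFC", exported) == exported)
```

Because CGJ itself has class 0, running the filter a second time over already-filtered text adds nothing, so it is safe to apply on every export.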

>
> Q: 'Why do I have to insert this control character between these points?'
> A: 'To prevent them from being re-ordered.'
> Q: 'But why would they be re-ordered anyway? Why wouldn't they just stay in
> the order I put them in?'
> A: 'Because Unicode normalisation will automatically re-order the points.'
> Q: 'But why? Points shouldn't be re-ordered: it breaks the text.'
> A: 'Yes, but the people who decided how normalisation should work for
> Hebrew didn't know that.'
> Q: 'Well can't they fix it?'
> A: 'They have: they've told you that you have to insert this control
> character...'

And that whole dialogue should be limited to the *programmers* only,
whose job it is then to hide the details of how they get the
magic to work from people who would find those details just confusing.

> Q: 'But *I* didn't make the mistake. Why should I have to be the one to
> mess around with this annoying control character?'
>
> ... and so on.
>
> Much as the duplication of Hebrew mark encoding may be distasteful, and
> even considering the work that will need to be done to update layout
> engines, fonts and documents to work with the new mark characters, I agree
> with Peter Constable that this is by far the best long term solution,
> especially from a *user* perspective.

I have to disagree. It should be largely irrelevant to the user perspective.
In this case (as in others) the users are the experts about what their
expected requirements are for text behavior, and in particular, what
distinctions need to be maintained. But they should not be expected
to define the technical means for fulfilling those requirements, nor
lean over the shoulders of the engineers to tell them how to write
the software to accomplish it.

> Over the past two months I have been
> over this problem in great detail with the Society of Biblical Literature
> and their partners in the SBL Font Foundation. They understand the problems
> with the current normalisation, and they understand that any solution is
> going to require document and font revisions; they're resigned to this, and
> they've worked hard to come up with combining class assignments that would
> actually work for all consonant + mark(s) sequences encountered in Biblical
> Hebrew. This work forms the basis of the proposal submitted by Peter
> Constable. Encoding of new Biblical Hebrew mark characters provides a
> relatively simple update path for both documents and fonts, since it
> largely involves one-to-one mappings from old characters to new.

The alternative I've suggested is equally simple for the documents,
since it would work by inserting CGJ between any pair of the relevant
points, without otherwise changing any encoding. (Problem of a
right-side meteg aside.)

*IF* the implementations like Uniscribe do what they are supposed
to for CGJ -- which is *not* a format control, but a combining
character -- then the upgrade of existing fonts to account for
the presence of CGJ should be equally straightforward. (I don't
know if it would be *easy*.)

>
> Conversely, insisting on using control characters to manage mark ordering
> in texts will require analysis to identify those sequences that will be
> subject to re-ordering during normalisation, and individual insertion of
> control characters.

Nope, just insert CGJ in *all* the sequences. That blocks all reordering
of such sequences, and you're done.

> The fact that these control characters are invisible
> and not obvious to users transcribing text, puts an additional burden on
> application and font support,

True. But then invisible bidi controls that are not obvious to
users transcribing text also put an additional burden on
applications.

While I understand the straw on the camel's back argument, an
"insert this everywhere and hide the details from the end users"
approach wouldn't seem to be an unnecessary degree of additional
complexity.

> and adds another level of complexity to using
> what are already some of the most complicated fonts in existence (how many
> fonts do you know that come with 18 page user manuals?).

That, of course, I am in no position to be able to judge.

> I think it is
> unreasonable to expect Biblical scholars to understand Unicode canonical
> ordering to such a deep level that they are able to know where to insert
> control characters to prevent a re-ordering that shouldn't be happening in
> the first place.

The approach I am suggesting doesn't require them to know anything
about it. They can go back to what they thought they were going to
do in the first place and forget all about canonical reordering.

--Ken


This archive was generated by hypermail 2.1.5 : Thu Jun 26 2003 - 22:35:26 EDT