Fw: Questions on ZWNBS - for line initial holam plus alef

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 06 2003 - 19:41:41 EDT


    On Thursday, August 07, 2003 1:13 AM, Kenneth Whistler
    <kenw@sybase.com> wrote:

    > Well, yes, which is why I have been advocating it as the
    > solution to the Biblical Hebrew text representation problem.
    > I agree with you about that. But it need not be characterized
    > as "legal" in opposition to the other examples I cited above.
    > All of these sequences are "legal" and allowed by the
    > standard.

    Once again, sorry if I used the terms "ill-formed" or "well-formed"
    instead of "defective" or "non-defective" (normal?). This distinction
    in the standard does not make it easier to understand when discussing
    the interoperability of text processing, where neither ill-formed
    nor defective sequences should be used if interoperability is the
    main focus (and it is normally also the design focus of Unicode).

    The canonical equivalences (NFC, NFD, canonical ordering) are
    now needed for XML processing, and in fact they greatly reduce
    the number of ill-formed, invalid, or defective sequences (or
    whatever other bad encodings of actual text), which simplifies
    processing. Still, these equivalences don't solve every issue and
    create some of their own (and this is now a good reason to use CGJ
    to override the canonical ordering of combining diacritics).

    Of course, there may be a lot of strings created with Unicode
    which are not "ill-formed" and not canonically equivalent (per
    NFC, NFD, canonical ordering), but I won't go into that area.
    For XML, what is relevant is that it processes strings in NFC
    form and thus relies only on canonical equivalences, but XML
    will still process "defective" sequences by correctly
    processing characters according to their canonical combining
    sequences.

    I'd like to see a more formal rule for defective uses of CGJ when
    it is used to fix canonical ordering. What I suggested was to
    specify that only some sequences with CGJ would be "non-defective":
    those where the CGJ appears before a base character or between two
    combining characters. The character model then needs to be
    refined to document precisely which uses are considered
    non-defective, and which ones are not.

    So a sequence <..., ring above, CGJ, cedilla, ...> would
    not be defective, as it fixes the canonical ordering, even if
    in this case the two marks do not interact graphically (note that
    this statement supposes that the cedilla effectively appears
    below, which is not true for some languages, where the cedilla
    in fact appears like an acute accent above right...).
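
    As an illustration of the sequence above, here is a minimal sketch
    using Python's standard unicodedata module (the choice of Python,
    the base letter "a" and the helper name nfd are assumptions made for
    this example only, not part of the proposal). It shows that
    canonical ordering moves the cedilla (combining class 202) before
    the ring above (class 230), and that a CGJ (class 0) placed between
    them keeps the typed order:

```python
import unicodedata

RING_ABOVE = "\u030A"  # COMBINING RING ABOVE, combining class 230
CEDILLA = "\u0327"     # COMBINING CEDILLA, combining class 202
CGJ = "\u034F"         # COMBINING GRAPHEME JOINER, combining class 0


def nfd(text: str) -> str:
    """Return the canonical decomposition, which applies canonical ordering."""
    return unicodedata.normalize("NFD", text)


# Hypothetical sample strings: base letter "a" carrying both marks.
plain = "a" + RING_ABOVE + CEDILLA
with_cgj = "a" + RING_ABOVE + CGJ + CEDILLA

# Without CGJ the marks are sorted by combining class: cedilla comes first.
reordered = nfd(plain)

# With CGJ (class 0) between them, the marks cannot be exchanged.
preserved = nfd(with_cgj)

for label, value in (("plain", reordered), ("with CGJ", preserved)):
    print(label, " ".join(f"U+{ord(ch):04X}" for ch in value))
```

    The two normalized strings differ, so the two spellings are not
    canonically equivalent: the CGJ really does carry the ordering
    information through normalization.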

    The question of whether diacritics are actually rendered at the
    placement presupposed by their combining class is significant: it
    shows that combining classes only capture some common placement
    rules, not every case. A particular language or renderer may need
    to place diacritics in other positions, in which case the canonical
    ordering would have an impact on the rendering. That is a
    good enough reason to justify and document the use of
    CGJ as a combining class override for diacritics, a usage
    that should be restricted for interoperability.

    This has a consequence for input methods and editors:
    users can type base characters and diacritics, and the
    editor will, by default, use the canonical ordering, which the
    user may fix if needed for a particular language with a control
    command that would "swap" two misplaced diacritics,
    automatically inserting a CGJ only if it is needed because the two
    diacritics have distinct combining classes. This editor control
    command would have no other effect if executed after two
    diacritics with identical combining classes, or after a single
    diacritic, and the editor should make its best effort not to let
    the user enter ill-formed or defective sequences.
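
    To make the "swap" command concrete, here is a rough sketch in
    Python (standard unicodedata module only). The function name
    swap_last_two_marks and the exact rule are assumptions of this
    illustration: it swaps the last two combining marks and inserts a
    CGJ only when canonical ordering would otherwise undo the swap,
    i.e. when the mark that ends up first has the higher combining
    class. It leaves the text untouched after a single diacritic.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0


def swap_last_two_marks(text: str) -> str:
    """Swap the last two combining marks of text (hypothetical editor command).

    A CGJ is inserted between the swapped marks only when normalization
    would otherwise put them back in their original order.
    """
    if len(text) < 2:
        return text
    first, second = text[-2], text[-1]
    ccc_first = unicodedata.combining(first)
    ccc_second = unicodedata.combining(second)
    # Not two diacritics in a row (e.g. a single diacritic): do nothing.
    if ccc_first == 0 or ccc_second == 0:
        return text
    head = text[:-2]
    # After the swap the order is <second, first>. Canonical ordering
    # would revert it only if second has the strictly higher class.
    if ccc_second > ccc_first:
        return head + second + CGJ + first
    # Identical classes, or the swap yields canonical order: plain swap.
    return head + second + first


# Hypothetical inputs, as an editor would store them in canonical order.
swapped_distinct = swap_last_two_marks("a\u0327\u030A")  # cedilla, ring above
swapped_same = swap_last_two_marks("a\u0301\u0308")      # acute, diaeresis
single = swap_last_two_marks("a\u0301")                  # one diacritic only
```

    With these inputs, only the first call needs a CGJ; the second is a
    plain swap because both marks have class 230, and the third returns
    its input unchanged.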

    -- 
    Philippe.
    Spam not tolerated: any unsolicited message will be
    reported to your Internet service providers.
    

