Re: Jumping Cursor. Was: Right-to-Left Punctuation Problem

From: Gregg Reynolds (unicode@arabink.com)
Date: Tue Aug 02 2005 - 10:42:30 CDT

    John Hudson wrote:
    > Gregg Reynolds wrote:
    >
    >> Maybe it's the size of the problem I'm not understanding. To take your
    >> example, let's suppose that RTL digits 0-9 are approved tomorrow.
    >> They're no different than their LTR equivalents, except for the
    >> typesetting semantics. That is, they share the same "underlying
    >> Platonic character", if I've understood you: they mean the number
    >> three. They just have different *typographic* semantics.
    >
    >
    > There is no concept of 'typographic semantics' in Unicode. (I'll leave
    > it to the philosophers to debate whether the Unicode notion of 'abstract
    > character' is the same as your 'underlying Platonic character'.)
    >

    Maybe "typographic" isn't the right word. But Unicode definitely, if
    implicitly, encodes a set of graphical syntax rules. That's what the
    bidi classes, shaping classes, combining classes, etc. encode.

    > You are proposing encoding of separate Unicode characters for RTL
    > digits. Ergo, two possible ways to encode each digit, and a major

    Adding to the already existing - what, 5? 6? - different ways of
    encoding each digit. Let's count the ways:

            0030-0039 DIGIT ZERO etc
            0660-0669 ARABIC-INDIC
            06F0-06F9 EXTENDED ARABIC-INDIC
            0966-096F DEVANAGARI
            09E6-09EF BENGALI
            0A66-0A6F GURMUKHI
            0AE6-0AEF GUJARATI
            Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Tibetan,
            Myanmar, Ethiopic, Khmer, Mongolian, Limbu, Osmanya, various
            mathematical digit characters, Japanese full-width, etc. etc.

    Twenty-one and counting.
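
    A minimal sketch of the point above, assuming Python's standard
    unicodedata module. It takes the digit three from a few of the ranges
    listed and shows that each is a distinct code point with the same
    decimal value but its own bidi class. The table name DIGIT_ZEROS and
    the helper three_in_each_script are made up for this illustration.

```python
import unicodedata

# First code point (the zero) of several decimal-digit ranges listed above.
DIGIT_ZEROS = {
    "DIGIT": 0x0030,
    "ARABIC-INDIC": 0x0660,
    "EXTENDED ARABIC-INDIC": 0x06F0,
    "DEVANAGARI": 0x0966,
    "BENGALI": 0x09E6,
    "GURMUKHI": 0x0A66,
    "GUJARATI": 0x0AE6,
}


def three_in_each_script():
    """Return {name: (character, decimal value, bidi class)} for digit three."""
    result = {}
    for name, zero in DIGIT_ZEROS.items():
        ch = chr(zero + 3)
        result[name] = (
            ch,
            unicodedata.decimal(ch),        # numeric meaning of the character
            unicodedata.bidirectional(ch),  # layout behaviour of the character
        )
    return result


if __name__ == "__main__":
    for name, (ch, value, bidi) in three_in_each_script().items():
        print(f"U+{ord(ch):04X} {name}: value={value} bidi={bidi}")
```

    Every entry reports the value 3, while the bidi class differs between
    ranges; that is the sense in which the digit sets share a number but
    not their layout behaviour.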

    I don't see why adding additional sets of digits is problematic; Unicode
    already accommodates it.

    > rewrite of existing software (including updates to the cmap tables of
    > all Arabic and Hebrew fonts) to ensure that these two sets of characters
    > are treated as if they were the same characters for numeric searching
    > and sorting. I don't see any way to do this that doesn't involve
    > reimplementing a major aspect of RTL text processing from scratch,
    > with attendant expense and wastage of previous work.

    Depends on the architecture of the previous work. We already have the
    necessary properties: Number and RTL. All you need to do is add
    codepoints to your internal tables. Update a few cmaps. *If* you want
    to support the new characters. That's not required, any more than
    support for Thai line breaking is required for English language software.
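
    What "add codepoints to your internal tables" could look like, as a
    rough sketch only. It assumes a hypothetical set of RTL digits parked
    in the Private Use Area at U+E000..U+E009 (no such assignment exists),
    and the names RTL_DIGITS, digit_value, bidi_class and fold_digits are
    invented for the example. It deliberately ignores digit order
    (polarity) and only shows the table lookup and the folding that
    numeric searching and sorting would need.

```python
import unicodedata

# ASSUMPTION: U+E000..U+E009 is an arbitrary Private Use Area range picked
# for this sketch; Unicode defines no RTL digits there or anywhere else.
PUA_RTL_ZERO = 0xE000
RTL_DIGITS = {chr(PUA_RTL_ZERO + n): n for n in range(10)}


def digit_value(ch):
    """Decimal value of ch, or None; the hypothetical table is checked first."""
    if ch in RTL_DIGITS:
        return RTL_DIGITS[ch]
    return unicodedata.decimal(ch, None)


def bidi_class(ch):
    """Bidi class of ch; the hypothetical RTL digits are treated as 'R'."""
    if ch in RTL_DIGITS:
        return "R"
    return unicodedata.bidirectional(ch)


def fold_digits(text):
    """Replace every decimal digit with its ASCII form, one character at a time."""
    out = []
    for ch in text:
        value = digit_value(ch)
        out.append(str(value) if value is not None else ch)
    return "".join(out)
```

    Software that does not care about the new characters never consults
    the extra table; software that does gets their value and direction
    from two lookups, and fold_digits gives searching and sorting a single
    form to compare.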

    More importantly, it makes it *much* easier to adapt LTR-only software
    to support RTL languages (without having to support bidi processing,
    mind you). That's the main benefit.

    > Maybe it would have been a good
    > idea about fifteen years ago, but now it is an economic non-starter no
    > matter what one thinks of the virtue of the idea itself.

    Possibly; but nobody is required to implement new characters. Change is
    never free; but things that never change never improve.

    The idea is not to force vendors to support something they don't want to
    do, it is to remove constraints preventing developers from doing
    something they might like to do.

    >
    >> It is very clear to me that the only reason anybody uses such software
    >> is because they have no other choice, not because they are satisfied
    >> with it.
    >
    >
    > So improve the software. Determine correct behaviour for specific
    > characters and desired input methods and demand that applications get it
    > right. Ripping out the foundations because you don't like the wallpaper
    > doesn't make a lot of sense.

    Oh believe me it's on my todo list. The `Patacode paper I posted a
    while back is a start; a clear accounting of user interaction
    expectations is part of the project, as is a formal discussion of digit
    polarity in encoding design. Not to mention running code, which always
    wins. To be honest I remain unconvinced that RTL digits would cause the
    end of the world, or even much of a headache. Obviously I shall have to
    hack a piece of free software to support RTL digits in the PUA, to
    discover the actual costs, but it'll be a while before I get to that.

    But that's a lot of work; the reason I bring up this stuff on this
    thread is twofold: one, to get some idea of whether or not writing up a
    formal proposal to submit to Unicode would be a waste of time (looks
    like it); and two, to at least try to counteract the myths of inherent
    RTL bidirectionality and the "necessity" of non-Latin software to
    support Latin characters.

    (BTW, the bidi requirement is hardly wallpaper; it *is* the foundation,
    which is why it is harmful. But how does adding RTL digits amount to
    ripping out the foundation? No changes would be made to existing
    character semantics.)

    -gregg


