Re: Jumping Cursor. Was: Right-to-Left Punctuation Problem

From: Hans Aberg (haberg@math.su.se)
Date: Wed Aug 03 2005 - 08:19:15 CDT

    There is more than one way to do it. Mark describes the Unicode way,
    a pragmatic approach based in part on the capabilities of the
    software of the day. The trick is finding a unifying logical
    structure that holds everything together.

    One could, for example, have just one set of digits 0-9, used for
    the equivalent decimal numbers in all the scripts that support them;
    but then one has to add other characters to carry the semantic
    information that would otherwise be missing. These could be, say,
    characters identifying scripts, together with rules for using them.
    In addition, software able to handle all this would have to be
    developed in parallel. For example, the rendering direction would be
    resolved by checking not only the environment itself, but also the
    conditions imposed by the surrounding environment, if present. When
    copying and pasting, not only the characters but also the
    environment to which they belong would have to be copied. The
    details are left as an exercise.
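
    A minimal Python sketch of such a model, just to fix ideas: the
    ScriptRun type, its fields, and the direction-resolution rule below
    are all invented for this illustration and are not part of Unicode
    or of any existing implementation.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ScriptRun:
        """Hypothetical container: characters plus the script environment
        they belong to (the environment travels with them on cut and paste)."""
        text: str                        # shared digits 0-9, letters, etc.
        script: str                      # e.g. "Arabic", "Latin"
        direction: Optional[str] = None  # "ltr", "rtl", or None = inherit

    def resolve_direction(run: ScriptRun,
                          surrounding: Optional[ScriptRun] = None) -> str:
        """Resolve rendering direction from the run itself, falling back
        to the surrounding environment when the run does not specify one."""
        if run.direction is not None:
            return run.direction
        if surrounding is not None:
            return resolve_direction(surrounding)
        return "ltr"   # arbitrary default for this sketch

    # Shared digits inside an Arabic paragraph: the digits carry no
    # direction of their own, so they inherit it from the environment.
    paragraph = ScriptRun("", script="Arabic", direction="rtl")
    digits = ScriptRun("1905", script="Arabic")
    print(resolve_direction(digits, paragraph))   # -> rtl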

    One can note that the various mathematical styles in typical math
    usage may not be related to ordinary numbers at all. For example, a
    styled "1" might be used to denote the identity map. If, say,
    boldface numbers are used to denote Church numerals, these are
    functionals; although, as a set, they can be made to satisfy the
    axioms of the natural numbers, they are not the same objects. So the
    idea above would not work for the purely mathematical styled digits.
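
    To make the distinction concrete, here is a tiny Python illustration
    (my own, nothing to do with Unicode): a Church numeral is a
    higher-order function, and only by applying it to a successor and a
    zero does one recover the ordinary number it encodes.

    # Church numerals: the numeral n is the functional that applies f n times.
    zero = lambda f: lambda x: x
    succ = lambda n: lambda f: lambda x: f(n(f)(x))

    one = succ(zero)
    two = succ(one)

    def to_int(n):
        """Convert a Church numeral to an ordinary Python integer."""
        return n(lambda k: k + 1)(0)

    print(to_int(two))     # -> 2
    print(callable(two))   # -> True: a Church "2" is a function, not a number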

    On 3 Aug 2005, at 03:34, Mark Davis wrote:

    > The choice of whether or not to clone characters was made
    > consciously. We
    > had experience with the other model: I wrote the first
    > implementation of
    > Arabic and Hebrew on the Mac back in 1986ish, and in that
    > implementation
    > cloned the common characters, giving the clones RTL directionality.
    >
    > We found many problems with this, because identical-looking
    > characters had
    > bizarre effects when cut and pasted into different fields. Arabic
    > and Hebrew
    > users are not working in a vacuum; they will be cutting and pasting
    > in text
    > from a variety of sources, including LTR sources. Cloning
    > parentheses (or
    > interpreting them according to visual appearance) meant that every
    > program
    > that analyzed text for open/close parentheses (eg regex) failed.
    > And we
    > didn't do numbers as LSDF (least-significant digit first); that
    > would have
    > caused huge problems in compatibility because software is just not
    > set up to
    > recognize LSDF numbers. And this is not to speak of the security
    > problems
    > with these clones (see http://www.unicode.org/reports/tr36/).
    >
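
    To illustrate this point with the model actually chosen: because the
    digits of every script carry the same decimal values and the same
    most-significant-digit-first order, ordinary number-recognizing
    software handles them unchanged. A small check in Python, using only
    the standard library (the snippet is mine, for illustration only):

    import re

    # The same parsing code works on European and Arabic-Indic digits,
    # because both are decimal digits stored most-significant digit first.
    for s in ["42", "\u0664\u0662"]:       # "42" and Arabic-Indic "42"
        print(int(s))                      # -> 42 both times

    # A regex looking for numbers finds digits of either script: in
    # Python 3, \d matches any Unicode decimal digit (category Nd).
    text = "price 42 / \u0633\u0639\u0631 \u0664\u0662"
    print(re.findall(r"\d+", text))        # -> ['42', '٤٢']
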
    > Thus when it came time to do the original BIDI algorithm, we
    > decided not to
    > use the cloning approach.
    >
    > The BIDI algorithm is not an impediment to the development of software
    > globalized for BIDI. Most programs will simply use OS-supported
    > text widgets
    > that handle all the details for them. Text/Word processors can use the
    > lower-level implementations of the BIDI algorithm: there are plenty
    > of solid
    > implementations around, either supported by the OS or in libraries
    > like ICU.
    > The barriers that I have seen to people globalizing their products
    > for BIDI
    > are more the other aspects, such as dialog layout in the
    > applications, etc.
    >
    > Moreover, it would be certainly possible for a program to use
    > visual layout
    > on the screen, then translate that internal format to and from logical
    > layout for transmission as Unicode. Quite frankly, while you find
    > the BIDI
    > algorithm difficult to use, all of the other approaches had such
    > serious
    > problems that it is really the only practical approach.
    >
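
    A very small sketch of that kind of translation, assuming the
    third-party python-bidi package (the choice of library and the
    example string are assumptions of this sketch, not anything implied
    by the quoted message):

    # Requires the third-party "python-bidi" package (pip install python-bidi);
    # using this particular library is an assumption of the sketch.
    from bidi.algorithm import get_display

    # Text stored in logical (typing) order, as Unicode prescribes:
    # the Hebrew word "shalom" followed by the number 42.
    logical = "\u05e9\u05dc\u05d5\u05dd 42"

    # Translate to visual order for a renderer that simply draws
    # characters left to right; a visual-order editor would perform the
    # reverse translation before transmitting the text as Unicode.
    visual = get_display(logical)
    print(visual)
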
    > (Notwithstanding that, if I had the chance to go back in time and
    > undo a few
    > things, I would have simplified the weak processing to make numbers
    > independent of their surroundings. But that's water far, far under the
    > bridge.)
    >
    > Mark
    >
    > ----- Original Message -----
    > From: "Gregg Reynolds" <unicode@arabink.com>
    > To: "John Hudson" <tiro@tiro.com>
    > Cc: "'Unicode'" <unicode@unicode.org>
    > Sent: Tuesday, August 02, 2005 17:33
    > Subject: Re: Jumping Cursor. Was: Right-to-Left Punctuation Problem
    >
    >
    >
    >> John Hudson wrote:
    >>
    >>> Gregg Reynolds wrote:
    >>>
    >>>
    >>>> Adding to the already existing - what, 5? 6? - different ways of
    >>>> encoding each digit. Let's count the ways:
    >>>>
    >>>> 0030-0039 DIGIT ZERO etc
    >>>> 0660-0669 ARABIC-INDIC
    >>>> 06F0-06F9 EXTENDED ARABIC-INDIC
    >>>> 0966-096F DEVANAGARI
    >>>> 09E6-09EF BENGALI
    >>>> 0A66-0A6F GURMUKHI
    >>>> 0AE6-0AEF GUJARATI
    >>>> Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Tibetan,
    >>>> Myanmar, Ethiopic, Khmer, Mongolian, Limbu, Osmanya, various
    >>>> mathematical digit characters, Japanese full-width, etc. etc.
    >>>> Twenty
    >>>> one and counting.
    >>>>
    >>>
    >>>
    >>> Most of which look different, some of which function differently
    >>> (i.e.
    >>> use different counting systems that do not correspond to our decimal
    >>> digit system). I don't think there is any expectation that one
    >>> would be
    >>> able to perform cross-script arithmetic using Mongolian and Ethiopic
    >>> numeral characters. What you are proposing is something quite
    >>> other: two
    >>> ways of encoding the *same* numerals. Your new numerals would
    >>> look the
    >>> same, represent the same numbers, need to be considered the same for
    >>> searches, sorts and mathematical functions. They would be, in
    >>> fact, the
    >>> same characters encoded twice.
    >>>
    >>>
    >>
    >> Ok. I agree that is a valid observation. I think, anyway. I
    >> have to
    >> ponder it a bit more. I think it depends on what the meaning of
    >> "same"
    >> is. Aren't 0030-9 and 0660-9 really the "same"? My understanding of
    >> unicode is that it doesn't address these semantics - 0-9 are just
    >> characters, not mathematical signs. (The fact that they have the "number"
    >> property only means they all have the same formal category, not that
    >> they denote mathematical values; it could just as easily have been
    >> called the "fdsaflkh" property. It's up to a higher level
    >> protocol to
    >> interpret "fdsaflkh" characters as mathematical signs.)
    >> Mathematically,
    >> any characters that denote the mathematical values 0-9 may be
    >> considered
    >> "the same", regardless of graphical form. The latter is a mere
    >> matter
    >> of implementation (font) technology.
    >>
    >>
    >>> But this is the kicker, as already mentioned yesterday: *all* those
    >>> numerals characters you listed share the same directionality, and
    >>> all
    >>> numbers in Unicode are encoded most-significant digit first.
    >>> Maybe if
    >>>
    >>
    >> Well, typographically they are all LTR, but that is completely
    >> orthogonal to encoding syntax (polarity). It occurs to me now that
    >> you've put your finger on the problem. Which is, that these
    >> "characters" should in fact be treated as characters, and not
    >> mathematical signs, in order to be consistent (ha!) with Unicode
    >> principles. Mathematical interpretation comes in at a higher
    >> level protocol. This is consistent with Unicode design
    >> principles, as I
    >> understand them. So assume that RTL 0-9 are just another set of
    >> characters, w/out mathematical semantics, that all happen to have a
    >> property called "number". They will be treated no differently
    >> than any
    >> other RTL character w/r/t typesetting; w/r/t to math routines,
    >> they will
    >> be treated no differently than any other "number" characters (math
    >> routines must merely interpret polarity correctly.) In fact,
    >> there is
    >> no need to stipulate any graphical form. (I note that MSWord happily
    >> changes the form of numeric digit characters from European to Arabic
    >> Indic based on user preferences. Does it change the underlying
    >> encoding? Dunno, never checked.)
    >>
    >>
    >>> computing had been invented in the Middle East it would be the
    >>> other way
    >>> around, with the least significant digit encoded first, and the
    >>> various
    >>> standards would oblige all LTR writing systems to function
    >>> bidirectionally with regard to numerals.
    >>>
    >>
    >> But the point is that absolute directionality is not the only design
    >> choice. We would get along just fine with relative polarity
    >> (relative
    >> to writing direction, that is.)
    >>
    >>
    >>>
    >>> Now, when it comes to things like parentheses, the mirrored stuff
    >>> does
    >>> my head in and I really don't see the point of it. I'm guessing
    >>> that it
    >>> confuses application developers also, since it is implemented
    >>> with so
    >>> little consistency.
    >>>
    >>
    >> You can say that again. But in this respect Unicode is already
    >> obsolete. The only justification I can see for ambiguous
    >> directionality, mirroring, etc. is trying to save space (code
    >> space, I
    >> mean). Fifty years from now (or ten?) chars will be 64 bits, with an
    >> essentially infinite code space, so there will be no justification
    >> for
    >> either unification or directional ambiguity.
    >>
    >> -gregg
    >>
    >>
    >>
    >>
    >
    >
    >
    >
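
    Coming back to the question above of what the digits and the
    mirrored brackets "are" to Unicode: the character properties can be
    inspected directly. A few lines of Python with the standard
    unicodedata module, purely as an illustration:

    import unicodedata as ud

    # European and Arabic-Indic "5" share category Nd and the decimal value 5 ...
    for ch in ["5", "\u0665"]:
        print(ud.name(ch), ud.category(ch), ud.decimal(ch))

    # ... but they carry different bidirectional classes:
    # EN (European Number) versus AN (Arabic Number).
    print(ud.bidirectional("5"), ud.bidirectional("\u0665"))   # -> EN AN

    # Parentheses are not cloned per direction; a single "(" is simply
    # flagged as mirrored, and rendering swaps the glyph in RTL context.
    print(ud.mirrored("("), ud.mirrored(")"))                  # -> 1 1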


