Re: Jumping Cursor. Was: Right-to-Left Punctuation Problem

From: Gregg Reynolds (unicode@arabink.com)
Date: Mon Aug 01 2005 - 23:16:03 CDT


    Kenneth Whistler wrote:
    >>I assumed that "inherent" Arabic bidirectionality was
    >>invented in the wee hours of computer history, maybe in the early
    >>sixties, so it never occurred to me that anybody on this list might take
    >>it personally.
    >
    >
    > Dear me, unexamined presuppositions can be a problem, can't they? ;)
    >
    Yeah, but they often come in very handy.

    > Visual order Arabic and Hebrew implementations on computers were
    > probably "invented" in the 70's, and saw fairly widespread use
    > in that timeframe on mainframes and later in the 80's on PC's. A
    > lot of that work was done by IBM. An inherent bidirectionality

    I figured it was IBM, but I would have guessed the 60's. Now the
    question is, why would one go for an MSD-first encoding design? My
    speculation is that, as computation was expensive in those days, they
    didn't want to mess with the math routines, and the encoding was
    probably motivated primarily by number crunching (Banks, etc.) rather
    than text processing.

    > algorithm was invented at Xerox PARC in the 80's, I think, although
    > others might have had an earlier hand in it. It was implemented
    > on the Xerox Star system in that timeframe. You can see it
    > discussed in Joe Becker's 1984 Scientific American article, for
    > example. And that was the immediate precursor of Arabic and Hebrew
    > support on the Macintosh, as well as the inspiration for the
    > Unicode bidirectional algorithm.
    >
    > [Some historians on the list can, no doubt, nail this stuff down
    > more precisely...]

    That would be very interesting. I hope they do.

    >> I really do
    >>not understand the assertions that e.g. rtl digits would be a big
    >>problem, for reasons that I've explained on other messages. Which makes
    >>me think there's something I'm overlooking. That's all.
    >
    >
    > Yes, you are.
    >
    > Cloning *any* common characters -- let alone all the digits, all
    > the common punctuation, and SPACE -- on the basis of directionality
    > differences, *would* wreak havoc on information processing. Many
    > of the characters in question are in ASCII, which means they
    > are baked into hundreds of formal languages, thousands of protocols
    > and 10's of thousands of programs and software systems. They have
    > been for decades now, and that *includes* Arabic and Hebrew
    > information processing systems.
    >
    > Making the SPACE character in Arabic and Hebrew be something *other*
    > than U+0020 SPACE, simply because it might make bidirectional
    > editors easier to write if all characters were inherently RTL for
    > Arabic, would have the effect of breaking nearly all Arabic
    > and Hebrew information processing, deep down in the guts where
    > end users can't get at it. The *only* way around it would be to

    Hmm. I guess I'm still in the dark. Existing implementations would
    still process "legacy" Unicode correctly, no? If new characters are
    added - any new characters - software must adjust, *if* it wants to.
    After all, Unicode does not require support of any particular block. So
    why not let the market decide?

    Isn't what you're saying a bit like picking winners? That is, in
    my naive way I assume that a well-designed piece of Unicode software
    could easily adapt to new characters of whatever ilk. Bad software will
    have a harder time. Software makers that want to service the RTL
    language market may adapt to RTL 0-9 etc., and users may buy their
    software. Software that doesn't care will just say "we don't support
    that, just like we don't support Thai, or Limbu, etc."  Software
    makers that want to service the market can also just stick with legacy
    Unicode digits. Let the buyers decide which products they prefer. I'm
    confident that Arabic software that didn't have the cursor weirdness
    imposed by Unicode would find a ready market. More importantly, we
    would see much, much more Arabic-enabled software without the bidi
    requirement. Make that much, much, much more.

    > introduce such things effectively all pre-deprecated with canonical
    > equivalences to the existing characters, so that at least normalized
    > data would behave correctly and be interpreted correctly. But then
    > there would be no supportable reason for introducing them in
    > the first place.
    >
    > And you haven't thought through the consequences of having duplicated
    > digits with different directionality. You might think an end

    Ahem, I think I have, at least for applications like a word processor.
    But I don't have enough experience with the kinds of things you mention
    below - computers passing text around - to really judge the impact
    there. I don't see any big problem, since they would be Unicode
    codepoints with well-defined semantics.

    > user has complete control over what they do, with their keyboard
    > and their choice of characters -- but text is now *global* data,
    > and much of what goes on with data is automated, and consists
    > of programs talking to programs through protocols. Once you unleash
    > different users using what claims to be the *same* character
    > encoding, but with opposite conventions about *which* digits they
    > use and what direction those flow, you will inevitably get
    > into the situation where one process or another cannot reliably
    > tell whether "1234" is to be interpreted as 1234 or 4321.

    I don't see that. First of all, 3-RTL and 3-LTR are not the same
    character. They look alike, and they are classified as numbers, but
    that's all. The new 3-RTL is just another Unicode character with
    various properties, just like any other.

    If you get the string 1234 in LTR digits you know the first digit is the
    MSD and it should be typeset at the extreme left of the string. If they
    are RTL digits then the first digit is the LSD, and it must be typeset
    at the extreme right. Where is the havoc? Note, BTW, that Unicode
    stipulates a default directionality for digits. It doesn't (but should)
    stipulate that the first digit in a digit sequence is the MSD. So adopting RTL
    digits would be accompanied by making this explicit - LTR digit strings
    are MSD first, no matter what script they are in, and RTL digit strings
    are LSD first. Even if Unicode doesn't want to indicate mathematical
    values for the characters, applications can know. Again, no different
    than the current state of affairs, where applications must know how to
    typeset Unicode chars based on their properties.

    Malformed strings - mixing RTL and LTR digits - would be a problem, but
    no more than any other malformed string, like a digit string with latin
    chars interspersed, or too many decimal points, etc.
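
    Concretely, here's the kind of thing I have in mind, in a few lines of
    Python. This is only a sketch, and the RTL digit codepoints are made
    up - I've borrowed U+E660..U+E669 from the Private Use Area purely for
    illustration, since no such characters exist. An application that
    knows each digit's direction property can recover the numeric value
    without ambiguity, and can reject mixed-direction strings as malformed
    like any other garbage input:

    # Hypothetical RTL digits at U+E660..U+E669 (Private Use Area, purely
    # for illustration - no such characters exist in Unicode today).
    RTL_ZERO = 0xE660
    LTR_DIGITS = "0123456789"

    def classify(ch):
        """Return (value, direction) for a digit, or None for anything else."""
        if ch in LTR_DIGITS:
            return ord(ch) - ord("0"), "LTR"
        if RTL_ZERO <= ord(ch) <= RTL_ZERO + 9:
            return ord(ch) - RTL_ZERO, "RTL"
        return None

    def numeric_value(s):
        """LTR digit strings are MSD first; RTL digit strings are LSD first.
        Mixing the two kinds is rejected as malformed."""
        digits = [classify(ch) for ch in s]
        if None in digits:
            raise ValueError("not a pure digit string")
        directions = {d for _, d in digits}
        if len(directions) != 1:
            raise ValueError("malformed: LTR and RTL digits mixed")
        values = [v for v, _ in digits]
        if directions == {"RTL"}:
            values.reverse()   # least significant codepoint came first
        return int("".join(str(v) for v in values))

    # The same number, once as legacy LTR digits (MSD first in memory) and
    # once as hypothetical RTL digits (LSD first in memory):
    assert numeric_value("1234") == 1234
    assert numeric_value("".join(chr(RTL_ZERO + d) for d in (4, 3, 2, 1))) == 1234

    On the wire the two spellings are distinct codepoint sequences, so a
    receiving process never has to guess which convention the sender used.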

    As far as keyboarding is concerned, well, the user doesn't know now how
    code points are stored; why should it be any different with new RTL
    chars?
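
    (A sketch of the keyboarding point, again with the made-up
    U+E660..U+E669 digits from above: a layout simply emits a different
    codepoint for the same physical key, and the codepoints are stored in
    the order the keys are pressed either way.)

    # Hypothetical layouts mapping the same physical keys to different digits.
    LATIN_LAYOUT = {str(d): chr(0x0030 + d) for d in range(10)}
    RTL_LAYOUT = {str(d): chr(0xE660 + d) for d in range(10)}  # made-up digits

    def type_keys(keys, layout):
        # Storage order follows typing order, whatever the display direction.
        return "".join(layout[k] for k in keys)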

    This gets to an aspect of Unicode that I haven't personally seen
    articulated, namely that it stipulates not only glyphs but also line
    composition. It is as much a typesetting standard as a character standard.

    > That alone
    > is enough for the whole proposal to be completely dead in the water.
    > All the proposal would accomplish is to create massive ambiguity
    > about what the representation of a given piece of Hebrew or
    > Arabic text should be -- and that is a *bad* thing in a character
    > encoding.

    I still have trouble seeing where the ambiguity is. If you can tell me
    exactly what is ambiguous I would appreciate it. Each character has
    semantics - e.g. the number three - a glyph, and a typographic rule.
    This is no different than any other character in Unicode. If you see a
    glyph "3", it means three. If you see it to the left of two other
    digits in an RTL context, it means "3 x 10^2"; ditto for an LTR context. I
    agree that if there is some ambiguity there that would be bad; I just
    don't see the ambiguity. If I've misunderstood something - not
    unlikely, as it all seems quite simple to me - I hope you can enlighten me.

    The only real problem I see is mixing RTL and LTR digits, but that would
    be easily handled, as it only affects typesetting.

    >
    >
    >>Then again, I
    >>really do not understand why anybody would think RTL languages are
    >>inherently bidi, so maybe there's no point
    >
    >
    > Well, first of all, nobody has claimed that the Arabic *language*
    > is inherently bidi. Nor has anybody claimed that the Arabic *script*
    > is inherently bidi. So try understanding what the people implementing
    > these systems *are* claiming.

    Er, page 42 of the Unicode Standard:

    "In Semitic scripts such as Hebrew and Arabic, characters are arranged
    from right to left into lines, although digits run the other way, making
    the scripts inherently bidirectional."

    Now, call me nutty, but it looks to me like the official Unicode
    position is that RTL "scripts are inherently bidirectional." There are
    similar passages elsewhere. If this does not reflect the actual
    semantics or intention of Unicode, by all means let's change this text.
    It is untrue (meaningless, actually) and harmful, insofar as it
    perpetuates a fundamental misunderstanding.

    >
    > Any functional information processing system concerned with
    > textual layout that is aimed at the Hebrew or Arabic language
    > markets *must* support bidirectional layout of text. That is
    > simply a fact.

    Oh come now. That is patently untrue. Or rather, it is a judgement
    about sales possibilities. In my opinion a word processor that did
    Arabic only, but did it very well, could do quite well. I don't think
    Unicode should be in the business of picking winners.

    I sure wish we had some evidence about what users actually want. Sure,
    pragmatically what you say may be true, but then what that really means
    is such software "*must* support Unicode". But that's because of
    Unicode's market clout, not because of its virtues *from the user
    perspective*. And that may be simply a fact for a big multinational.
    But what about the little company in Cairo that wants only to serve the
    Arabic market? Why should they have to worry about bidi? The point is
    that with a few additional codepoints life would be much easier for
    them, which would make life easier for the RTL community as a whole. It
    would be vastly easier to port open source software from e.g. English
    to RTL languages if we could dispense with the bidi requirement.

    One of the more harmful myths occasionally propagated about Arabic et
    al. is that users of RTL software use, need, or must have support for
    LTR latinate text. I have yet to see any evidence in
    support of this assertion.

    Ordinary users of RTL software have no need of bidi support. That
    requirement comes from Unicode and multinationals who want to localize
    generic software for the least money. It doesn't come from the users.
    Naturally, I only have personal experience as evidence. I am unaware of
    any scientifically valid survey of user needs in the RTL world. But I
    can tell you that, in my experience, lack of LTR latinate support would be no
    great loss. Of course there are niche markets where it is required,
    just as there are niche markets in the West that require RTL support.
    But the vast majority of documents in the Arab world get along just fine
    w/out latinate characters. Furthermore, they *want* to get along in
    Arabic only.

    Take a look at Arabic websites. Even those with international
    multilingual audiences use Arabic almost exclusively. For the content
    of articles, you virtually never see latin characters. Arabic gets
    along quite well with Arabic acronyms (like TCP/IP = تي سي بي/ أي بي
    (There it is again; trying to type that little bit of Arabic with parens
    that work defeated me. Ridiculous.) Take Al-Jazeera for example. I
    would estimate 99.99% of the site is in Arabic.

    >
    > Furthermore, to do so interoperably -- that is, with the hope
    > that Implementation A by Company X will lay out the same underlying
    > text as Implementation B by Company Y in the same order, so that
    > a human sees and reads it as the "same" text -- they depend on
    > a well-defined encoding of the characters and a well-defined
    > bidirectional layout algorithm.

    Not if they use only monodirectional characters. They only need a
    well-defined encoding, not bidi. That's the whole point. You simply do
    not need bidi to do Arabic, given a sufficient repertory of RTL
    characters. Sure, you have to have well-defined characters - glyph and
    typographic rules - but not bidi. And this isn't just theory. You can
    do Arabic just fine in Vim w/out bidi. And if you're a little nutty,
    you can do Arabic just fine in Emacs (I do it all the time) which lays
    out lines LTR, but words RTL. And you can run monodirectional Arabic
    (latin transliteration or not) through TeX or Omega and come out just
    fine. Conclusion: there is no need for bidi in order to support RTL
    languages. It is purely an artifact of legacy encoding, with no
    demonstrated need for it from the broad user community.
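
    To make "monodirectional layout" concrete, here's a toy sketch in
    Python - not how Vim, Emacs, or TeX actually do it, and ignoring
    Arabic joining/shaping and combining marks, which are separate issues
    from direction. If every character in the backing store is RTL,
    per-line layout is nothing more than a reversal plus right-alignment:
    no embedding levels, no directional runs, no mirroring.

    def display_rtl_line(logical, width):
        # Purely RTL text: visual order is just logical order reversed,
        # pushed to the right margin of a left-to-right display grid.
        return logical[::-1].rjust(width)

    # e.g. display_rtl_line(line_of_arabic, 80) for an 80-column terminal.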

    > One possible choice is consistent
    > visual ordering. One possible choice is consistent logical ordering
    > and an inherent bidirectional algorithm. The Unicode Standard
    > chose the latter, for a number of very good reasons. Trying
    > to mix the two is a quick road to hell.

    That's exactly my point. Mixing is the road to hell. Mixing = bidi.
    Non-mixing works for English, for which Unicode imposes no bidi
    requirement. Why are RTL languages/scripts singled out for this special
    treatment?

    Thanks for the response. It helps, although as I've noted, I don't see
    any insuperable problems, and certainly not havoc. Maybe we're actually
    talking about two different things. I have a sneaking suspicion that
    you and I may be working from different definitions of some of this stuff.

    Sincerely,

    -gregg


