Re: Arabic/Hebrew coding for the Mac

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Jul 06 2003 - 13:10:50 EDT

    On Sunday, July 06, 2003 4:33 PM, Theodore H. Smith <delete@elfdata.com> wrote:

    > Hi people,
    >
    > I'm having a real headache trying to code for Arabic/Hebrew text
    > editing for the Mac. I'm developing a generalised editfield, of course
    > I'd like it to edit Japanese too, but Japanese can run in the English
    > direction, so that makes it a lot simpler!
    >
    > I really am in need of some kind of forum or group with programmers
    > specialising in Arabic or Hebrew development on the Mac. I am running
    > into problems that seem very hard to solve, especially with the lack
    > of information and coverage on this.
    >
    > The problem may be conceptual, I am having a hard time visualising
    > ATSUI's (MacOS's unicode display/edit API) modelling sometimes. I'm
    > finding in fact visualising reverse direction text a headache.
    > Especially seeing as I haven't seen code for working with it like a
    > proper text editor should do, yet.
    >
    > Things like how does the paragraph selection work? This I find really
    > very awkward... Heres an example (pretend these numbers and letters
    > are Arabic, \r means return, ignore the spaces):
    >
    > In RAM: 123456789 \r abcdefg \r xyz
    >
    > Screen: 987654321 \r gfedcba \r zyx

    Actually on screen you see the following order:

    In RAM: 123456789 \r1 abcdefg \r2 xyz \0
    Screen: \r1 987654321 \r2 gfedcba \0 zyx

    (I added an index to each \r, and marked the end of the text with \0.)

    In reality, your editor would store the paragraphs in a list:
    In RAM: { "123456789 \r1", "abcdefg \r2", "xyz" }
    Screen: { "\r1 987654321", "\r2 gfedcba", "zyx" }
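
    A minimal Python sketch of this paragraph list, assuming the whole text is right-to-left so that each paragraph is simply shown in reverse. The names split_paragraphs and display_rtl are hypothetical, and the notational spaces and the \r indices used above are left out.

```python
def split_paragraphs(text):
    """Split text into paragraphs, keeping each line terminator with its paragraph."""
    # str.splitlines also splits on "\n" and a few other separators;
    # here the text only uses "\r".
    return text.splitlines(keepends=True)


def display_rtl(paragraph):
    """Screen order of a paragraph that holds only right-to-left characters."""
    return paragraph[::-1]


ram = "123456789\rabcdefg\rxyz"
paragraphs = split_paragraphs(ram)
screen = [display_rtl(p) for p in paragraphs]

print(paragraphs)  # ['123456789\r', 'abcdefg\r', 'xyz']
print(screen)      # ['\r987654321', '\rgfedcba', 'zyx']
```

    Keeping the terminator inside its paragraph means the \r moves to the visual start of the line when the paragraph is reversed, as in the listing above.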

    > So if I select the onscreen from 2 to \r, which looks perfectly in
    > order, then I am actually selecting a discontinuous range!

    It depends on which \r you select.
    - If it is \r1, the user really selects the fragment "\r1 98765432" from the first line.
    - If it is \r2, the user really selects the same fragment from the first line and the second line.
    So there is no apparent discontinuous selection.
    I suppose that you are working with Arabic-Indic digits.

    > But that
    > doesn't seem right! There should be no need to, right? Is an arabic
    > UTF document really backwards? So the first character in an arabic UTF
    > document is actually the last onscreen?

    Yes, if the document is encoded in Unicode. However, there also exist some legacy encodings for Arabic and Hebrew that let you encode the text in "visual" order instead of logical order. This "visual" order is a misnomer, as it just relates to your common reading direction.

    The really complex cases don't occur with text that uses only Hebrew or Arabic, but with text that mixes both directions, notably with digits, when the user enters a number using European digits. In this case you have this representation:

    In RAM: { "123456789 \r1", "abcdefg \r2", "xyz" }
    Screen: { "123456789 \r1", "\r2 gfedcba", "zyx" }

    (note that the digits are not reversed on screen, and the final \r inherits the direction of the previous European digits.)
    But the selection direction is not complex with your example.
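
    To see why digits and letters behave differently, you can look up the bidirectional class that Unicode assigns to each character. The sketch below uses Python's standard unicodedata module; the wrapper name bidi_class and the sample characters are my own choices for illustration.

```python
import unicodedata


def bidi_class(ch):
    """Return the Unicode bidirectional class of a single character."""
    return unicodedata.bidirectional(ch)


samples = {
    "European digit one": "1",
    "Arabic-Indic digit one": "\u0661",
    "Latin letter A": "A",
    "Hebrew letter alef": "\u05d0",
    "Arabic letter alef": "\u0627",
    "space": " ",
    "carriage return": "\r",
}

for name, ch in samples.items():
    print(f"{name:24} U+{ord(ch):04X} {bidi_class(ch)}")
```

    European digits are class EN and Arabic-Indic digits are class AN, while Hebrew and Arabic letters are R and AL. Spaces (WS) and the carriage return (B) carry no direction of their own and take one from their context.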

    Complex examples occur when both directions are used in the same paragraph (here uppercase letters stand for left-to-right text, for example Latin letters or European digits, and lowercase letters stand for right-to-left Hebrew or Arabic):

    In RAM: { "abc ABCDEFG defg \r", "xyz" }
    Screen: { "\r gfed ABCDEFG cba", "zyx" }

    Note that your \r end-of-line marker always inherits the directionality of the previous encoded character in the Unicode string.
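
    The reordering in this example can be reproduced with a toy model. This is only a sketch under stated assumptions, not the full Unicode Bidirectional Algorithm: uppercase letters and digits stand in for left-to-right text, lowercase letters for right-to-left text, the paragraph is right-to-left by default, a space takes the direction of its neighbours when they agree and the paragraph direction otherwise, and \r copies the direction of the character before it. The function names are hypothetical.

```python
def toy_levels(text, base_rtl=True):
    """Assign a toy embedding level to each character of one paragraph."""
    base = "R" if base_rtl else "L"
    kinds = []
    for ch in text:
        if ch.isupper() or ch.isdigit():
            kinds.append("L")   # stands in for Latin letters or European digits
        elif ch.islower():
            kinds.append("R")   # stands in for Hebrew or Arabic letters
        else:
            kinds.append(None)  # neutral: space, "\r", punctuation
    resolved = []
    for i, kind in enumerate(kinds):
        if kind is None and text[i] == "\r":
            # Convention from the text: "\r" inherits from the previous character.
            kind = resolved[-1] if resolved else base
        elif kind is None:
            before = next((k for k in reversed(kinds[:i]) if k), base)
            after = next((k for k in kinds[i + 1:] if k), base)
            kind = before if before == after else base
        resolved.append(kind)
    if base_rtl:
        return [1 if k == "R" else 2 for k in resolved]
    return [0 if k == "L" else 1 for k in resolved]


def visual_order(levels):
    """Return logical indices in left-to-right screen order."""
    order = list(range(len(levels)))
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < len(order):
            j = i
            while j < len(order) and levels[order[j]] >= level:
                j += 1
            if j > i:
                order[i:j] = order[i:j][::-1]  # reverse one run at this level
                i = j
            else:
                i += 1
    return order


def display(text, base_rtl=True):
    """Return the characters of text in screen order."""
    return "".join(text[i] for i in visual_order(toy_levels(text, base_rtl)))


print(repr(display("abc ABCDEFG defg\r")))  # '\rgfed ABCDEFG cba'
print(repr(display("123456789\r")))         # '123456789\r'
```

    The notational space before \r is omitted in the inputs. A real editor would take the levels from its text layout engine rather than from letter case.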

    The complex selection case occurs only when the start and end of the selection are not positioned after characters with the same directionality, for example when selecting from b|c to E|F (where | marks the selection position between two characters).
    Such a selection seems "non-natural", but it must be performed according to the directionality of each line, which is indicated by its first character (if the line is not empty), or else by the directionality inherited from the previous line, or by a default direction for the whole edited text if this is the first line of text.
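
    A small sketch of that fallback chain in Python. One assumption of mine: instead of literally the first character, it looks for the first strongly directional character (bidirectional class L, R or AL), so that leading digits or spaces do not decide the direction. The function name line_directions is hypothetical.

```python
import unicodedata

# Strong bidirectional classes and the line direction they imply.
STRONG = {"L": "ltr", "R": "rtl", "AL": "rtl"}


def line_directions(lines, default="ltr"):
    """Return 'ltr' or 'rtl' for each line of the edited text."""
    directions = []
    current = default  # used when the first line has no strong character
    for line in lines:
        for ch in line:
            cls = unicodedata.bidirectional(ch)
            if cls in STRONG:
                current = STRONG[cls]
                break
        # With no strong character, the line keeps the previous direction.
        directions.append(current)
    return directions


lines = ["\u05e9\u05dc\u05d5\u05dd abc", "", "123", "Hello \u05d0", "\u0645"]
print(line_directions(lines))  # ['rtl', 'rtl', 'rtl', 'ltr', 'rtl']
```

    The empty line and the digits-only line both inherit the direction of the Hebrew line before them.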

    This problem occurs very frequently with European digits and Latin names inserted within a Hebrew or Arabic text. Generally users are given a choice to specify which selection is most natural:

    - The visual selection mode selects a contiguous area on screen but a discontinuous one in RAM, and is generally used for mouse selections. It will select {"FG c"} on screen, which corresponds to a discontinuous area in RAM with at most 2 segments (embedded segments within a larger segment of the same direction can be safely ignored): {"c"+"FG"}

    - The logical selection mode is normally the default for keyboard selection and follows the normal caret moves in logical order, so the selection is contiguous in RAM but not on screen. This mode is generally the most "user-friendly": it will select {"c ABCDE"} in RAM, which appears as two segments on screen {"ABCDE" +"c"} that the user reads in logical order as {"c ABCDE"}, which is probably what the user really wants to select. Here too there can be at most two segments.

    - Your program may offer preferences that let the user choose the mode used for mouse or keyboard selections.
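
    Both modes come down to translating between screen positions and RAM positions. Here is a rough Python sketch of that translation for the example line, using a deliberately simplified reordering of my own: uppercase letters and digits are left-to-right, and everything else, including spaces, follows the right-to-left paragraph. All function names are hypothetical.

```python
def toy_visual_order(text):
    """Logical indices in screen order for one right-to-left paragraph."""
    ltr = [ch.isupper() or ch.isdigit() for ch in text]
    order = list(range(len(text)))
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and ltr[j]:
            j += 1
        if j > i:
            order[i:j] = order[i:j][::-1]  # pre-reverse each left-to-right run
            i = j
        else:
            i += 1
    return order[::-1]                     # then reverse the whole line


def group_runs(indices):
    """Group integers into sorted half-open (start, end) ranges."""
    runs = []
    for i in sorted(indices):
        if runs and runs[-1][1] == i:
            runs[-1] = (runs[-1][0], i + 1)
        else:
            runs.append((i, i + 1))
    return runs


def visual_to_logical(order, vstart, vend):
    """Contiguous screen selection [vstart, vend) to ranges in RAM."""
    return group_runs(order[vstart:vend])


def logical_to_visual(order, lstart, lend):
    """Contiguous RAM selection [lstart, lend) to segments on screen."""
    return group_runs(v for v, l in enumerate(order) if lstart <= l < lend)


text = "abc ABCDEFG defg"
order = toy_visual_order(text)
screen = "".join(text[i] for i in order)
print(screen)                                   # gfed ABCDEFG cba

# Visual (mouse) selection of "FG c" on screen:
ram_ranges = visual_to_logical(order, 10, 14)
print([text[a:b] for a, b in ram_ranges])       # ['c ', 'FG']

# Logical (keyboard) selection from b|c to E|F in RAM:
screen_ranges = logical_to_visual(order, 2, 9)
print([screen[a:b] for a, b in screen_ranges])  # ['ABCDE', ' c']
```

    Each mode yields two segments on the other side here. The space between "c" and "A" is counted with "c" because this sketch gives spaces the paragraph direction.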

    This selection mode has to take into account some "contextual" characters, notably at the beginning and end of the selection, as the selection may fall between two segments with distinct directions (for example when selecting several words between spaces: depending on whether these spaces are included or excluded, they would normally inherit the direction of the previous logical character in the Unicode string)...

    > What really is going on here? What should be going on? Is my example
    > wrong? I really can't do any coding on reverse direction text till I
    > work out how these paragraphs are meant to be working. Its all really
    > rather twisted to me.

    You're not wrong. Text selection is just more complex in BiDi text than in unidirectional text. But if your text does not mix scripts, or only imports characters and symbols that can be neutral (such as mirrored characters), there's never any difficulty when working with Hebrew or Arabic text. The one exception is that the text should preferably be aligned on the right margin rather than the left, but in a text editor this is generally determined by the direction of the first character in your text, unless overridden with a BiDi control character or a forced editor mode for the margin.

    > I do appreciate the problems here, in fact I made a theoretical
    > exploration and found that if we ordered in visual order instead of
    > logical, we'd just get a different set of headaches, not really less.
    > Such is computing for real-world problems!

    Neither logical nor visual order is less complex: they both have the same "problems" (I'd prefer to speak of complexities, related only to the use of mixed scripts).

    > Reply directly to me if you can please? At delete@elfdata.com

    -- Philippe.



    This archive was generated by hypermail 2.1.5 : Sun Jul 06 2003 - 13:51:38 EDT