Re: Arabic letters separated by markup

From: fantasai (fantasai.lists@inkedblade.net)
Date: Sat Jun 11 2005 - 11:27:31 CDT

    Andreas Prilop wrote on the Unicode mailing list[1]:
    > Does the Unicode standard only deal with plain text or
    > does it also deal with text in markup languages like SGML/HTML?
    >
    > I wonder whether Arabic letters should join when they are
    > separated by markup. Here's an example:
    >
    > http://www.unics.uni-hannover.de/nhtcapri/temp/nastaliq.html
    >
    > Current programs display the letters separated by markup
    > differently: Internet Explorer 6 and StarOffice 7 join the
    > letters, but Mozilla 1.7 does not.
    >
    > Is it left to the rules of SGML/HTML to decide or
    > has the Unicode standard any opinion about this?

    In semantic markup languages like HTML, this is really the domain of
    the formatting system used to process the markup, not of the markup
    language itself. [2] So, for web pages, this behavior would be governed by the
    Unicode and CSS specs. I haven't read the Unicode book cover to cover,
    but since there's an argument here, I'm guessing it's not covered by
    Unicode quite yet. :)

    Like many other people here, I think that the goal should be to make
    the text as readable as possible, even if it means ignoring some of
    the styling.

    Therefore, these are the rules I suggest:

      For characters within the same inline sequence:

       1. Shaping and joining behavior MUST NOT be affected by element
          boundaries.
       2. Ligatures, including obligatory ligatures, MUST be broken if
          the formatting rules introduce extra space between the affected
          characters (e.g. by putting a border and margin around one of
          the characters).
       3. Optional ligatures SHOULD be broken if the formatting rules
          cannot otherwise be accommodated.
       4. Obligatory ligatures MUST NOT be broken if the formatting rules
          introduce no extra space between the affected characters, even
          if this means some of the characters are rendered in the wrong
          font or as part of the wrong visual element.
       5. Combining characters MUST be rendered as the combined grapheme
          cluster if the system is capable of rendering the combination,
          even if this means some of the characters are rendered in the
          wrong font or as part of the wrong visual element. The combined
          grapheme cluster SHOULD be rendered as part of the base
          character's element, or, in the case of conjoining jamos, the
          initial character's element.
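
    To make rule 1 concrete, here is a minimal sketch in Python. It is
    not how any particular browser works. It assumes a toy joining-type
    table covering only a few letters, and the names joining_forms and
    shape_inline_sequence are made up for illustration. The idea is to
    work out the contextual forms over the whole inline sequence first,
    and only then hand each character back to the element it came from.

```python
# Toy joining-type table (assumption: covers only a few letters; a real
# shaper would take joining types from the Unicode Character Database
# and skip over transparent characters such as combining marks).
DUAL_JOINING = set("\u0628\u062A\u064A\u0639\u0644\u0645")  # BEH TEH YEH AIN LAM MEEM
RIGHT_JOINING = set("\u0627\u062F\u0631\u0648")             # ALEF DAL REH WAW


def joining_forms(text):
    """Return the contextual form of each character, in logical order."""
    forms = []
    for i, ch in enumerate(text):
        prev_ch = text[i - 1] if i > 0 else None
        next_ch = text[i + 1] if i + 1 < len(text) else None
        # A letter connects to the preceding one only if that one can
        # join forward (dual-joining) and this one can join backward.
        joins_prev = (
            (ch in DUAL_JOINING or ch in RIGHT_JOINING)
            and prev_ch in DUAL_JOINING
        )
        # It connects to the following one only if it can join forward
        # and the following letter can join backward.
        joins_next = ch in DUAL_JOINING and (
            next_ch in DUAL_JOINING or next_ch in RIGHT_JOINING
        )
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


def shape_inline_sequence(runs):
    """Shape across element boundaries (rule 1).

    `runs` is a list of (element_name, text) pairs that belong to the
    same inline sequence. Forms are computed on the concatenated text,
    then split back so each element keeps its own characters.
    """
    text = "".join(run_text for _, run_text in runs)
    forms = joining_forms(text)
    shaped, pos = [], 0
    for element, run_text in runs:
        shaped.append((element, forms[pos:pos + len(run_text)]))
        pos += len(run_text)
    return shaped


# BEH in one element, YEH + TEH in the next, as in <span>B</span><span>YT</span>.
runs = [("span1", "\u0628"), ("span2", "\u064A\u062A")]
across_elements = shape_inline_sequence(runs)
per_element = [(name, joining_forms(run_text)) for name, run_text in runs]
```

    With these inputs, across_elements gives the BEH its initial form
    and the YEH its medial form, so the word stays joined. per_element,
    which shapes each element in isolation, gives an isolated BEH
    followed by an initial YEH, which is the broken rendering described
    above for Mozilla 1.7.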

    I'm quite certain of #1, but as I don't have extensive background
    in this stuff, I am not so certain of the others. Comments are
    appreciated. I can ask the CSS Working Group to consider adding a
    recommendation to the next revision of CSS2.1 if there seems to
    be a consensus around a particular set of rules, and/or to refer
    to relevant parts of the Unicode standard.

    ~fantasai

    [1] http://www.unicode.org/mail-arch/unicode-ml/y2005-m06/0110.html
         username: unicode-ml ; pass: unicode

    [2] CSS determines whether an element visually behaves as a
         block or an inline or a table cell. Given the CSS rule
           * { display: inline; }
         both
           <div>ARA</div><div>BIC</div>
         and
           <span>ARA</span><span>BIC</span>
         would result in the exact same rendering.


