Re: Is it true that Unicode is insufficient for Oriental languages?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri May 23 2003 - 07:35:50 EDT


    From: "Kenneth Whistler" <kenw@sybase.com>
    > > From: "Kenneth Whistler" <kenw@sybase.com>

    > > I hope Unicode will not need to redefine styled variants for
    > > ALL existing letters in defined alphabets or abjads of the BMP...
    >
    > Why stop there? I am *sure* that someday the mathematicians
    > will need Bold Fraktur Linear B Ideograms for their formulae.
    > Drat, it seems we made a big mistake in not encoding a
    > BOLD FRAKTUR STYLE COMBINING MARK..... NOT!

    I won't request such changes. But the need will certainly come later, simply because Unicode will spread the use of other scripts into many applications, including maths. Non-Latin and non-Greek scripts also have a long typographic history with various "styles" (call them "fonts" if you wish, but that is a computer-related term; the distinction it implies was added late in a long history of typography in which each publisher or writer invented their own "look" for characters, much like an artistic creation).

    Just look at some scripts (and at the fact that the very notion of Unicode "representative glyphs" for abstract characters reveals that most scripts have many glyph variants, including coherent sets that one may really consider a plain style usable for general purposes). All scripts have varied considerably throughout their history, and some "old" forms are being rediscovered for modern use after a short period in which scripts were simplified only because of the technical constraints (and costs) of reproducing them, which are no longer a problem with computer font technologies.

    This is particularly true of the beautiful Arabic and Brahmic typography, and of Latin typography from the Middle Ages, when books were copied by hand (before Gutenberg's invention required simplifying glyph designs to ease the mechanical reproduction of text with metal type).

    The use of scripts in mathematics does not reflect the semantics of standard text; it just treats the glyphs of a script as a convenient way to represent abstract, distinct symbols that can easily be "read" and reproduced. Some old script designs are simple to reproduce and recognize (Runic is a good example, but the Hebrew abjad as alphabetized in Yiddish, or astronomical symbols, are other good candidates). I'm sure that the development of Unicode will facilitate their future use in mathematics (and probably physics too) as abstract symbols designating families of related variables, or as new operators or symbolic constants. The past technical limitation to a small subset of characters and a few "styles" has helped make mathematical formulas complex to represent, or hard to decipher without long lists of definitions describing their use or semantics.
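
    To make the "script letters as abstract symbols" point concrete, here is a small sketch using Python's standard unicodedata module. It looks at two characters that already exist for mathematical use, U+2135 ALEF SYMBOL and U+1D56C MATHEMATICAL BOLD FRAKTUR CAPITAL A, and shows that compatibility normalization (NFKC) folds each back to the ordinary letter it was borrowed from. The helper name plain_form is my own choice for illustration.

```python
import unicodedata

# U+2135 ALEF SYMBOL: the Hebrew letter alef, encoded separately for math use.
ALEF_SYMBOL = "\u2135"
# U+1D56C MATHEMATICAL BOLD FRAKTUR CAPITAL A: a styled mathematical letter.
BOLD_FRAKTUR_A = "\U0001D56C"


def plain_form(ch: str) -> str:
    """Fold a styled math symbol to its compatibility (NFKC) equivalent.

    plain_form is an illustrative helper name, not part of any library.
    """
    return unicodedata.normalize("NFKC", ch)


if __name__ == "__main__":
    for ch in (ALEF_SYMBOL, BOLD_FRAKTUR_A):
        folded = plain_form(ch)
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
              f"U+{ord(folded):04X} {unicodedata.name(folded)}")
```

    The symbol and the letter stay distinct code points in plain text; only a compatibility folding equates them, which is roughly the "abstract symbol versus text letter" distinction I describe above.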

    Even today, most websites (including Unicode.org) cannot represent mathematical formulas correctly without using images: representing a formula as text makes it difficult to read and does not show its structure well without excessive levels of visible parentheses.

    There is still no encoded "invisible parenthesis" that could group the related entities to which an operator character applies, and still no model for a two-dimensional (non-linear) representation of text. The one exception is the "Han Ideographic Description Characters", which were introduced recently but are still mostly used only to encode and reproduce text containing currently unencoded Han ideographs, to support decomposition algorithms for custom collation or dictionary search, or to optimize a Han layout engine so that it can use smaller but more general fonts built from a small set of component strokes instead of a very large set of composite glyphs.
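
    As an illustration of how the Ideographic Description Characters already describe a 2D layout in linear text, here is a minimal sketch of a parser for such sequences. It assumes only the twelve operators U+2FF0..U+2FFB, of which U+2FF2 and U+2FF3 take three operands and the others two. The function names and the two sample sequences are my own illustrative choices, and the sketch does no validation beyond checking that the sequence is complete.

```python
# Arity of the Ideographic Description Characters U+2FF0..U+2FFB:
# U+2FF2 and U+2FF3 take three operands, all the others take two.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3
IDC_ARITY["\u2FF3"] = 3


def parse_ids(seq, pos=0):
    """Parse a prefix-notation description sequence starting at pos.

    Returns (tree, next_pos). A leaf is a single character; an inner
    node is a tuple (operator, child, child[, child]).
    """
    if pos >= len(seq):
        raise ValueError("truncated description sequence")
    ch = seq[pos]
    pos += 1
    arity = IDC_ARITY.get(ch)
    if arity is None:
        return ch, pos  # leaf: an ordinary ideograph or component
    children = []
    for _ in range(arity):
        child, pos = parse_ids(seq, pos)
        children.append(child)
    return (ch, *children), pos


def parse_full(seq):
    """Parse a whole sequence and reject trailing characters."""
    tree, end = parse_ids(seq)
    if end != len(seq):
        raise ValueError("unexpected characters after the sequence")
    return tree


if __name__ == "__main__":
    # Left-to-right: U+6C35 beside U+9752 (illustrative sample).
    print(parse_full("\u2FF0\u6C35\u9752"))
    # Above-to-below, with a nested left-to-right group (illustrative sample).
    print(parse_full("\u2FF1\u8279\u2FF0\u65E5\u6708"))
```

    The point is that the prefix operators carry the grouping themselves, so no parentheses, visible or invisible, are needed to recover the nested structure.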

    With standard parentheses, the layout engine has no choice but to render formulas with those parentheses, even where a two-dimensional layout could safely be used. With explicit "invisible parenthesis" characters, a layout engine capable of 2D rendering could choose a better presentation, while an engine without that capability could still fall back to standard parenthesis glyphs. This would not require a different encoding for each type of text renderer, and would not require an upper-level markup system (which brings its own syntax and escaping mechanisms, notably for "<" in XML and MathML).
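
    A rough sketch of the two behaviours described above. The two "invisible parenthesis" code points are HYPOTHETICAL: no such characters exist in Unicode, so the sketch borrows two Private Use Area code points purely for illustration. render_linear stands for the engine that cannot do 2D layout, and parse_groups for the structure a 2D-capable engine could extract; both names are my own.

```python
# HYPOTHETICAL characters: Unicode defines no invisible parentheses.
# Two Private Use Area code points stand in for them in this sketch.
INVISIBLE_OPEN = "\uE000"
INVISIBLE_CLOSE = "\uE001"


def render_linear(text):
    """Fallback for an engine without 2D layout: show ordinary parentheses."""
    return text.replace(INVISIBLE_OPEN, "(").replace(INVISIBLE_CLOSE, ")")


def parse_groups(text):
    """Return the nested grouping that a 2D-capable engine could lay out.

    Each invisible pair becomes a nested list; other characters are leaves.
    """
    root = []
    stack = [root]
    for ch in text:
        if ch == INVISIBLE_OPEN:
            group = []
            stack[-1].append(group)
            stack.append(group)
        elif ch == INVISIBLE_CLOSE:
            if len(stack) == 1:
                raise ValueError("unbalanced invisible close")
            stack.pop()
        else:
            stack[-1].append(ch)
    if len(stack) != 1:
        raise ValueError("unbalanced invisible open")
    return root


if __name__ == "__main__":
    formula = "a/" + INVISIBLE_OPEN + "b+c" + INVISIBLE_CLOSE
    print(render_linear(formula))  # a/(b+c)
    print(parse_groups(formula))   # ['a', '/', ['b', '+', 'c']]
```

    The same encoded text serves both engines: the 2D-capable one could draw b+c under a fraction bar with no parentheses at all, and the other simply makes the grouping visible.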

    For now the only 2D specification concerns the composition model for diacritics (with sorted combining classes). The Korean Hangul composition system has not really been encoded as such: instead, all precomposed modern syllables were added, and also precomposed clusters of Choseong/Jungseong/Jongseong Jamos (each treated as a single Jamo character, even though they are really composite graphically in their representative glyph, vocally in their phonetics, and even in their canonical name). This is just a simplification that avoided specifying a more general 2D composition and representation system.
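
    For reference, the relation between a precomposed modern syllable and its conjoining Jamos is pure arithmetic on code points. The sketch below follows the constants of the standard's Hangul syllable decomposition (19 leading consonants, 21 vowels, 28 trailing positions including "none") and checks itself against NFD from Python's unicodedata module. The function name is my own.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed modern syllables


def decompose_syllable(syllable):
    """Split one precomposed Hangul syllable into its conjoining Jamos."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    leading = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    # t_index 0 means the syllable has no trailing consonant.
    trailing = chr(T_BASE + t_index) if t_index else ""
    return leading + vowel + trailing


if __name__ == "__main__":
    for syllable in ("\uAC00", "\uD55C"):
        jamos = decompose_syllable(syllable)
        # The arithmetic result should agree with canonical decomposition.
        assert jamos == unicodedata.normalize("NFD", syllable)
        print(f"U+{ord(syllable):04X} -> "
              + " ".join(f"U+{ord(j):04X}" for j in jamos))
```

    Note that this only splits a syllable into leading consonant, vowel and trailing consonant. It does not break a compound Jamo into its simpler components, which is the further step I am talking about.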

    I bet that such an extended and more general 2D composition/representation model will appear in a future version of Unicode, to avoid unbounded growth in the code points needed to represent text or technical publications. I would even bet on a normative additional database of "canonical" equivalences (please don't warn me about it, I don't mean NFC/NFD here!) between the existing precomposed set and a simplified set based on 2D composition rules.



    This archive was generated by hypermail 2.1.5 : Fri May 23 2003 - 08:31:17 EDT