Re: Is it true that Unicode is insufficient for Oriental languages?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat May 24 2003 - 05:13:25 EDT

    From: "Michael (michka) Kaplan" <michka@trigeminal.com>
    > From: "Kenneth Whistler" <kenw@sybase.com>
    > > Philippe Verdy continued:
    > > > I bet that such extended and more general 2D
    > > > composition/representation model will appear in a
    > > > future version of Unicode to avoid infinite growth of
    > > > the Unicode codepoints needed to represent text or
    > > > technical publications,
    > >
    > > I'll take that bet. Cash on the barrelhead?
    >
    > I'll take some of that action, too. Not since W.O. have we had someone
    > around who has been so insistent that Unicode is missing the requirements of
    > its users, without really understanding what The "Unicode way" is....

    That's not what I am asking for. I don't want to change the way Unicode or ISO 10646 allocates new code points. But some features have already been added to Unicode that begin to describe layout functions, because they are considered to carry semantics that are needed to keep Unicode-encoded text from losing important information.

    I'm NOT supporting other proposals about what Unicode "should" have been. I DO recognize that everything that has been done was needed, and was motivated by the goal of letting legacy systems interoperate through a common interchange system and a set of documented, compliant formats.

    Just look at some existing features that use format control characters to control the rendering of text or its interpretation: soft hyphens, "invisible" mathematical operators, ideographic description characters... Why wouldn't there be some general layout control characters whose effective rendering in applications would depend on the capabilities of the renderer? There is, I think, room to define such layout control characters, whose usage would go well beyond mathematical or physics formulas, without necessarily implying the definition of an XML-based markup system.
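
    As a small illustration of the existing characters listed above, the following sketch uses Python's standard unicodedata module to print their code points, general categories and names. The selection of examples and the helper name describe() are choices made for this illustration only.

```python
import unicodedata

# Characters of the kinds named above, all already encoded in Unicode.
EXAMPLES = {
    "soft hyphen": "\u00AD",
    "function application": "\u2061",
    "invisible times": "\u2062",
    "invisible separator": "\u2063",
    "ideographic description (left to right)": "\u2FF0",
}


def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, general category, character name) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))


for label, ch in EXAMPLES.items():
    print(label, *describe(ch))
```

    The soft hyphen and the invisible operators report the general category Cf (format), while the ideographic description character reports So (other symbol).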

    I don't think it is stupid to consider a more formal system for delimiting formulas or portions of text that are clearly not intended to represent linguistic text. There is an application of this for music too. It is quite difficult to find a proper way to encode musical scores, even though a score closely resembles a form of linear text that can be broken and rendered in lines, in a way that is much more readable for musicians than A, B, C#, Db notation. The recent addition of glyphs for Western musical symbols seems of little use for any kind of text rendering unless the tone is also encoded by adding some combining characters or sequences. (I don't think these symbols are really "characters", as they have no clear semantics when they appear in isolation, outside a special markup system or without many additional PUA characters, which are not interoperable.)

    You will object that we can always use a markup system. But then why did Unicode have to define and support invisible mathematical operators? It is said that converting from a markup system to plain text would lose semantics that applications need. If I accept this argument, I can say the same thing for all markup systems, because Unicode recognizes that markup systems are not universal, and that they can be interchanged more easily by using Unicode as a pivot encoding (or format...) between otherwise very different markup systems. For now, it seems that music data interchange is better served by the MIDI standard than by Unicode, but there are many applications (notably in publishing) where the MIDI format alone is not enough, notably because MIDI is poorly suited to carrying textual annotations, titles, or text synchronized with the music and written under each line of a score.

    A formal specification that explicitly encodes semantics and structure within the text data itself would certainly be a good tool. When I say that a more general 2D description system will certainly appear, I don't necessarily mean strict geometric relations, but the structural, semantic relations that exist between portions of text. I took the example of encoding a matrix: I certainly don't mean that the encoded text MUST be rendered as a 2D grid (that is only one option); it could still be rendered using parentheses if needed. What I really mean here is abstract characters carrying relational semantics, not particular glyphs or exact layouts.

    With such a matrix encoding (using some sort of "virtual parenthesis pairs"), it would be possible to encode not only mathematical matrices, but also crossword grids, chessboard configurations, and many other similar things, which could have good default renderings without necessarily requiring application-specific markup that strictly fixes the layout.
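
    A minimal sketch of how such "virtual parentheses" might behave, purely as an illustration. No such control characters exist in Unicode: the four delimiters below are hypothetical, and Private Use Area code points merely stand in for them. The function names are likewise invented for this example. The same encoded string is rendered either as a grid or, as a fallback, with ordinary parentheses.

```python
# Hypothetical structural controls. These are NOT real Unicode characters:
# Private Use Area code points stand in for them purely for illustration.
GROUP_START = "\uE000"
GROUP_END = "\uE001"
ROW_SEP = "\uE002"
CELL_SEP = "\uE003"


def encode(rows):
    """Serialize a list of rows (lists of strings) into one delimited string."""
    body = ROW_SEP.join(CELL_SEP.join(row) for row in rows)
    return GROUP_START + body + GROUP_END


def decode(text):
    """Recover the rows from a string produced by encode()."""
    if not (text.startswith(GROUP_START) and text.endswith(GROUP_END)):
        raise ValueError("not a delimited group")
    body = text[1:-1]
    return [row.split(CELL_SEP) for row in body.split(ROW_SEP)]


def render_grid(rows):
    """One possible rendering: a 2D grid with right-aligned cells."""
    width = max(len(cell) for row in rows for cell in row)
    return "\n".join(" ".join(cell.rjust(width) for cell in row) for row in rows)


def render_linear(rows):
    """Fallback rendering: nested visible parentheses on a single line."""
    return "(" + ", ".join("(" + ", ".join(row) + ")" for row in rows) + ")"


matrix = [["1", "0"], ["0", "1"]]
encoded = encode(matrix)
print(render_grid(decode(encoded)))
print(render_linear(decode(encoded)))
```

    The encoded string carries only the grouping structure (group, rows, cells). Whether it appears as a grid or in parentheses is left to the renderer, which is the distinction between relational semantics and exact layout.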

    There are so many applications that need this kind of explicit "grouping" of otherwise unbounded runs of text into semantic units that it merits some attention, without claiming that what I (or others) propose is incompatible with the Unicode vision of text as semantic data encoded with abstract characters.



    This archive was generated by hypermail 2.1.5 : Sat May 24 2003 - 06:03:24 EDT