Re: minimizing size (was Re: allocation of Georgian letters)

From: Sinnathurai Srivas (sisrivas@blueyonder.co.uk)
Date: Tue Feb 12 2008 - 06:20:18 CST


    Canonical forms need to be evaluated on their merits.
    Canonical forms need not be evaluated as an alternative to duplicate
    encoding.

    1/ The real fixed-width requirements for Tamil are vastly different from
    those of any other Indic system.
    The rendered widths are extremely large and cannot be accommodated as
    things stand. For this reason a canonical form is essential. (A canonical
    solution for this issue is not at all relevant to TUNE or TANE.) Even TANE
    and TUNE would need to be modified for this purpose.

    2/ There are canonical forms of another kind (similar to TANE/TUNE) that can
    help us resolve the publishing issues presently encountered. This needs
    at least a recognised temporary solution (not the PUA), as publishing is
    the number one area where Tamil computing needs support from Unicode.

    3/ Forward compatibility (with and without CTL) would be needed, as this
    will become the de facto Tamil in the long run.

    Hence, canonical forms are a requirement, one that coincidentally matches
    some of the TANE/TUNE model.
    The canonical requirement is not to be taken as duplicate encoding.
    There is already a canonical form defined for Tamil (the au-marker); we need
    to expand on this.
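
For readers unfamiliar with the existing canonical form mentioned above, here is a minimal sketch using Python's standard `unicodedata` module. It shows the canonical decompositions that the Unicode Character Database already defines for three Tamil vowel signs, including the AU sign. The helper name `decompose` is hypothetical and only for illustration; this is not a proposal for how the expanded canonical forms discussed in this thread would look.

```python
import unicodedata

# Tamil vowel signs that carry a canonical decomposition in the
# Unicode Character Database.
COMPOSED = {
    "\u0bca": "TAMIL VOWEL SIGN O",
    "\u0bcb": "TAMIL VOWEL SIGN OO",
    "\u0bcc": "TAMIL VOWEL SIGN AU",
}


def decompose(ch):
    """Return the NFD (canonical decomposition) of ch as U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]


if __name__ == "__main__":
    for ch, name in COMPOSED.items():
        print(f"U+{ord(ch):04X} {name} -> {' '.join(decompose(ch))}")
```

Both spellings are canonically equivalent, so conforming software is expected to treat the one-code-point and two-code-point sequences as the same text.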

    Sinnathurai

    >>>>
    Doug Ewell * Fullerton, California, USA Wrote
    <snip>
    .... "The Unicode standard follows the ISCII (Indian Standard Code for
    Information Interchange) code standard in treating all nine of the
    official Indian scripts (Devanagari, Bengali, Gurmukhi, Gujarati, Oriya,
    Tamil, Telugu, Kannada, and Malayalam) in a parallel way."

    and

    "The graphemic syllable is built up of alphabetic pieces, the actual
    letters of the Devanagari script. These consist of three major types:
    consonants, dependent vowels, and independent vowels."

    This is the reason why Tamil is encoded in Unicode the way it is.
    Whether or not anyone agrees that it should have been encoded that way
    is a different matter.

    <snip>
    ...
    Read the ISO "Principles and Procedures" document at
    http://www.dkuug.dk/JTC1/SC2/WG2/docs/n3102.pdf to see why duplicate
    encodings are no longer allowed. Reinventing TUNE as a question of
    "canonical forms" and "non-canonical forms" doesn't change this. If you
    want software to work with a different Tamil model, use the PUA.

    ----- Original Message -----
    From: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>
    To: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>; "Unicode Discussion"
    <unicode@unicode.org>
    Sent: 09 February 2008 18:21
    Subject: Re: minimizing size (was Re: allocation of Georgian letters)

    >>>>
    > The current Tamil Unicode need not be deprecated. The current Tamil
    > Unicode can become the true Tamil Unicode, without CTL, yes the current
    > without CTL in say 100 years time, as the current is in sync with ancient
    > scientific Grammar (without CTL). This is what I meant by FORWARD
    > COMPATIBLE, not even backward compatible.
    >
    > What we need now is making canonical forms, making canonical forms as
    > secondary while keeping Current as primary. ie, both will work in harmony.
    > Canonical forms will make it possible for contemporary Tamil to work
    > without CTL.
    >
    > Sinnathurai
    >
    >
    >>>
    > From: "Bala" <bala@cse.mrt.ac.lk> Wrote
    >
    > However Unicode were very clear in the Chennai meeting that dual encoding
    > is not possible and present encoding cannot be deprecated as well.
    >
    > Sinnathutai
    >
    > ----- Original Message -----
    > From: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>
    > To: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>; "Unicode Discussion"
    > <unicode@unicode.org>
    > Sent: 09 February 2008 16:09
    > Subject: Re: minimizing size (was Re: allocation of Georgian letters)
    >
    >
    >>>>>>>
    >> A backward compatible solution would include allowing the current (CTL)
    >> encoding to work while canonical forms would allow non-CTL encoding.
    >> Both can work seamlessly.
    >>
    >> Tamil already has a canonical form (though wrongly) defined already. We
    >> need to expand on providing canonical forms in a logical manner.
    >>
    >> Forward Compatible:
    >> One more thing: the criteria used for making Devanagari or Bengali CTL
    >> encodings may or may not be correct. However, the criteria
    >> used to make Tamil a CTL script are not correct. Of course backward
    >> compatibility (even forward compatibility can be considered) must be
    >> maintained. Is a non-CTL solution that is made forward-compatible with
    >> CTL a possibility? Can I expect a solution for rigidly fixed-width
    >> requirements using canonical forms? Can I expect an IMMEDIATE solution to
    >> use Unicode in publishing, utilising canonical forms? Could the existing
    >> encoding stay primary, while canonical forms assist in special
    >> circumstances?
    >>
    >> Sinnathurai
    >>
    >>>>>>
    >> On Feb 8, 2008 4:12 PM, Sinnathurai Srivas <sisrivas@blueyonder.co.uk>
    >> wrote:
    >>> Again what is the criteria for stopping Tamil using workable solution
    >>> and
    >>> what is the criteria for enforcing non-working solution?
    >>
    >> Unicode will change its encoding of Tamil in a non-backward compatible
    >> way when hell freezes over. This system may be suboptimal, but it is
    >> the same system as used for Devanagari and Bengali, and does work, if
    >> not as well or in as many systems as you may hope.
    >>
    >>
    >>
    >> ----- Original Message -----
    >> From: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>
    >> To: "Unicode Discussion" <unicode@unicode.org>
    >> Sent: 08 February 2008 22:02
    >> Subject: Re: minimizing size (was Re: allocation of Georgian letters)
    >>
    >>
    >>>
    >> John H. Jenkins wrote in a mail,
    >> "
    >> Even if Unicode had used an encoding model for South Asian scripts
    >> that didn't require complex rendering, the current problem would exist
    >> because then text would display correctly but, for example, databases
    >> would have to be substantially rewritten to convert the glyph stream
    >> back into a series of letters for the operations that they typically
    >> support.
    >> "
    >>>>
    >> John, could you expand on the above,
    >> (Additionally, please include what effect canonical forms would have on
    >> databases.)
    >> My initial thinking is non-CTL Tamil would work in databases without
    >> additional interventions.
    >>
    >> Sinnathurai
    >>
    >>>>>
    >>
    >> I'm not sufficiently familiar with Tamil to comment intelligently on
    >> it. My responses were aimed at the more general issue of why complex
    >> scripts are required and to try to clarify the reasons why some
    >> scripts require complex rendering and others don't. If your questions
    >> have to do with Tamil specifically, it would be better for them to be
    >> answered by someone more familiar with the script.
    >>
    >> =====
    >> John H. Jenkins
    >> jenkins@apple.com
    >> ----- Original Message -----
    >> From: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>
    >> To: "Unicode Discussion" <unicode@unicode.org>
    >> Sent: 08 February 2008 21:12
    >> Subject: Re: minimizing size (was Re: allocation of Georgian letters)
    >>
    >>
    >> 1/
    >> My question was what is the criteria used to class a language as
    >> a/ That requires complex rendering
    >> b/ That does not require complex rendering.
    >>
    >> Tamil need not be a CTL script. It can work 100% and work better than
    >> CTL-enabled Tamil. Why then is Tamil classed as a CTL script? What are the
    >> criteria?
    >>
    >> For example Tamil could easily be implemented without the need for any
    >> complex rendering
    >> However, Tamil is currently implemented using complex rendering.
    >> This was one of the main discussions and I have not seen a viable answer
    >> that categorically states that, for such and such TECHNICAL reasons, Tamil
    >> was made a script that requires complex rendering.
    >>
    >> as for fixedwidth,
    >>
    >> For example, in Tamil, the one-cell fixed-width font is currently achieved
    >> using 8-bit encoding. It cannot be obtained using Unicode as it stands.
    >> Introducing canonical forms, though, can resolve this by enabling a single
    >> width for all necessary glyphs.
    >> Besides terminal emulators, there are many electronic devices, such as
    >> set-top boxes, that use RIGID fixed width. I do not know how CJK handles
    >> set-top boxes, etc. Anyway, why move into unnecessary complexities, when
    >> Tamil can work perfectly well as a non-CTL script or alternatively by
    >> defining canonical forms to do away with complex rendering requirements!!
    >>
    >> I think rigid fixed width for Tamil in its rendered form is NOT
    >> ACHIEVABLE.
    >>
    >> As for publishing, attempts to use Unicode Tamil fail. If it is
    >> achievable,
    >> when will it be ready?
    >>
    >> Again what is the criteria for stopping Tamil using workable solution and
    >> what is the criteria for enforcing non-working solution?
    >>
    >> I think we can at least move fast: if we introduce all necessary canonical
    >> forms now, most of the publishing s/w may work with canonical forms.
    >>
    >> Sinnathurai
    >>
    >>
    >> ----- Original Message -----
    >> From: "Ed Trager" <ed.trager@gmail.com>
    >> To: "Unicode Discussion" <unicode@unicode.org>
    >> Sent: 08 February 2008 20:02
    >> Subject: Re: minimizing size (was Re: allocation of Georgian letters)
    >>
    >>
    >> Hi, everyone,
    >>
    >> Just a few brief comments on this thread:
    >>
    >>>
    >>> Having flown halfway around the world to talk to people who for whatever
    >>> reasons, both valid and invalid (and not really distinguishing which is
    >>> which on their list of concerns), are unhappy with a language encoding
    >>> that
    >>> in their view doubles or worse the amount of bytes used to store their
    >>> language in Unicode, I can tell you that this is a very real concern on
    >>> some
    >>> people's minds.
    >>>
    >>> True or false, it is on their minds. They can all add and multiply, and
    >>> it
    >>> certainly looks like a 2x or 3x situation to them.
    >>>
    >>
    >> Of course it is on their minds! Judging from the titles of emails in
    >> my spam box, size really does matter. But apparently what humanity
    >> really wants to do is MAXIMIZE the size, not minimize it. So a 2x or
    >> 3x situation should be good. :-)
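
To make the 2x or 3x size concern quoted above concrete, here is a small sketch that counts code points and encoded bytes for one Tamil word. The function name `encoded_sizes` and the choice of sample word are assumptions for illustration only; the byte counts follow directly from the UTF-8 and UTF-16 encoding rules for the Tamil block (U+0B80 to U+0BFF).

```python
def encoded_sizes(text):
    """Return the size of text in code points and in encoded bytes."""
    return {
        "code points": len(text),
        "utf-8 bytes": len(text.encode("utf-8")),
        # "utf-16-le" avoids counting a byte-order mark.
        "utf-16 bytes": len(text.encode("utf-16-le")),
    }


# The word "Tamil" in Tamil script: TA, MA, VOWEL SIGN I, LLLA, VIRAMA.
WORD = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

if __name__ == "__main__":
    for label, size in encoded_sizes(WORD).items():
        print(f"{label}: {size}")
    # code points: 5
    # utf-8 bytes: 15
    # utf-16 bytes: 10
```

Every Tamil code point takes three bytes in UTF-8 and two in UTF-16, against one byte per code in an 8-bit encoding, which is where the perceived multiplier comes from.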
    >>
    >> On Feb 8, 2008 5:52 AM, Sinnathurai Srivas <sisrivas@blueyonder.co.uk>
    >> wrote:
    >>> 2/
    >>> My question was: almost all proper publishing software does not yet
    >>> support
    >>> complex rendering. How many years is it since Unicode came into being?
    >>> When is this going to be resolved, or do we plan on choosing an
    >>> alternative
    >>> encoding as Unicode is not working.
    >>>
    >>
    >> Unicode does in fact work very well. Implementing good Unicode
    >> support for complex text layout (CTL) scripts like Tamil is
    >> achievable. Not sure what "proper publishing software" includes --
    >> For example, would that include http://ta.wikipedia.org/ ?
    >>
    >> From an economic perspective, when the markets in South and Southeast
    >> Asia that require complex text layout look enticing enough to the
    >> software vendors, then the problem will be solved. Is it possible
    >> that rampant piracy of commercial software throughout Asia actually
    >> contributes to the problem of poor support for many Asian scripts in
    >> heavy-weight commercial software like Adobe InDesign? This question
    >> might be a great topic of some student's research paper.
    >>
    >> Clearly the commercial players like Adobe InDesign and Quark XPress
    >> and the non-commercial players like Scribus (http://www.scribus.net/)
    >> are all working on providing support for CTL scripts. In this arena,
    >> the Open Source players are influenced by a different set of driving
    >> criteria than the commercial vendors: Does being Open Source encourage
    >> faster development of non-Latin script support? This question might
    >> be a great topic for some other student's research paper.
    >>
    >> In any case, the transparency of development in the Open Source world
    >> allows one to find out exactly how things stand. For example, here is
    >> the link to Scribus' "Support for Non-Latin Languages" meta-bug page:
    >>
    >> http://bugs.scribus.net/view.php?id=3965
    >>
    >> And in the case of Scribus, for example, one is welcome to contribute
    >> well-documented test cases (sample Unicode text along with references
    >> to fonts that are known to work correctly in other software) which the
    >> developers can use for testing the software.
    >>
    >>> 3/
    >>> As for bitmap, I meant the "Rigidly-fixed-width-character" requirements.
    >>> At present, the complex rendering (which is not working yet in these
    >>> systems) will produce extremely wide glyphs which cannot be
    >>> accommodated by "rigidly fixed-width" requirements. What is the plan to
    >>> resolve this?
    >>>
    >>
    >> The only place where "rigidly fixed width" characters are normally
    >> required
    >> that I can think of is in terminal emulators. Once upon a time I
    >> investigated the idea of creating a terminal emulator --along with a
    >> bitmap font-- that would support scripts like Myanmar (Burmese),
    >> Tamil, etc. (Actually, from time to time, I still return to this
    >> idea).
    >>
    >> In existing terminal emulators, Latin glyphs take up one character
    >> cell each, while CJK glyphs are "double-width" and take up 2 character
    >> cells each. The GNU Unifont BMP bitmap font originally designed by
    >> Roman Czyborra (http://en.wikipedia.org/wiki/GNU_Unifont) provides a
    >> good example of how this works: most of the glyphs are 8 pixels wide
    >> by 16 pixels high, but the CJK glyphs are 16 pixels wide by 16 pixels
    >> high.
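
The one-cell versus two-cell convention described above can be sketched with the East Asian Width property exposed by Python's standard `unicodedata` module. This is a simplified assumption about how a terminal might pick a width (real emulators also handle zero-width combining marks and other cases), and the function name `cells` is hypothetical.

```python
import unicodedata


def cells(ch):
    """Return 2 for Wide/Fullwidth characters, otherwise 1.

    Simplified sketch: combining marks and control characters, which a
    real terminal would treat specially, are not handled here.
    """
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


if __name__ == "__main__":
    for ch in ("A", "\u4e2d", "\u0b95"):  # Latin A, CJK ideograph, Tamil KA
        print(f"U+{ord(ch):04X} -> {cells(ch)} cell(s)")
```

Under this convention a Tamil letter gets a single cell, the same as a Latin letter, which is the constraint the rest of this message tries to relax.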
    >>
    >> In the hypothetical system as I had envisioned it, glyphs other than
    >> CJK glyphs could also be double-width. And, in fact, why limit
    >> ourselves to widths of 1 and 2 character cells? When I was
    >> investigating Myanmar, I thought that it actually would be *better* to
    >> allow some glyphs to stretch across 3 or even 4 character cells.
    >>
    >> We can think of this hypothetical terminal emulator as having a
    >> cartesian grid and glyphs of all scripts need to fit into discrete
    >> "quantum" cells : 1, 2, 3, or 4. (Maybe one could even make an
    >> argument for some glyph using up 5 quantum cells?)
    >>
    >> An experienced font designer (or team of designers) would then take up
    >> the challenge of creating a font to use with this terminal emulator.
    >> The font need not be a bitmap font -- it could just as easily be a
    >> vector font. For the sake of argument, let's say we allow this
    >> hypothetical terminal to use vector fonts (i.e., we could just make a
    >> special kind of OpenType font which could even have embedded bitmaps
    >> if desired).
    >>
    >> So for the various Latin blocks of Unicode we could start out with a
    >> suitable "monospaced" font. In a Latin monospaced font, all letters
    >> fit into fixed-width cells so that the advance distances on all glyphs
    >> are the same. This obviously requires some special aesthetic
    >> compromises, especially on the wide Latin letters like "m" and "w".
    >>
    >> To this originally "monospaced" font, we would now add additional
    >> blocks of Unicode. We could pretty much continue working within our
    >> "monospaced" design mantra through many blocks of Unicode -- until, of
    >> course, we hit scripts like Devanagari, Tamil, Myanmar, Khmer, and so
    >> on. Arabic too. At this point, our originally "monospaced" font
    >> becomes no longer "monospaced". Let's give it a new name -- how about
    >> "quantized font" or "quantum spaced font"? Or simply "quantum font" ?
    >>
    >> In this new quantum font, whenever an individual glyph became too
    >> horribly "squished" to fit inside one quantum character cell, then we
    >> would automatically try a 2-cell approach, and if even that did not
    >> work, then go for a 3-or 4-cell approach.
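
The "try one cell, then two, then three or four" rule above could be sketched as follows. Everything here is an assumption made for illustration: the function name `quantum_cells`, the idea of measuring a glyph's natural width in pixels, and the `max_squeeze` tolerance (how much horizontal compression is still acceptable) are not part of the original proposal.

```python
def quantum_cells(natural_width, cell_width, max_cells=4, max_squeeze=1.25):
    """Pick the smallest number of cells a glyph can occupy.

    A glyph fits in n cells if squeezing it into n * cell_width compresses
    it by no more than max_squeeze. If it fits nowhere, use max_cells.
    """
    if natural_width <= 0 or cell_width <= 0:
        raise ValueError("widths must be positive")
    for n in range(1, max_cells + 1):
        if natural_width / (n * cell_width) <= max_squeeze:
            return n
    return max_cells


if __name__ == "__main__":
    # Natural glyph widths in pixels, for an 8-pixel-wide cell.
    for width in (8, 11, 30, 100):
        print(width, "->", quantum_cells(width, 8))
```

A font designer would presumably make this choice by eye rather than by formula; the sketch only shows that the quantized-width idea reduces to a small per-glyph table of cell counts.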
    >>
    >> As a quick and familiar example, let's use Arabic script. On Linux,
    >> the mlterm folks (http://mlterm.sourceforge.net/) have actually
    >> produced a "multilingual" terminal that even handles RTL Arabic. This
    >> is pretty cool. Mlterm uses GNU unifont for its Arabic glyphs.
    >> Arabic in mlterm is readable, which is nice, but it is really ugly.
    >> For example, terminal ARABIC LETTER SHEEN ش looks almost unbearably
    >> *squished*. Clearly, wide arabic letters like isolated or terminal
    >> ARABIC LETTER SHEEN ش or ARABIC LETTER SAAD ص would probably end up
    >> looking *much* nicer if we just allowed them to occupy 2 character
    >> cells. So, in this quantum font, most Arabic letters would still
    >> occupy just one character cell, but a few would occupy up to 2
    >> character cells.
    >>
    >> A similar principle would apply for the creation of the necessary
    >> glyphs for scripts like Myanmar and Tamil -- except in these cases
    >> there would be some glyphs that would necessarily take up 3 or even 4
    >> character cells.
    >>
    >> Well that's my idea, for what it is worth. I even tried my hand at
    >> creating a set of bitmap glyphs for Myanmar which could be added to
    >> GNU Unifont. But after wasting a lot of time on this, I realized I
    >> did not know how to write a terminal emulator. So, maybe someday I
    >> will return to this outlandish project. After I have learned how to
    >> write a terminal emulator.
    >>
    >> - Ed Trager
    >>
    >>
    >>
    >>
    >>
    >>
    >>
    >>
    >>
    >>
    >
    >



    This archive was generated by hypermail 2.1.5 : Tue Feb 12 2008 - 06:23:39 CST