Re: minimizing size (was Re: allocation of Georgian letters)

From: Sinnathurai Srivas (sisrivas@blueyonder.co.uk)
Date: Sat Feb 09 2008 - 10:09:43 CST

    >>>>>
    A backward compatible solution would include allowing the current (CTL)
    encoding to work while canonical forms would allow non-CTL encoding.
    Both can work seamlessly.

    Tamil already has a canonical form defined (though wrongly). We need
    to expand on this by providing canonical forms in a logical manner.
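
    One possible reading of "canonical form" here (an assumption, since the
    message does not say which forms are meant) is the canonical
    decomposition that Unicode defines for the two-part Tamil vowel signs.
    A minimal sketch using Python's standard unicodedata module; the helper
    name `decompose` is made up for illustration:

```python
import unicodedata

def decompose(ch):
    """Return the NFD (canonical decomposition) of ch as a list of U+XXXX labels."""
    return [f"U+{ord(p):04X}" for p in unicodedata.normalize("NFD", ch)]

# Tamil two-part vowel signs O, OO and AU each decompose into two code points.
for sign in ("\u0bca", "\u0bcb", "\u0bcc"):
    print(f"U+{ord(sign):04X} ->", " ".join(decompose(sign)))

# NFC goes the other way and recomposes the pair into the single code point.
recomposed = unicodedata.normalize("NFC", "\u0bc6\u0bbe")
```

    These decompositions only split one combining sign into two; they do not
    by themselves remove the need for a rendering engine to reorder glyphs.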

    Forward compatible:
    One more thing: the criteria used for making Devanagari and Bengali CTL
    encodings may or may not be correct. However, the criteria used to make
    Tamil a CTL encoding are not correct. Of course backward compatibility
    must be maintained (and forward compatibility can be considered too). Is a
    non-CTL solution that is made forward-compatible with CTL a possibility?
    Can I expect a solution for rigidly fixed-width requirements using
    canonical forms? Can I expect an IMMEDIATE solution for using Unicode in
    publishing, utilising canonical forms? Could the existing encoding stay
    primary, while canonical forms assist in special circumstances?

    Sinnathurai

    >>>>
    On Feb 8, 2008 4:12 PM, Sinnathurai Srivas <sisrivas@blueyonder.co.uk>
    wrote:
    > Again, what are the criteria for stopping Tamil from using a workable
    > solution, and what are the criteria for enforcing a non-working solution?

    Unicode will change its encoding of Tamil in a non-backward compatible
    way when hell freezes over. This system may be suboptimal, but it is
    the same system as used for Devanagari and Bengali, and does work, if
    not as well or in as many systems as you may hope.

    ----- Original Message -----
    From: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>
    To: "Unicode Discussion" <unicode@unicode.org>
    Sent: 08 February 2008 22:02
    Subject: Re: minimizing size (was Re: allocation of Georgian letters)

    >
    John H. Jenkins wrote in a mail,
    "
    Even if Unicode had used an encoding model for South Asian scripts
    that didn't require complex rendering, the current problem would exist
    because then text would display correctly but, for example, databases
    would have to be substantially rewritten to convert the glyph stream
    back into a series of letters for the operations that they typically
    support.
    "
    >>
    John, could you expand on the above?
    (Additionally, please include what effect canonical forms would have on
    databases.)
    My initial thinking is that non-CTL Tamil would work in databases without
    additional intervention.
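
    The point in the quoted paragraph about converting a glyph stream back
    into letters can be sketched as follows. This assumes a hypothetical
    glyph-order ("visual") storage in which a vowel sign drawn to the left of
    its consonant is also stored before it; the names PREBASE_SIGNS and
    visual_to_logical are made up for illustration and are not part of any
    standard or database API.

```python
KA = "\u0b95"       # TAMIL LETTER KA
SIGN_E = "\u0bc6"   # TAMIL VOWEL SIGN E, drawn to the LEFT of its consonant

# Vowel signs E, EE and AI are displayed before the consonant they follow.
PREBASE_SIGNS = {"\u0bc6", "\u0bc7", "\u0bc8"}

logical = KA + SIGN_E   # order in which Unicode stores the syllable
visual = SIGN_E + KA    # hypothetical glyph-order storage (assumption)

def visual_to_logical(text):
    """Swap each pre-base vowel sign with the character that follows it."""
    out = []
    i = 0
    while i < len(text):
        if text[i] in PREBASE_SIGNS and i + 1 < len(text):
            out.extend([text[i + 1], text[i]])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

# A prefix search for the letter KA works directly on logical-order text,
# but on glyph-order text only after the conversion step.
print(logical.startswith(KA), visual.startswith(KA))
print(visual_to_logical(visual).startswith(KA))
```

    With logical-order storage, a search or sort keyed on the consonant needs
    no such conversion; with glyph-order storage, every such operation would.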

    Sinnathurai

    >>>

    I'm not sufficiently familiar with Tamil to comment intelligently on
    it. My responses were aimed at the more general issue of why complex
    scripts are required and to try to clarify the reasons why some
    scripts require complex rendering and others don't. If your questions
    have to do with Tamil specifically, it would be better for them to be
    answered by someone more familiar with the script.

    =====
    John H. Jenkins
    jenkins@apple.com
    ----- Original Message -----
    From: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>
    To: "Unicode Discussion" <unicode@unicode.org>
    Sent: 08 February 2008 21:12
    Subject: Re: minimizing size (was Re: allocation of Georgian letters)

    1/
    My question was: what are the criteria used to class a language as
    a/ one that requires complex rendering
    b/ one that does not require complex rendering?

    Tamil need not be a CTL script. It can work 100% and work better than
    CTL-enabled Tamil. Why then is Tamil classed as a CTL script? What are
    the criteria?

    For example, Tamil could easily be implemented without the need for any
    complex rendering.
    However, Tamil is currently implemented using complex rendering.
    This was one of the main discussions, and I have not seen a viable answer
    that categorically states that for such and such TECHNICAL reasons Tamil
    was made a script that requires complex rendering.

    As for fixed width:

    For example, in Tamil, a one-cell fixed-width font is currently achieved
    using 8-bit encoding. It cannot be obtained using Unicode as it stands,
    though introducing canonical forms could resolve this by enabling a single
    width for all necessary glyphs.
    Besides terminal emulators, there are many electronic devices, such as
    set-top boxes, that use RIGID fixed width. I do not know how CJK handles
    set-top boxes, etc. Anyway, why move into unnecessary complexities, when
    Tamil can work perfectly well as a non-CTL script, or alternatively by
    defining canonical forms to do away with complex rendering requirements!!

    I think rigid fixed width for Tamil in its rendered form is NOT
    ACHIEVABLE.

    As for publishing, attempts to use Unicode Tamil fail. If it is achievable,
    when will it be ready?

    Again, what are the criteria for stopping Tamil from using a workable
    solution, and what are the criteria for enforcing a non-working solution?

    I think we can at least move fast: if we introduce all necessary canonical
    forms now, most of the publishing s/w may work with canonical forms.

    Sinnathurai

    ----- Original Message -----
    From: "Ed Trager" <ed.trager@gmail.com>
    To: "Unicode Discussion" <unicode@unicode.org>
    Sent: 08 February 2008 20:02
    Subject: Re: minimizing size (was Re: allocation of Georgian letters)

    Hi, everyone,

    Just a few brief comments on this thread:

    >
    > Having flown halfway around the world to talk to people who for whatever
    > reasons, both valid and invalid (and not really distinguishing which is
    > which on their list of concerns), are unhappy with a language encoding
    > that in their view doubles or worse the amount of bytes used to store
    > their language in Unicode, I can tell you that this is a very real
    > concern on some people's minds.
    >
    > True or false, it is on their minds. They can all add and multiply, and it
    > certainly looks like a 2x or 3x situation to them.
    >

    Of course it is on their minds! Judging from the titles of emails in
    my spam box, size really does matter. But apparently what humanity
    really wants to do is MAXIMIZE the size, not minimize it. So a 2x or
    3x situation should be good. :-)
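
    For reference, the 2x or 3x figure quoted above can be checked directly.
    A minimal sketch, assuming the comparison is against a legacy one-byte-
    per-character encoding; the helper name `encoded_sizes` is made up for
    illustration:

```python
def encoded_sizes(text):
    """Count code points and encoded bytes for a string."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_bytes": len(text.encode("utf-16-le")),  # no byte-order mark
    }

# The word "Tamil" in Tamil script: TA, MA, VOWEL SIGN I, LLLA, VIRAMA.
tamil_word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

print(encoded_sizes(tamil_word))
print(encoded_sizes("Tamil"))
```

    Every code point in the Tamil block (U+0B80..U+0BFF) takes 3 bytes in
    UTF-8 and 2 bytes in UTF-16, which is where the 3x and 2x figures come
    from when compared with one byte per character.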

    On Feb 8, 2008 5:52 AM, Sinnathurai Srivas <sisrivas@blueyonder.co.uk>
    wrote:
    > 2/
    > My question was: most proper publishing software does not yet support
    > complex rendering. How many years is it since Unicode came into being?
    > When is this going to be resolved, or do we plan on choosing an
    > alternative encoding, as Unicode is not working?
    >

    Unicode does in fact work very well. Implementing good Unicode
    support for complex text layout (CTL) scripts like Tamil is
    achievable. Not sure what "proper publishing software" includes --
    For example, would that include http://ta.wikipedia.org/ ?

    From an economic perspective, when the markets in South and Southeast
    Asia that require complex text layout look enticing enough to the
    software vendors, then the problem will be solved. Is it possible
    that rampant piracy of commercial software throughout Asia actually
    contributes to the problem of poor support for many Asian scripts in
    heavy-weight commercial software like Adobe InDesign? This question
    might be a great topic of some student's research paper.

    Clearly the commercial players like Adobe InDesign and Quark XPress
    and the non-commercial players like Scribus (http://www.scribus.net/)
    are all working on providing support for CTL scripts. In this arena,
    the Open Source players are influenced by a different set of driving
    criteria than the commercial vendors: Does being Open Source encourage
    faster development of non-Latin script support? This question might
    be a great topic for some other student's research paper.

    In any case, the transparency of development in the Open Source world
    allows one to find out exactly how things stand. For example, here is
    the link to Scribus' "Support for Non-Latin Languages" meta-bug page:

               http://bugs.scribus.net/view.php?id=3965

    And in the case of Scribus, for example, one is welcome to contribute
    well-documented test cases (sample Unicode text along with references
    to fonts that are known to work correctly in other software) which the
    developers can use for testing the software.

    > 3/
    > As for bitmap, I meant the "rigidly-fixed-width-character" requirements.
    > At present, the complex rendering (which is not working yet in these
    > systems) will produce extremely wide glyphs which cannot be
    > accommodated by "rigidly fixed-width" requirements. What is the plan to
    > resolve this?
    >

    The only place where "rigidly fixed width" characters are normally required
    that I can think of is in terminal emulators. Once upon a time I
    investigated the idea of creating a terminal emulator --along with a
    bitmap font-- that would support scripts like Myanmar (Burmese),
    Tamil, etc. (Actually, from time to time, I still return to this
    idea).

    In existing terminal emulators, Latin glyphs take up one character
    cell each, while CJK glyphs are "double-width" and take up 2 character
    cells each. The GNU Unifont BMP bitmap font originally designed by
    Roman Czyborra (http://en.wikipedia.org/wiki/GNU_Unifont) provides a
    good example of how this works: most of the glyphs are 8 pixels wide
    by 16 pixels high, but the CJK glyphs are 16 pixels wide by 16 pixels
    high.
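
    The one-cell versus two-cell rule described above roughly corresponds to
    the East Asian Width property. A minimal sketch using Python's standard
    unicodedata module; the function names are made up for illustration, and
    real terminals use more elaborate tables (combining marks, for example,
    are counted as one cell here, which is a simplification):

```python
import unicodedata

def cell_width(ch):
    """Return 2 for East Asian Wide/Fullwidth characters, otherwise 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def line_width(text):
    """Total number of character cells a string occupies under this rule."""
    return sum(cell_width(ch) for ch in text)

print(line_width("ab"))            # two Latin letters
print(line_width("\u4e2d\u6587"))  # two CJK ideographs
```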

    In the hypothetical system as I had envisioned it, glyphs other than
    CJK glyphs could also be double-width. And, in fact, why limit
    ourselves to widths of 1 and 2 character cells? When I was
    investigating Myanmar, I thought that it actually would be *better* to
    allow some glyphs to stretch across 3 or even 4 character cells.

    We can think of this hypothetical terminal emulator as having a
    Cartesian grid, and glyphs of all scripts need to fit into a discrete
    number of "quantum" cells: 1, 2, 3, or 4. (Maybe one could even make an
    argument for some glyph using up 5 quantum cells?)

    An experienced font designer (or team of designers) would then take up
    the challenge of creating a font to use with this terminal emulator.
    The font need not be a bitmap font -- it could just as easily be a
    vector font. For the sake of argument, let's say we allow this
    hypothetical terminal to use vector fonts (i.e., we could just make a
    special kind of OpenType font which could even have embedded bitmaps
    if desired).

    So for the various Latin blocks of Unicode we could start out with a
    suitable "monospaced" font. In a Latin monospaced font, all letters
    fit into fixed-width cells so that the advance distances on all glyphs
    are the same. This obviously requires some special aesthetic
    compromises, especially on the wide Latin letters like "m" and "w".

    To this originally "monospaced" font, we would now add additional
    blocks of Unicode. We could pretty much continue working within our
    "monospaced" design mantra through many blocks of Unicode -- until, of
    course, we hit scripts like Devanagari, Tamil, Myanmar, Khmer, and so
    on. Arabic too. At this point, our originally "monospaced" font
    becomes no longer "monospaced". Let's give it a new name -- how about
    "quantized font" or "quantum spaced font"? Or simply "quantum font" ?

    In this new quantum font, whenever an individual glyph became too
    horribly "squished" to fit inside one quantum character cell, then we
    would automatically try a 2-cell approach, and if even that did not
    work, then go for a 3- or 4-cell approach.
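
    The cell-assignment rule just described could be sketched like this. The
    notion of a glyph's "natural width" and the squeeze threshold `min_scale`
    are assumptions introduced for illustration; the message itself leaves
    the judgement of "too squished" to the font designer.

```python
def quantum_cells(natural_width, cell_width, min_scale=0.8, max_cells=4):
    """Smallest number of cells that holds a glyph without over-squeezing it.

    natural_width: width the glyph would have in a proportional design
    cell_width:    width of one character cell, in the same units
    min_scale:     tightest horizontal compression tolerated (assumed value)
    """
    for n in range(1, max_cells + 1):
        if natural_width * min_scale <= n * cell_width:
            return n
    return max_cells  # still too wide: accept the squeeze at the cap

# Hypothetical glyph widths in font units, with a 600-unit cell.
for width in (500, 1100, 2000, 10000):
    print(width, "->", quantum_cells(width, 600))
```

    A designer would presumably override such a rule glyph by glyph; the
    function only captures the "try 1, then 2, then 3 or 4" order.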

    As a quick and familiar example, let's use Arabic script. On Linux,
    the mlterm folks (http://mlterm.sourceforge.net/) have actually
    produced a "multilingual" terminal that even handles RTL Arabic. This
    is pretty cool. Mlterm uses GNU unifont for its Arabic glyphs.
    Arabic in mlterm is readable, which is nice, but it is really ugly.
    For example, terminal ARABIC LETTER SHEEN ش looks almost unbearably
    *squished*. Clearly, wide Arabic letters like isolated or terminal
    ARABIC LETTER SHEEN ش or ARABIC LETTER SAD ص would probably end up
    looking *much* nicer if we just allowed them to occupy 2 character
    cells. So, in this quantum font, most Arabic letters would still
    occupy just one character cell, but a few would occupy up to 2
    character cells.

    A similar principle would apply for the creation of the necessary
    glyphs for scripts like Myanmar and Tamil -- except in these cases
    there would be some glyphs that would necessarily take up 3 or even 4
    character cells.
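
    For scripts like Tamil, the unit that would be given 1 to 4 cells is
    presumably a consonant together with its attached signs rather than a
    single code point. A simplified sketch of that grouping, attaching every
    combining mark to the preceding character; this is an assumption for
    illustration and not the full Unicode grapheme cluster algorithm, and the
    name `clusters` is made up:

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks that follow it."""
    out = []
    for ch in text:
        # General categories Mn, Mc and Me all start with "M".
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out

# TA, MA + VOWEL SIGN I, LLLA + VIRAMA: five code points, three clusters.
tamil_word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
print(len(tamil_word), len(clusters(tamil_word)))
```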

    Well that's my idea, for what it is worth. I even tried my hand at
    creating a set of bitmap glyphs for Myanmar which could be added to
    GNU Unifont. But after wasting a lot of time on this, I realized I
    did not know how to write a terminal emulator. So, maybe someday I
    will return to this outlandish project. After I have learned how to
    write a terminal emulator.

    - Ed Trager


