Re: minimizing size (was Re: allocation of Georgian letters)

From: Sinnathurai Srivas (sisrivas@blueyonder.co.uk)
Date: Sat Feb 09 2008 - 10:09:43 CST

Next message: Bala: "RE: minimizing size (was Re: allocation of Georgian letters)"

Previous message: Jeroen Ruigrok van der Werven: "Localized software (was: Re: minimizing size)"
In reply to: Sinnathurai Srivas: "Re: minimizing size (was Re: allocation of Georgian letters)"
Next in thread: Sinnathurai Srivas: "Re: minimizing size (was Re: allocation of Georgian letters)"
Reply: Sinnathurai Srivas: "Re: minimizing size (was Re: allocation of Georgian letters)"

>>>>>
A backward compatible solution would include allowing the current (CTL)
encoding to work while canonical forms would allow non-CTL encoding.
Both can work seamlessly.

Tamil already has a canonical form defined (though wrongly). We need
to expand on providing canonical forms in a logical manner.
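
For reference, the canonical equivalences that Unicode already defines for Tamil can be inspected with a short script. This is a minimal sketch using Python's standard `unicodedata` module; the helper name `decompose` is only illustrative. It shows the existing decomposition of TAMIL VOWEL SIGN O (U+0BCA), not the additional canonical forms proposed in this message.

```python
import unicodedata

def decompose(text):
    """Return the NFD (canonically decomposed) code points as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", text)]

# KA + VOWEL SIGN O, and the canonically equivalent two-part spelling
# KA + VOWEL SIGN E + VOWEL SIGN AA.
ko_precomposed = "\u0B95\u0BCA"
ko_decomposed = "\u0B95\u0BC6\u0BBE"

print(decompose(ko_precomposed))  # ['U+0B95', 'U+0BC6', 'U+0BBE']

# Normalizing to NFC folds the two-part spelling back into U+0BCA,
# so both spellings compare equal after normalization.
print(unicodedata.normalize("NFC", ko_decomposed) == ko_precomposed)  # True
```

Both spellings denote the same text, which is the sense in which a canonical form lets two encodings of one syllable coexist.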

Forward Compatible:
One more thing: the criteria used for making Devanagari and Bengali CTL
encodings may or may not be correct. However, the criteria used to make
Tamil a CTL encoding are not correct. Of course backward compatibility
(even forward compatibility can be considered) must be maintained. Is a
non-CTL solution that is made forward-compatible with CTL a possibility?
Can I expect a solution for rigidly fixed-width requirements using
canonical forms? Can I expect an IMMEDIATE solution to use Unicode in
publishing, utilising canonical forms? Could the existing encoding stay
primary, while canonical forms assist in special circumstances?

Sinnathurai

>>>>
On Feb 8, 2008 4:12 PM, Sinnathurai Srivas <sisrivas@blueyonder.co.uk>
wrote:
> Again, what are the criteria for stopping Tamil from using a workable
> solution, and what are the criteria for enforcing a non-working solution?

Unicode will change its encoding of Tamil in a non-backward compatible
way when hell freezes over. This system may be suboptimal, but it is
the same system as used for Devanagari and Bengali, and does work, if
not as well or in as many systems as you may hope.

----- Original Message -----
From: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>
To: "Unicode Discussion" <unicode@unicode.org>
Sent: 08 February 2008 22:02
Subject: Re: minimizing size (was Re: allocation of Georgian letters)

>
John H. Jenkins wrote in a mail,
"
Even if Unicode had used an encoding model for South Asian scripts
that didn't require complex rendering, the current problem would exist
because then text would display correctly but, for example, databases
would have to be substantially rewritten to convert the glyph stream
back into a series of letters for the operations that they typically
support.
"
>>
John, could you expand on the above?
(Additionally, please include what effect canonical forms would have on
databases.)
My initial thinking is that non-CTL Tamil would work in databases without
additional interventions.

Sinnathurai

>>>

I'm not sufficiently familiar with Tamil to comment intelligently on
it. My responses were aimed at the more general issue of why complex
scripts are required and to try to clarify the reasons why some
scripts require complex rendering and others don't. If your questions
have to do with Tamil specifically, it would be better for them to be
answered by someone more familiar with the script.

=====
John H. Jenkins
jenkins@apple.com
----- Original Message -----
From: "Sinnathurai Srivas" <sisrivas@blueyonder.co.uk>
To: "Unicode Discussion" <unicode@unicode.org>
Sent: 08 February 2008 21:12
Subject: Re: minimizing size (was Re: allocation of Georgian letters)

1/
My question was: what are the criteria used to class a script as
a/ one that requires complex rendering
b/ one that does not require complex rendering?

Tamil need not be a CTL script. It can work 100% and work better than CTL
enabled Tamil. Why then is Tamil classed as a CTL script? What are the
criteria?

For example, Tamil could easily be implemented without the need for any
complex rendering.
However, Tamil is currently implemented using complex rendering.
This was one of the main discussions and I have not seen a viable answer
that categorically states for such and such TECHNICAL reasons Tamil was
made one that requires complex rendering.

As for fixed width:

For example, in Tamil, the one-cell fixed-width font is currently achieved
using 8-bit encoding. It cannot be obtained using Unicode as it stands,
though introducing canonical forms can resolve this by enabling a single
width for all necessary glyphs.
This is not only about terminal emulators; many electronic devices, such
as set-top boxes, use RIGID fixed width. I do not know how CJK handles
set-top boxes, etc. Anyway, why move into unnecessary complexities, when
Tamil can work perfectly well as a non-CTL script or alternatively by
defining canonical forms to do away with complex rendering requirements!!

I think rigid fixed width for Tamil with its rendered form is NOT
ACHIEVABLE.

As for publishing, the attempt to use Unicode Tamil fails. If it is
achievable, when will it be ready?

Again, what are the criteria for stopping Tamil from using a workable
solution, and what are the criteria for enforcing a non-working solution?

I think we can at least move fast: if we introduce all necessary canonical
forms now, most of the publishing s/w may work with canonical forms.

Sinnathurai

----- Original Message -----
From: "Ed Trager" <ed.trager@gmail.com>
To: "Unicode Discussion" <unicode@unicode.org>
Sent: 08 February 2008 20:02
Subject: Re: minimizing size (was Re: allocation of Georgian letters)

Hi, everyone,

Just a few brief comments on this thread:

>
> Having flown halfway around the world to talk to people who for whatever
> reasons, both valid and invalid (and not really distinguishing which is
> which on their list of concerns), are unhappy with a language encoding
> that in their view doubles or worse the amount of bytes used to store
> their language in Unicode, I can tell you that this is a very real
> concern on some people's minds.
>
> True or false, it is on their minds. They can all add and multiply, and it
> certainly looks like a 2x or 3x situation to them.
>

Of course it is on their minds! Judging from the titles of emails in
my spam box, size really does matter. But apparently what humanity
really wants to do is MAXIMIZE the size, not minimize it. So a 2x or
3x situation should be good. :-)
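
The 2x or 3x figure quoted above can be checked directly. This is a minimal sketch in Python; the function name `encoded_sizes` and the sample words are illustrative assumptions, not anything from the thread. Tamil code points lie in U+0B80..U+0BFF, so each takes three bytes in UTF-8 and two in UTF-16. The comparison with a legacy 8-bit glyph encoding depends on that encoding and is not shown.

```python
def encoded_sizes(text):
    """Count code points and encoded bytes for a string."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        # "utf-16-le" avoids counting the 2-byte byte order mark.
        "utf16_bytes": len(text.encode("utf-16-le")),
    }

latin = "tamil"
tamil = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"  # the word "Tamil" in Tamil script

print(encoded_sizes(latin))  # {'code_points': 5, 'utf8_bytes': 5, 'utf16_bytes': 10}
print(encoded_sizes(tamil))  # {'code_points': 5, 'utf8_bytes': 15, 'utf16_bytes': 10}
```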

On Feb 8, 2008 5:52 AM, Sinnathurai Srivas <sisrivas@blueyonder.co.uk>
wrote:
> 2/
> My question was: most proper publishing software does not yet support
> complex rendering. How many years is it since Unicode came into being?
> When is this going to be resolved, or do we plan on choosing an
> alternative encoding, as Unicode is not working?
>

Unicode does in fact work very well. Implementing good Unicode
support for complex text layout (CTL) scripts like Tamil is
achievable. Not sure what "proper publishing software" includes --
For example, would that include http://ta.wikipedia.org/ ?

From an economic perspective, when the markets in South and Southeast
Asia that require complex text layout look enticing enough to the
software vendors, then the problem will be solved. Is it possible
that rampant piracy of commercial software throughout Asia actually
contributes to the problem of poor support for many Asian scripts in
heavy-weight commercial software like Adobe InDesign? This question
might be a great topic of some student's research paper.

Clearly the commercial players like Adobe InDesign and Quark XPress
and the non-commercial players like Scribus (http://www.scribus.net/)
are all working on providing support for CTL scripts. In this arena,
the Open Source players are influenced by a different set of driving
criteria than the commercial vendors: Does being Open Source encourage
faster development of non-Latin script support? This question might
be a great topic for some other student's research paper.

In any case, the transparency of development in the Open Source world
allows one to find out exactly how things stand. For example, here is
the link to Scribus' "Support for Non-Latin Languages" meta-bug page:

http://bugs.scribus.net/view.php?id=3965

And in the case of Scribus, for example, one is welcome to contribute
well-documented test cases (sample Unicode text along with references
to fonts that are known to work correctly in other software) which the
developers can use for testing the software.

> 3/
> As for bitmap, I meant the "rigidly-fixed-width-character" requirements.
> At present, the complex rendering (which is not working yet in these
> systems) will produce extremely wide glyphs which cannot be
> accommodated by "rigidly-fixed-width" requirements. What is the plan to
> resolve this?
>

The only place where "rigidly fixed width" characters are normally required
that I can think of is in terminal emulators. Once upon a time I
investigated the idea of creating a terminal emulator --along with a
bitmap font-- that would support scripts like Myanmar (Burmese),
Tamil, etc. (Actually, from time to time, I still return to this
idea).

In existing terminal emulators, Latin glyphs take up one character
cell each, while CJK glyphs are "double-width" and take up 2 character
cells each. The GNU Unifont BMP bitmap font originally designed by
Roman Czyborra (http://en.wikipedia.org/wiki/GNU_Unifont) provides a
good example of how this works: most of the glyphs are 8 pixels wide
by 16 pixels high, but the CJK glyphs are 16 pixels wide by 16 pixels
high.
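
The one-cell versus two-cell convention described above can be sketched as follows. This is a simplified illustration using the East Asian Width property from Python's standard `unicodedata` module as a stand-in for a terminal's width table; the function names are hypothetical, and real terminals also give combining marks zero width, which this sketch ignores.

```python
import unicodedata

def cell_width(ch):
    """Cells occupied by one character: 2 for Wide/Fullwidth, else 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def line_width(text):
    """Total cells for a string, summing per-character widths."""
    return sum(cell_width(ch) for ch in text)

print(cell_width("A"))        # 1  (Latin letter, one cell)
print(cell_width("\u6F22"))   # 2  (CJK ideograph, two cells)
print(line_width("A\u6F22"))  # 3
```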

In the hypothetical system as I had envisioned it, glyphs other than
CJK glyphs could also be double-width. And, in fact, why limit
ourselves to widths of 1 and 2 character cells? When I was
investigating Myanmar, I thought that it actually would be *better* to
allow some glyphs to stretch across 3 or even 4 character cells.

We can think of this hypothetical terminal emulator as having a
Cartesian grid, and glyphs of all scripts need to fit into a discrete
number of "quantum" cells: 1, 2, 3, or 4. (Maybe one could even make an
argument for some glyph using up 5 quantum cells?)

An experienced font designer (or team of designers) would then take up
the challenge of creating a font to use with this terminal emulator.
The font need not be a bitmap font -- it could just as easily be a
vector font. For the sake of argument, let's say we allow this
hypothetical terminal to use vector fonts (i.e., we could just make a
special kind of OpenType font which could even have embedded bitmaps
if desired).

So for the various Latin blocks of Unicode we could start out with a
suitable "monospaced" font. In a Latin monospaced font, all letters
fit into fixed-width cells so that the advance distances on all glyphs
are the same. This obviously requires some special aesthetic
compromises, especially on the wide Latin letters like "m" and "w".

To this originally "monospaced" font, we would now add additional
blocks of Unicode. We could pretty much continue working within our
"monospaced" design mantra through many blocks of Unicode -- until, of
course, we hit scripts like Devanagari, Tamil, Myanmar, Khmer, and so
on. Arabic too. At this point, our originally "monospaced" font
becomes no longer "monospaced". Let's give it a new name -- how about
"quantized font" or "quantum spaced font"? Or simply "quantum font" ?

In this new quantum font, whenever an individual glyph became too
horribly "squished" to fit inside one quantum character cell, then we
would automatically try a 2-cell approach, and if even that did not
work, then go for a 3- or 4-cell approach.
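
The cell-selection rule just described might look like this in code. It is only a sketch under stated assumptions: the name `quantum_cells`, the `max_squeeze` threshold (how far a glyph may be compressed before it counts as too squished), and the sample widths are all hypothetical, and in practice a font designer would make this choice by eye.

```python
MAX_CELLS = 4  # upper limit on cells per glyph, as discussed above

def quantum_cells(natural_width, cell_width, max_squeeze=0.8):
    """Pick the smallest cell count that avoids over-compressing a glyph.

    natural_width: width the glyph would like to have (e.g. in pixels)
    cell_width:    width of one character cell in the same units
    max_squeeze:   assumed smallest acceptable fraction of natural_width
    """
    for n in range(1, MAX_CELLS + 1):
        if n * cell_width >= natural_width * max_squeeze:
            return n
    # Nothing fits comfortably: fall back to the widest allowed slot.
    return MAX_CELLS

print(quantum_cells(8, 8))    # 1
print(quantum_cells(14, 8))   # 2
print(quantum_cells(28, 8))   # 3
print(quantum_cells(100, 8))  # 4
```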

As a quick and familiar example, let's use Arabic script. On Linux,
the mlterm folks (http://mlterm.sourceforge.net/) have actually
produced a "multilingual" terminal that even handles RTL Arabic. This
is pretty cool. Mlterm uses GNU Unifont for its Arabic glyphs.
Arabic in mlterm is readable, which is nice, but it is really ugly.
For example, terminal ARABIC LETTER SHEEN ش looks almost unbearably
*squished*. Clearly, wide Arabic letters like isolated or terminal
ARABIC LETTER SHEEN ش or ARABIC LETTER SAD ص would probably end up
looking *much* nicer if we just allowed them to occupy 2 character
cells. So, in this quantum font, most Arabic letters would still
occupy just one character cell, but a few would occupy up to 2
character cells.

A similar principle would apply for the creation of the necessary
glyphs for scripts like Myanmar and Tamil -- except in these cases
there would be some glyphs that would necessarily take up 3 or even 4
character cells.

Well that's my idea, for what it is worth. I even tried my hand at
creating a set of bitmap glyphs for Myanmar which could be added to
GNU Unifont. But after wasting a lot of time on this, I realized I
did not know how to write a terminal emulator. So, maybe someday I
will return to this outlandish project. After I have learned how to
write a terminal emulator.

- Ed Trager


This archive was generated by hypermail 2.1.5 : Sat Feb 09 2008 - 10:13:41 CST