RE: minimizing size (was Re: allocation of Georgian letters)

From: Bala (bala@cse.mrt.ac.lk)
Date: Sat Feb 09 2008 - 11:19:14 CST


    James Kass----
    Another shame is telling Tamil users that Unicode won't standardize
    a duplicate encoding until a certain event happens. This gives the
    misleading impression that there's at least a possibility that Unicode
    might encode TACE/TUNE.

    It would have been much better, in my opinion, to simply have told people
    up front that there is absolutely no possibility whatsoever for such a
    duplicate encoding in the standard. In which case, the people who have
    spent time and effort towards such an encoding could have been doing
    something productive with their time and resources instead of wasting
    them. Like, for example, solving problems with the PDF format
    related to complex scripts.
    -----

    I attended the Chennai meeting last month. At the meeting, the UTC stated
    very clearly that dual encoding is not possible at any stage.
    They suggested a few other solutions in case TACE is to be used (such as
    IANA registration).

    Anyway, this does not mean that Tamil is a complex script. Present-day
    Tamil has a defined set of elements (326) used to build text. Among the
    Indic scripts, as I understand it, Sinhala has the most letters and Tamil
    the fewest. Except for Tamil, the other Indic scripts have combined
    (conjunct) forms, which produce combined letters and make the script
    complex. In Sinhala, a few thousand letters can be generated logically;
    some of them never appear in actual text, but logically such letters
    exist. In Tamil, however, there is no concept of combined letters at all.

    In Tamil, only one conjunct consonant (ksh) and one conjunct syllable
    (shrii) are presently used in text, and both are borrowed elements. This
    is why Tamil should not be considered a complex script, and why it could
    have been expected to be a Level 1 encoding in Unicode. However, Unicode
    was very clear at the Chennai meeting that dual encoding is not possible,
    and that the present encoding cannot be deprecated either.
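
    As a rough illustration (this is only my own sketch, nothing that was
    shown at the meeting), a few lines of Python show how the present Unicode
    encoding builds Tamil syllables as short sequences drawn from the element
    set, including the two borrowed conjuncts:

        # Sketch only: print the code points behind a few rendered Tamil
        # syllables. Each syllable is stored as a short sequence of
        # elements (consonant, pulli, vowel sign), not as one precomposed
        # character per syllable the way TACE would encode it.
        samples = {
            "\u0B95\u0BBF": "ka + vowel sign i -> ki",
            "\u0B95\u0BCD\u0BB7": "ka + pulli + ssa -> the borrowed ksha",
            "\u0BB8\u0BCD\u0BB0\u0BC0": "sa + pulli + ra + vowel sign ii -> shrii",
        }
        for text, note in samples.items():
            codepoints = " ".join(f"U+{ord(ch):04X}" for ch in text)
            print(f"{text}  {codepoints}  ({note})")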

    Thank you

    Kind Regards
    Bala

    -----Original Message-----
    From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of James Kass
    Sent: Saturday, February 09, 2008 3:25 PM
    To: Unicode Mailing List
    Subject: Re: minimizing size (was Re: allocation of Georgian letters)

    Doug Ewell wrote,

    > As much as I like BabelPad (it has replaced SC UniPad as my favorite
    > full-service-Unicode editor), I have had serious problems pasting text
    > into BabelPad from the clipboard. Sometimes there is a large chunk of
    > random text after the "real" data; there have been other symptoms as
    > well. I assume Andrew will be able to resolve these when he has a
    > chance to update the program.
    >
    > Except in the presence of bugs such as this, Unicode data can be copied
    > and pasted from one Unicode-aware program to another Unicode-aware
    > program with 100% fidelity, regardless of the encoding model.

    (Andrew responds well to reported problems, but how can he fix bugs
    in third-party PDF applications?)

    The operative phrase is "Unicode-aware application". I believe it would
    be possible to copy/paste text back and forth between BabelPad and
    Notepad until the mouse wore out without data corruption.

    PDF has long been touted as *the* way to safely send text with the
    assurance that the recipients will be able to display that text exactly
    as the author intended. While it's true that the recipient sees what
    was intended, it does not seem to be true that actual text is being
    sent. Once the material is in PDF format, no further text processing
    appears to be possible; the actual text has been lost somewhere along
    the way. (ASCII text notwithstanding.)

    Without any real knowledge of the PDF format and what happens when
    converting a file to PDF, it appears to me that it is not text which is
    being embedded. Rather, the process is embedding glyphs. If a glyph
    is mapped to a Unicode value, at least some applications can return that
    value. But, if the glyph is not mapped to a Unicode value (which is
    normally the case with presentation forms used in complex scripts),
    there does not seem to be any effort made to preserve the Unicode
    string which generated the presentation form. And that's really a
    shame.
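
    As a rough sketch of the symptom (assuming a hypothetical sample.pdf and
    a third-party extraction library, pdfminer.six here, chosen only for
    illustration), extraction recovers only those code points to which the
    PDF maps its embedded glyphs:

        # Sketch only: dump the code points of whatever text a PDF
        # exposes. Glyphs that carry a glyph-to-Unicode mapping come
        # back as real characters; unmapped presentation forms (the
        # usual case for complex scripts) come back garbled or lost.
        from pdfminer.high_level import extract_text

        extracted = extract_text("sample.pdf")  # hypothetical input file
        for ch in extracted[:40]:
            print(f"U+{ord(ch):04X}  {ch!r}")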

    Another shame is telling Tamil users that Unicode won't standardize
    a duplicate encoding until a certain event happens. This gives the
    misleading impression that there's at least a possibility that Unicode
    might encode TACE/TUNE.

    It would have been much better, in my opinion, to simply have told people
    up front that there is absolutely no possibility whatsoever for such a
    duplicate encoding in the standard. In which case, the people who have
    spent time and effort towards such an encoding could have been doing
    something productive with their time and resources instead of wasting
    them. Like, for example, solving problems with the PDF format
    related to complex scripts.

    Best regards,

    James Kass

    P.S. - There's a special FAQ page for Tamil encoding issues here:
    http://unicode.org/faq/tamil.html

    Suggested additions to that page might include:

    Q: Is there any possibility that a new character encoding scheme for
    Tamil which considers ligatures as characters will either be added to
    Unicode side-by-side with the existing Unicode Tamil encoding or
    replace the current Tamil Unicode encoding model altogether?

    A: No.


