Re: minimizing size (was Re: allocation of Georgian letters)

From: Doug Ewell
Date: Sat Feb 09 2008 - 13:50:10 CST


    James Kass <thunder dash bird at earthlink dot net> wrote:

    >> Except in the presence of bugs such as this, Unicode data can be
    >> copied and pasted from one Unicode-aware program to another
    >> Unicode-aware program with 100% fidelity, regardless of the encoding
    >> model.
    > (Andrew responds well to reported problems, but how can he fix bugs in
    > third-party PDF applications?)

    I am pretty sure this is a BabelPad bug, related to the pasting of text
    into BabelPad, not the copying of text from PDF.

    > The operative phrase is "Unicode-aware application". I believe it
    > would be possible to copy/paste text back-and-forth between BabelPad and
    > Notepad until the mouse wore out without data corruption.

    At the risk of dragging an otherwise excellent text editor further
    through the mud, and solely in the interest of improving BP, I can try
    to produce an example where extra crud is pasted into BP after the
    "real" text.

    > PDF has long been touted as *the* way to safely send text with the
    > assurance that the recipients will be able to display that text
    > exactly as the author intended. While it's true that the recipient
    > sees what was intended, it does not seem to be true that actual text
    > is being sent. Once the material is in PDF format, no further text
    > processing appears to be possible; the actual text has been lost
    > somewhere along the way. (ASCII text notwithstanding.)

    This is an important point: for at least some applications of PDF, the
    recipient can display the text exactly as the author intended, but
    cannot necessarily do anything else with it.

    > Another shame is telling Tamil users that Unicode won't standardize a
    > duplicate encoding until a certain event happens. This gives the
    > misleading impression that there's at least a possibility that Unicode
    > might encode TACE/TUNE.

    Indeed, as I have said many times. Regardless of how firm someone may
    have actually been in a meeting, the reports and meeting minutes have
    consistently indicated that encoding TACE/TUNE in Unicode is a
    possibility, which is either misleading to the proponents (if false) or
    a complete destabilizing of Tamil in Unicode (if true).

    > P.S. - There's a special FAQ page for Tamil encoding issues here:
    > Suggested additions to that page might include:
    > Q: Is there any possibility that a new character encoding scheme for
    > Tamil which considers ligatures as characters will either be added to
    > Unicode side-by-side with the existing Unicode Tamil encoding or
    > replace the current Tamil Unicode encoding model altogether?
    > A: No.

    Q: Then how can we map text between the current Tamil Unicode encoding
    model and a more "correct" sequence of units that reflects the way Tamil
    script users think of their script?

    A: By using the named sequences provided in
    The use of named sequences is described in UAX #34, "Unicode Named
    Character Sequences."
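
    As a rough illustration of that answer, here is a minimal Python sketch
    that segments Tamil text into named-sequence units and builds text back
    from sequence names. The three table entries are a small subset I believe
    matches the Tamil named sequences data; a real tool would load the
    complete list from the data file. The function names to_units and
    from_names are hypothetical, chosen only for this example.

```python
# Illustrative subset of the Tamil named sequences (an assumption: a real
# tool would load the complete table from the Unicode data file).
NAMED_SEQUENCES = {
    "TAMIL CONSONANT K": "\u0B95\u0BCD",   # KA + VIRAMA (pulli)
    "TAMIL SYLLABLE KAA": "\u0B95\u0BBE",  # KA + VOWEL SIGN AA
    "TAMIL SYLLABLE KI": "\u0B95\u0BBF",   # KA + VOWEL SIGN I
}

# Reverse map: code point sequence -> sequence name.
SEQUENCE_NAMES = {seq: name for name, seq in NAMED_SEQUENCES.items()}
MAX_LEN = max(len(seq) for seq in SEQUENCE_NAMES)


def to_units(text):
    """Split text into (unit, name) pairs using longest-match lookup.

    Code points that do not start a known sequence are returned as
    single-character units with a name of None.
    """
    units = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to one code point.
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if length == 1 or chunk in SEQUENCE_NAMES:
                units.append((chunk, SEQUENCE_NAMES.get(chunk)))
                i += length
                break
    return units


def from_names(names):
    """Build ordinary Unicode Tamil text from a list of sequence names."""
    return "".join(NAMED_SEQUENCES[name] for name in names)
```

    The point of the sketch is that the mapping is a plain table lookup in
    both directions: the text stays in the existing Unicode Tamil encoding,
    and the syllable-level view is layered on top of it.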

    (Note: the "provisional" named sequences for Tamil will probably need to
    be upgraded to full approved status before users will take this advice

    Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
