RE: minimizing size (was Re: allocation of Georgian letters)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Feb 09 2008 - 17:22:55 CST

    James Kass wrote:
    > I'm coming from the old IBM-PC days when control "C" copied
    > selected text into the buffer, and control "V" copied
    > whatever is in the buffer to the active text area. (Still
    > works, too! Except that the buffer now apparently accepts
    > non-textual data.)

    This is not a new feature of the clipboard. The Windows clipboard has
    always accepted several formats, and the "memory buffer" is not completely
    filled until both applications agree on the format to use. The clipboard
    itself also takes part: it supports some basic formats and keeps the copied
    data internally, performing data conversion when the data actually gets
    copied into it.

    The clipboard has always held not a single piece of data but several
    renderings that need to be enumerated; the application accepting data from
    the clipboard should enumerate the formats to see which one best fits its
    needs. However, the clipboard itself does not verify what data is being
    copied into it: if the source application says it is basic text, then the
    clipboard keeps it as is, just converting it to Unicode internally if the
    source application uses another encoding, or keeping the local system's
    ANSI or OEM code page.

    The clipboard's internal formats should always be negotiated.
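
    To make the negotiation concrete, here is a rough sketch in Python. It is
    only a toy model, not the Windows API: the ToyClipboard class and the
    paste_best function are hypothetical, and the format names are borrowed
    from the usual Windows identifiers purely as string labels.

```python
# Toy model of clipboard format negotiation (NOT the real Windows API).
# ToyClipboard and paste_best are hypothetical names used for illustration;
# the format names are only string labels here.

class ToyClipboard:
    def __init__(self):
        self._renderings = {}  # format name -> data, in insertion order

    def set_data(self, fmt, data):
        # The clipboard stores what it is given; it does not check that the
        # data really matches the format the source application declared.
        self._renderings[fmt] = data

    def enum_formats(self):
        # List every format currently offered by the source application.
        return list(self._renderings)

    def get_data(self, fmt):
        return self._renderings[fmt]


def paste_best(clipboard, preferred):
    """Return (format, data) for the first preferred format on offer."""
    offered = clipboard.enum_formats()
    for fmt in preferred:
        if fmt in offered:
            return fmt, clipboard.get_data(fmt)
    return None  # nothing the target application understands


clip = ToyClipboard()
clip.set_data("CF_UNICODETEXT", "copied text")
clip.set_data("CF_BITMAP", b"\x00\x01")

# Two target applications with different preferences read the same clipboard.
text_editor = paste_best(clip, ["CF_UNICODETEXT", "CF_TEXT"])
image_editor = paste_best(clip, ["CF_DIB", "CF_BITMAP"])
```

    Each target application walks its own preference list, so the same copy
    can be pasted as text in one place and as an image in another.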

    But it's true that some applications put garbage data into the clipboard
    when copying into it. One of them is Adobe Reader, but this most often
    comes from the fact that the PDF documents were created with custom fonts
    that don't obey a standard encoding, or where the encoding was "tweaked" to
    reuse another "similar" encoding within these fonts, with a non-standard
    mapping from text to glyphs.

    This happens quite often with PDF creation tools that build custom fonts
    to reduce the size of the PDF: instead of embedding the original font
    definitions, they assign linear codes to each glyph as it appears in the
    source text, so the codes follow no standard order. How can Adobe Reader
    "guess" which character corresponds to each glyph ID actually used in the
    PDF? That's a difficult task. Not all PDFs are created to allow
    copy-pasting from them; many are just meant to be viewed or printed the way
    they were laid out in the original document, and nothing else.
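
    A small Python sketch of the problem, under simplifying assumptions: the
    build_subset and extract_text helpers are hypothetical, and real PDF fonts
    are far more involved. The optional code-to-character table stands in for
    the kind of mapping a PDF can carry (PDF calls it a ToUnicode CMap); I am
    assuming that is the sort of information the viewer is missing here.

```python
# Toy model of a PDF producer that numbers glyphs in order of first
# appearance. build_subset and extract_text are hypothetical helpers.

def build_subset(text):
    """Assign codes 1, 2, 3, ... to characters as they first appear."""
    code_for = {}
    for ch in text:
        if ch not in code_for:
            code_for[ch] = len(code_for) + 1
    codes = [code_for[ch] for ch in text]
    # Reverse table: what the producer would need to embed so that a viewer
    # can turn glyph codes back into characters.
    to_unicode = {code: ch for ch, code in code_for.items()}
    return codes, to_unicode


def extract_text(codes, to_unicode=None):
    """What a viewer can recover when text is copied out of the page."""
    if to_unicode is None:
        # No mapping embedded: the viewer can only treat the glyph codes as
        # character codes, which yields garbage.
        return "".join(chr(code) for code in codes)
    return "".join(to_unicode[code] for code in codes)


codes, to_unicode = build_subset("banana")
garbage = extract_text(codes)                # control characters, not text
recovered = extract_text(codes, to_unicode)  # the original string
```

    The page renders identically in both cases, since the glyph shapes are
    the same; only the copy-and-paste result differs.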

    A PDF document is not a text document but a collection of drawing
    primitives and collections of glyphs. These glyphs are not necessarily
    indexed by a standard character encoding, because the encoding effectively
    used is local to the document itself. However, not respecting some
    conventions disables important features of PDF documents, such as the
    ability to reliably perform full-text searches in them and to index large
    collections of documents.

    Don't blame Adobe Reader too much; blame the PDF creation tools for not
    respecting these conventions, and the authors of these tools for not
    verifying that the tool permits reuse of the document content by
    legitimate document authors.


