Publishing complex scripts (was Re: minimizing ...)

From: James Kass (thunder-bird@earthlink.net)
Date: Sun Feb 10 2008 - 04:13:39 CST


    Eric Muller wrote,

    >> PDF has long been touted as *the* way to safely send text with the
    >> assurance that the recipients will be able to display that text exactly
    >> as the author intended.
    >
    > Actually, it is "final form documents", not text.

    "Portable document format" implies more than merely a method of
    exchanging graphic information intended to be sent to a printer
    device by the end user. Indeed, PNG (Portable Network Graphics)
    can probably be printed by users almost as well. PDF does have obvious
    advantages over graphic file formats, though.

    >> Without any real knowledge of the PDF format and what happens when
    >> converting a file to PDF, it appears to me that it is not text which is
    >> being embedded. Rather, the process is embedding glyphs.
    >
    > Glyphs is the primary construct that is needed for "final form
    > documents". Glyphs are mandatory in PDFs.

    I like glyphs and actually consider them useful.

    > When you see something like "(the car) Tj" in a PDF content stream, the
    > "the car" piece is only accidentally looking like text (of course an
    > intended accident, but an accident nevertheless).
    >
    >> If a glyph
    >> is mapped to a Unicode value, at least some applications can return that
    >> value. But, if the glyph is not mapped to a unicode value (which is
    >> normally the case with presentation forms used in complex scripts),
    >> there does not seem to be any effort made to preserve the Unicode
    >> string which generated the presentation form. And that's really a
    >> shame.
    >
    > Actually, there are ways to include characters in additions to the
    > glyphs, even when the character/glyph correspondence is not one-for-one
    > (look for /ActualText in the PDF reference; /ToUnicode maps are
    > conceptually optimizations of that), but whether those ways are
    > exploited depend on the PDF generator. Some generators use nothing,
    > other will generate only /ToUnicode (what you describe) which can
    > account for only 1-to-1 character/glyph mappings, others will use the
    > full apparatus.

    We all look forward to developers implementing proper mechanisms
    to preserve the original textual data.
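
    To make the distinction Eric describes concrete, here is a toy
    model in Python. It is only a sketch: it does not read or write
    real PDF syntax, and the glyph IDs, the dictionary, and both
    helper functions are hypothetical names invented for illustration.
    It assumes a font that draws the Tamil syllable "டி" as a single
    ligature glyph, and it models the strictly one-to-one lookup
    described above next to a replacement string attached to the
    whole run, which is the idea behind /ActualText.

```python
# Toy model only: glyph IDs and helper names are hypothetical and do not
# correspond to real PDF syntax or to any PDF library.

# The Tamil word "படி" is three characters in Unicode:
# U+0BAA (PA), U+0B9F (TTA), U+0BBF (vowel sign I).
SOURCE_TEXT = "\u0BAA\u0B9F\u0BBF"

# Assumption: the font draws TTA + vowel sign I as one ligature glyph,
# so the shaped run contains two glyphs for three characters.
GLYPH_RUN = [10, 20]  # 10 = PA, 20 = "ti" ligature (presentation form)

# A strictly one-to-one glyph-to-character map. The ligature stands for
# two characters, so it has no single code point and is left unmapped.
TO_UNICODE = {10: "\u0BAA"}


def extract_per_glyph(glyphs, cmap):
    """Recover text one glyph at a time; unmapped glyphs become '?'."""
    return "".join(cmap.get(g, "?") for g in glyphs)


def extract_with_actual_text(glyphs, cmap, actual_text=None):
    """Prefer a replacement string attached to the whole run, in the
    spirit of /ActualText; otherwise fall back to per-glyph lookup."""
    if actual_text is not None:
        return actual_text
    return extract_per_glyph(glyphs, cmap)


per_glyph = extract_per_glyph(GLYPH_RUN, TO_UNICODE)
with_span = extract_with_actual_text(GLYPH_RUN, TO_UNICODE, SOURCE_TEXT)

print(repr(per_glyph))           # the ligature comes back as '?'
print(with_span == SOURCE_TEXT)  # the original string survives
```

    In this model the per-glyph route loses the ligature, while the
    run-level string returns the original characters regardless of
    how the glyphs were shaped.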

    > For example, if you take the PDFs generated for the UDHR in Unicode
    > project (e.g.
    > http://www.unicode.org/udhr/assemblies/first_article_subset.pdf for a
    > small comprehensive example), then except for the space problem
    > mentioned earlier, I think that you can copy from Acrobat and paste in
    > Notepad and get back all the text.

    I've found the UDHR in Unicode PDF files to be quite helpful. It's a
    worthwhile project, indeed.

    Not having Acrobat installed here, I tried to test this anyway.

    Vai looks fine:

    ꕉꕜꕮ ꔔꘋ ꖸ ꔰ ꗋꘋ ꕮꕨ ꔔꘋ ꖸ ꕎ ꕉꖸꕊ ꕴꖃ ꕃꔤꘂ ꗱ, ꕉꖷ ꗪꗡ ꔻꔤ ꗏꗒꗡ ꕎ ꗪ ꕉꖸꕊ ꖏꕎ. ꕉꕡ ꖏ
    ꗳꕮꕊ ꗏ ꕪ ꗓ ꕉꖷ ꕉꖸ ꕘꕞ ꗪ. ꖏꖷ ꕉꖸꔧ ꖏ ꖸ ꕚꕌꘂ ꗷꔤ ꕞ ꘃꖷ ꘉꔧ ꗠꖻ ꕞ ꖴꘋ ꔳꕩ ꕉꖸ ꗳ.

    Tamil does not:

    ம?த? ?ற???ன? சகல?? ?த??ரம?க?வ ?ற???றன? ; அவ?க? ம?????,
    உ??மக??? சமம?னவ?க?, அவ?க? ?ய?ய??த?? மன?ச????ய??
    இய?ப?ப?க? ?ப?றவ?க?. அவ?க? ஒ?வ?ட?ன??வ? ச?க?தர உண???
    ப???? நட???க??ள? ?வ???.

    Kannada looks worse than Tamil:

    ಎ??? ??ನವರ? ಸ?ತಂತ?????? ಜ??ದ????. ??ಗ? ಘನ?? ಮತ?? ಹಕ??ಗಳ??? ಸ??ನ???ದ????. ????ಕ ಮತ??
    ಅಂತಃಕರಣ ಗಳನ?? ಪ??ದವ??ದ? ?ಂದ ಅವರ? ಪರಸ?ರ ಸ????ದರ ??ವ?ಂದ ವ??ಸ??ಕ?.

    Hebrew has spaces added:

    כ ל ב נ י א ד ם נ ו ל ד ו ב נ י ח ו ר י ן ו ש ו ו י ם ב ע ר כ ם ו ב ז כ ו י ו ת י ה ם . כ ו ל ם ח ו נ נ ו ב ת
    ב ו נ ה ו ב מ צ פ ו ן ,
    ל פ י כ ך ח ו ב ה ע ל י ה ם ל נ ה ו ג א י ש ב ר ע ה ו ב ר ו ח ש ל א ח ו ה .

    Burmese has some question marks, maybe a font problem here:
    (Or maybe my system doesn't support Unicode 6.0 yet?)

    လူတုိင်းသည် တူညီလွတ်လပ်?သာ ဂုဏ်သိက?ာဖ ြ င့်လည်း?ကာင်း၊ တူညီလွတ်လပ်?သာ
    အခွင့်အ?ရးများဖ ြ င့်လည်း?ကာင်း၊ ?မွးဖွားလာသူများဖ ြ စ်သည်။ ထုိသူတုိ့၌ပုိင်းခ ြ ား?ဝဖန်တတ်?သာ ဉာဏ်နှင့်
    ကျင့်ဝတ်သိတတ်?သာ စိတ်တုိ့ရှိက ြ ၍ ထုိသူတုိ့သည် အချင်းချင်း ?မတ?ာထား၍ ဆက်ဆံကျင့်သုံးသင့်၏။

    As Eric points out, success may well depend upon the application used for
    PDF generation as well as the application displaying the PDF from which
    the text was copied into Notepad. I used Sumatra to display the PDF and
    the CutePDF generating application.
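
    For anyone who wants to compare results across viewers and
    generators, a rough way to put a number on "looks worse" is to
    count how many visible characters came back as "?". The Python
    sketch below assumes that every "?" (or U+FFFD) in the pasted
    text stands for a lost character, which will overcount if the
    source text contains real question marks. The function name and
    the sample fragments (taken from the start of the Tamil and
    Kannada results above) are only for illustration.

```python
# Rough sketch: treats every '?' or U+FFFD as a lost character.
# The function name and samples are illustrative assumptions.
REPLACEMENT_MARKS = {"?", "\uFFFD"}


def loss_ratio(pasted: str) -> float:
    """Return the share of non-space characters that are replacement marks."""
    visible = [ch for ch in pasted if not ch.isspace()]
    if not visible:
        return 0.0
    lost = sum(1 for ch in visible if ch in REPLACEMENT_MARKS)
    return lost / len(visible)


# Short fragments from the start of the pasted Tamil and Kannada results.
tamil_fragment = "ம?த? ?ற???ன?"
kannada_fragment = "ಎ??? ??ನವರ?"

for name, text in [("Tamil", tamil_fragment), ("Kannada", kannada_fragment)]:
    print(f"{name}: {loss_ratio(text):.2f}")
```

    Running the same measurement on text copied from different
    viewer and generator combinations would show whether one
    combination preserves more of the original characters.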

    Getting back to Sinnathurai Srivas' question about when publishing
    applications will support complex scripts like Tamil... Tamil publishers
    can successfully embed Tamil text into a PDF document and send it to a
    publishing house. The publishing house can then successfully print on
    paper from the PDF, bind the printed paper into a book, put the book on
    the market, and hope the book sells well.

    So, I'd say the answer is "now", at least for some aspects of publishing
    and some publishing applications.

    As far as any other problems associated with PDFs and complex scripts,
    if we look ten years into the past, there were *no* applications
    whatsoever which supported Unicode Tamil. We've come a long way
    in a relatively short time. We still have some distance to travel, though.

    Best regards,

    James Kass


