Re: Publishing complex scripts (was Re: minimizing ...)

From: Wunna Ko Ko
Date: Sun Feb 17 2008 - 05:08:59 CST


    Dear Sir,

    On Feb 10, 2008 4:43 PM, James Kass wrote:
    > Eric Muller wrote,
    > >> PDF has long been touted as *the* way to safely send text with the
    > >> assurance that the recipients will be able to display that text exactly
    > >> as the author intended.
    > >
    > > Actually, it is "final form documents", not text.
    > "Portable document format" implies more than merely a method of
    > exchanging graphic information intended to be sent to a printer
    > device by the end user. Indeed, PNG (Portable Network Graphics)
    > can probably be printed by users almost as well. PDF does have obvious
    > advantages over graphic file formats, though.
    > >> Without any real knowledge of the PDF format and what happens when
    > >> converting a file to PDF, it appears to me that it is not text which is
    > >> being embedded. Rather, the process is embedding glyphs.
    > >
    > > Glyphs are the primary construct that is needed for "final form
    > > documents". Glyphs are mandatory in PDFs.
    > I like glyphs and actually consider them useful.
    > > When you see something like "(the car) Tj" in a PDF content stream, the
    > > "the car" piece is only accidentally looking like text (of course an
    > > intended accident, but an accident nevertheless).
    > >> If a glyph
    > >> is mapped to a Unicode value, at least some applications can return that
    > >> value. But, if the glyph is not mapped to a Unicode value (which is
    > >> normally the case with presentation forms used in complex scripts),
    > >> there does not seem to be any effort made to preserve the Unicode
    > >> string which generated the presentation form. And that's really a
    > >> shame.
    > >
    > > Actually, there are ways to include characters in addition to the
    > > glyphs, even when the character/glyph correspondence is not one-for-one
    > > (look for /ActualText in the PDF reference; /ToUnicode maps are
    > > conceptually optimizations of that), but whether those ways are
    > > exploited depends on the PDF generator. Some generators use nothing,
    > > others will generate only /ToUnicode (what you describe) which can
    > > account for only 1-to-1 character/glyph mappings, others will use the
    > > full apparatus.
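
    To make the /ActualText mechanism mentioned above a little more
    concrete, here is a minimal sketch in Python of how a generator might
    wrap a run of glyphs in a marked-content span that carries the original
    characters. The function name and the glyph codes are hypothetical (real
    glyph IDs depend on the embedded font); the /Span ... BDC ... EMC
    structure and the UTF-16BE string with a leading FE FF byte order mark
    follow the PDF reference as I understand it.

```python
def actual_text_span(text: str, glyph_hex: str) -> str:
    """Build a content-stream fragment that shows glyphs and records the
    original characters in /ActualText (illustrative helper, not a real API)."""
    # PDF text strings may be UTF-16BE prefixed with the byte order mark FE FF.
    payload = b"\xfe\xff" + text.encode("utf-16-be")
    return (
        f"/Span << /ActualText <{payload.hex().upper()}> >> BDC\n"
        f"<{glyph_hex}> Tj\n"  # glyph codes are made up for this example
        "EMC"
    )


# Tamil KA + VOWEL SIGN O: two characters that are typically drawn with three
# glyphs, one of them placed to the left of the consonant.
print(actual_text_span("\u0b95\u0bcb", "002A0015003F"))
```

    A viewer that honours /ActualText can return the two original
    characters on copy, whatever the number or order of glyphs in the Tj
    operand.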
    > We all look forward to developers implementing proper mechanisms
    > to preserve the original textual data.
    > > For example, if you take the PDFs generated for the UDHR in Unicode
    > > project (e.g.
    > > for a
    > > small comprehensive example), then except for the space problem
    > > mentioned earlier, I think that you can copy from Acrobat and paste in
    > > Notepad and get back all the text.
    > I've found the UDHR in Unicode PDF files to be quite helpful. It's a
    > worthwhile project, indeed.
    > Not having Acrobat installed here, I tried to test this anyway.
    > Vai looks fine:
    > ꕉꕜꕮ ꔔꘋ ꖸ ꔰ ꗋꘋ ꕮꕨ ꔔꘋ ꖸ ꕎ ꕉꖸꕊ ꕴꖃ ꕃꔤꘂ ꗱ, ꕉꖷ ꗪꗡ ꔻꔤ ꗏꗒꗡ ꕎ ꗪ ꕉꖸꕊ ꖏꕎ. ꕉꕡ ꖏ
    > ꗳꕮꕊ ꗏ ꕪ ꗓ ꕉꖷ ꕉꖸ ꕘꕞ ꗪ. ꖏꖷ ꕉꖸꔧ ꖏ ꖸ ꕚꕌꘂ ꗷꔤ ꕞ ꘃꖷ ꘉꔧ ꗠꖻ ꕞ ꖴꘋ ꔳꕩ ꕉꖸ ꗳ.
    > Tamil does not:
    > ம?த? ?ற???ன? சகல?? ?த??ரம?க?வ ?ற???றன? ; அவ?க? ம?????,
    > உ??மக??? சமம?னவ?க?, அவ?க? ?ய?ய??த?? மன?ச????ய??
    > இய?ப?ப?க? ?ப?றவ?க?. அவ?க? ஒ?வ?ட?ன??வ? ச?க?தர உண???
    > ப???? நட???க??ள? ?வ???.
    > Kannada looks worse than Tamil:
    > ಎ??? ??ನವರ? ಸ?ತಂತ?????? ಜ??ದ????. ??ಗ? ಘನ?? ಮತ?? ಹಕ??ಗಳ??? ಸ??ನ???ದ????. ????ಕ ಮತ??
    > ಅಂತಃಕರಣ ಗಳನ?? ಪ??ದವ??ದ? ?ಂದ ಅವರ? ಪರಸ?ರ ಸ????ದರ ??ವ?ಂದ ವ??ಸ??ಕ?.
    > Hebrew has spaces added:
    > כ ל ב נ י א ד ם נ ו ל ד ו ב נ י ח ו ר י ן ו ש ו ו י ם ב ע ר כ ם ו ב ז כ ו י ו ת י ה ם . כ ו ל ם ח ו נ נ ו ב ת
    > ב ו נ ה ו ב מ צ פ ו ן ,
    > ל פ י כ ך ח ו ב ה ע ל י ה ם ל נ ה ו ג א י ש ב ר ע ה ו ב ר ו ח ש ל א ח ו ה .
    > Burmese has some question marks, maybe a font problem here:
    > (Or maybe my system doesn't support Unicode 5.1 yet?)
    > လူတုိင်းသည် တူညီလွတ်လပ်?သာ ဂုဏ်သိက?ာဖ ြ င့်လည်း?ကာင်း၊ တူညီလွတ်လပ်?သာ
    > အခွင့်အ?ရးများဖ ြ င့်လည်း?ကာင်း၊ ?မွးဖွားလာသူများဖ ြ စ်သည်။ ထုိသူတုိ့၌ပုိင်းခ ြ ား?ဝဖန်တတ်?သာ ဉာဏ်နှင့်
    > ကျင့်ဝတ်သိတတ်?သာ စိတ်တုိ့ရှိက ြ ၍ ထုိသူတုိ့သည် အချင်းချင်း ?မတ?ာထား၍ ဆက်ဆံကျင့်သုံးသင့်၏။

    Your Burmese text is not encoded according to Unicode 5.1 (beta). It uses different code points.
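
    One way to check this is to list the code points of the pasted string
    and compare them with the expected encoding. The following is only a
    rough sketch in Python; the function name and the short sample string
    are my own, and the block range U+1000..U+109F is the Myanmar block of
    the standard.

```python
import unicodedata

MYANMAR = range(0x1000, 0x10A0)  # Myanmar block, U+1000..U+109F


def dump_code_points(text: str) -> list[tuple[str, str, bool]]:
    """Return (code point, character name, is-in-Myanmar-block) for each
    non-space character of the text."""
    rows = []
    for ch in text:
        if ch.isspace():
            continue
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((f"U+{ord(ch):04X}", name, ord(ch) in MYANMAR))
    return rows


# A short sample containing a literal question mark, as in the pasted text.
for code, name, in_block in dump_code_points("\u101c\u1015\u103a?\u101e\u102c"):
    print(code, name, "" if in_block else "<-- outside Myanmar block")
```

    Any question mark that shows up as U+003F in such a listing is a real
    replacement character in the copied text, not a display problem of the
    font.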

    > As Eric points out, success may well depend upon the application used for
    > PDF generation as well as the application displaying the PDF from which
    > the text was copied into Notepad. I used Sumatra to display the PDF and
    > the CutePDF generating application.
    > Getting back to Sinnathurai Srivas' question about when will publishing
    > applications support complex scripts like Tamil... Tamil publishers can
    > successfully embed Tamil text into a PDF document, send it to a publishing
    > house, the publishing house can successfully print on paper from the PDF,
    > bind the printed paper into a book, put the book on the market, and
    > hope the book sells well.
    > So, I'd say the answer is "now", at least for some aspects of publishing
    > and some publishing applications.
    > As far as any other problems associated with PDFs and complex scripts,
    > if we look ten years into the past, there were *no* applications
    > whatsoever which supported Unicode Tamil. We've come a long way
    > in a relatively short time. We still have some distance to travel, though.
    > Best regards,
    > James Kass

    Wunna Ko Ko
