Re: Joined "ti" coded as "Ɵ" in PDF

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Thu, 17 Mar 2016 21:18:35 +0100

2016-03-17 19:02 GMT+01:00 Pierpaolo Bernardi <olopierpa_at_gmail.com>:

> On Thu, Mar 17, 2016 at 6:37 PM, Leonardo Boiko <leoboiko_at_namakajiri.net>
> wrote:
> > The PDF *displays* correctly. But try copying the string 'ti' from
> > the text into another application outside of your PDF viewer, and you'll
> > see that the thing that *displays* as 'ti' is *coded* as Ɵ, as Don
> > Osborn said.
>
> Ah. OK. Anyway this is not a Unicode problem. PDF knows nothing about
> Unicode. It uses the encoding of the fonts used.
>

That's correct; however, the PDF specification contains guidelines for
naming glyphs in fonts in such a way that the encoding can be recovered.
This is needed, for example, in applications such as PDF forms where user
input is expected. When those PDFs are generated from rich text, the fonts
used may be TrueType (without any glyph names in them, only mappings from
sequences of code points), OpenType, or PostScript. When OpenType fonts
contain PostScript glyphs, their names may be completely arbitrary; it does
not even matter whether the font was mapped to Unicode or used a legacy or
proprietary encoding.
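
As a rough illustration of those naming guidelines, here is a minimal
Python sketch of a simplified Adobe-Glyph-List-style lookup (AGL_SUBSET and
glyph_name_to_text are names I made up for the example, not anything from
the spec):

    # Simplified sketch of mapping PostScript glyph names back to Unicode
    # text, in the spirit of the "uniXXXX" and underscore-joined conventions.
    AGL_SUBSET = {"t": "t", "i": "i", "fi": "\ufb01", "space": " "}

    def glyph_name_to_text(name):
        """Map a glyph name such as 't_i', 'uni0074' or 'fi' to text."""
        parts = []
        for component in name.split("_"):          # 't_i' -> ['t', 'i']
            if component.startswith("uni") and len(component) >= 7:
                # 'uni0074' -> U+0074 (the real rule allows several
                # 4-hex-digit groups after "uni")
                parts.append(chr(int(component[3:7], 16)))
            elif component in AGL_SUBSET:
                parts.append(AGL_SUBSET[component])
            else:
                parts.append("\ufffd")             # unknown name, nothing to recover
        return "".join(parts)

    print(glyph_name_to_text("t_i"))      # -> 'ti'
    print(glyph_name_to_text("uni019F"))  # -> 'Ɵ'

A font whose ligature glyph is named "t_i" lets an extractor recover "ti";
a glyph with an arbitrary name gives it nothing to work with.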

If you see a "Ɵ" when copy-pasting from the PDF, it's because the font used
to produce it did not follow these guidelines (or did not specify any glyph
name at all, in which case text extraction falls back on a sort of OCR
algorithm that attempts to decipher the glyph: the "ti" ligature is
visually extremely close to "Ɵ", and an OCR engine has great difficulty
distinguishing them, unless it also uses some linguistic dictionary
searches and some hints about the script of the surrounding characters to
improve the guess).
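
You can check what the text layer of such a PDF really contains without
going through the clipboard; here is a small sketch using the third-party
pdfminer.six package (the file name and the choice of U+019F as the
character to flag are just assumptions for this example):

    # Extract the PDF's text layer and flag lines containing U+019F,
    # the code point the "ti" ligature was mis-encoded as in this thread.
    from pdfminer.high_level import extract_text

    text = extract_text("document.pdf")   # hypothetical file name
    for lineno, line in enumerate(text.splitlines(), 1):
        if "\u019f" in line:
            print("line %d: suspicious U+019F in %r" % (lineno, line))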

Note that PDFs (or DjVu files) are not required to contain any text at all;
they may just embed a scanned and compressed bitmap image. If you want to
see how wrong an OCR can be, look at how it fails with lots of errors, for
example in the transcription projects for Wikibooks that work with scanned
bitmaps of old books: OCR is just a helper, and there's still a lot of work
to correct what has been guessed and re-encode the correct text. Even
though humans are smarter than OCR, this is a lot of work to perform
manually: encoding the text of a single scanned old book still takes one or
two months for an experienced editor, and there are still many errors to be
reviewed later by someone else.

Most PDFs were not created with the idea of later decoding their rendered
text. In fact they were intended to be read or printed "as is", including
their styles, colors, and decorated fonts everywhere, or text laid over
photos. They were even created to be non-modifiable and then used for
archival.

Some PDF tools will also strip additional metadata from the PDF, such as
the original fonts used; instead, these PDFs locally embed pseudo-fonts
containing sets of glyphs taken from various fonts (in mixed styles), in
random order, sorted by frequency of use in the document, or sorted by
order of occurrence in the original text. These embedded fonts are
generated on the fly to contain only the glyphs needed for the document.
When those embedded fonts are generated, a compression step drops lots of
things from the original font, including metadata such as the original
PostScript glyph names.
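
If you want to see whether such an embedded subset still carries usable
names, you can dump its glyph names with the third-party fontTools package;
a small sketch ("extracted_font.ttf" is a placeholder for a font pulled out
of a PDF):

    # List the glyph names of a (possibly subsetted) TrueType/OpenType font.
    from fontTools.ttLib import TTFont

    font = TTFont("extracted_font.ttf")   # placeholder path
    for name in font.getGlyphOrder():
        print(name)
    # A well-behaved font prints names like 't_i' or 'uni0074'; a stripped
    # subset typically prints opaque names like 'glyph00042', leaving a text
    # extractor with nothing from which to recover the original encoding.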