Re: Current support for N'Ko from Andrew Cunningham on 2014-09-29 (Unicode Mail List Archive)

From: Andrew Cunningham <lang.support_at_gmail.com>
Date: Tue, 30 Sep 2014 06:13:54 +1000

On 30/09/2014 4:11 AM, "David Starner" <prosfilaes_at_gmail.com> wrote:
>
> On Fri, Sep 26, 2014 at 4:10 PM, Andrew Cunningham
> <lang.support_at_gmail.com> wrote:
> > * NEVER try to copy and paste text from PDF. It is a preprint format and
> > should be treated as such.
>
>
> I'd try and cut and paste from print if I could. People are going to
> cut and paste from anything if it saves them a little time. If you
> disable cut and pasting from PDF, those who have easy access to OCR
> may just print to image and OCR it to cut and paste. To say don't do
> this is unproductive.
>

OK, what I should say is that in the best-case scenario for complex-script
text you can copy and paste, and then post-process the extracted text to
recover the actual text. Post-processing may involve reordering characters or
systematic conversion of glyph sequences.
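
To make the post-processing described above concrete, here is a minimal
Python sketch. It is an illustration under stated assumptions, not a general
solution: it assumes Devanagari text that a PDF extractor returned in visual
(glyph) order, and the glyph table GLYPH_MAP, with its private-use code point
U+E001, is hypothetical. A real table depends entirely on the font and the
PDF generator used.

```python
import re

# Hypothetical table: codes found in the extracted text mapped to the
# Unicode sequences they stand for. U+E001 is an assumed private-use glyph
# for the conjunct KA + VIRAMA + SSA; real tables are font-specific.
GLYPH_MAP = {
    "\uE001": "\u0915\u094D\u0937",
}

# DEVANAGARI VOWEL SIGN I is drawn before its consonant but stored after it.
PREBASE_VOWEL = "\u093F"
# A consonant cluster: zero or more (consonant + virama), then a consonant.
CLUSTER = "(?:[\u0915-\u0939]\u094D)*[\u0915-\u0939]"
VISUAL_ORDER = re.compile("(" + PREBASE_VOWEL + ")(" + CLUSTER + ")")


def convert_glyphs(text):
    """Systematic conversion of glyph codes to Unicode sequences."""
    for glyph, replacement in GLYPH_MAP.items():
        text = text.replace(glyph, replacement)
    return text


def reorder_prebase_vowel(text):
    """Move a vowel sign extracted before its cluster to after it.

    Only safe on text known to be in visual order; applied to text that is
    already in logical order, it would corrupt it.
    """
    return VISUAL_ORDER.sub(r"\2\1", text)


def postprocess(extracted):
    """Convert glyph codes first, then restore logical character order."""
    return reorder_prebase_vowel(convert_glyphs(extracted))
```

Glyph conversion runs first so that the reordering step sees whole consonant
clusters rather than font-specific codes.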

In the worst-case scenario you get utter garbage from which the text of the
PDF file cannot be reconstructed.

Searching and indexing is even more problematic.

Honestly, for the languages I work with it would be quicker and more accurate
in many cases to use OCR (even at 80% accuracy) than to cut and paste from PDF.

As I said in my previous email, results and effectiveness will differ
depending on the fonts and the PDF generator used.

PDF was designed for preprint, not archival purposes.

> --
> Kie ekzistas vivo, ekzistas espero.
> _______________________________________________
> Unicode mailing list
> Unicode_at_unicode.org
> http://unicode.org/mailman/listinfo/unicode

Received on Mon Sep 29 2014 - 15:15:00 CDT
