From: Ed Trager (firstname.lastname@example.org)
Date: Mon Nov 12 2007 - 11:57:49 CST
Canadiana ( http://www.canadiana.org/ ) is a good example of how one
group is tackling the problem of making old and very old printed texts
available to everyone via the internet. Early Canadiana Online (ECO)
is a digital library providing access to 2,612,532 pages of Canada's
printed heritage. It features works published from the time of the
first European settlers up to the early 20th Century.
According to Canadiana's technical information page, TIFF scans from
old books are converted to GIF and PDF formats on the fly for delivery
to online users. The scanned pages are also OCR'ed using a completely
automated process. Automated OCR sounds *scary* to me, but that's
what they do. Texts are primarily French, English, or bilingual
French-English. The bilingual texts are apparently OCR'ed twice and
both results stored in the database.
I'm not sure if the OCR'ed results are only used to enable
context-based searching of the texts or what exactly. One has to
assume that the OCR software will produce computerized plain text
that is full of errors (that's why I said it is "scary" ;-) ). There
is not only the problem of old books using old typefaces that may not
be very "clear" to the OCR software, but also the problem that the
French and English languages, and accepted spelling in these
languages, have also changed significantly since the publication of
many of these books.
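To make the search problem concrete, here is a minimal sketch of tolerant matching against error-prone OCR text, using only Python's standard `difflib`. It is not how Canadiana searches its texts (I don't know what they do); the function name `fuzzy_find`, the cutoff value, and the sample string are all assumptions made up for illustration. The sample imitates a common OCR misreading in which the old long "s" comes out as "f".

```python
import difflib
import re


def fuzzy_find(term, ocr_text, cutoff=0.6):
    """Return the distinct words in ocr_text that are similar to term.

    Hypothetical helper: similarity is difflib's ratio, so a cutoff of
    0.6 tolerates roughly one or two wrong letters in a six-letter word.
    """
    words = sorted(set(re.findall(r"\w+", ocr_text.lower())))
    # get_close_matches requires n > 0, even for an empty word list.
    return difflib.get_close_matches(
        term.lower(), words, n=max(1, len(words)), cutoff=cutoff
    )


# Invented OCR output in which long s was misread as "f".
sample = "La chaffe des beftes sauvages"
matches = fuzzy_find("chasse", sample)
```

A lower cutoff catches more misreadings and old spellings but also returns more false hits, so the value would need tuning against real pages.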
For example, here's a page from R.P. Louis Hennepin's "Nouvelle
découverte d'un très grand pays situé dans l'Amérique, entre le
Nouveau Mexique et la mer glaciale" (Chez Guillaume Broedelet, Utrecht,
1697) describing with some artistic license and self-grandeur the
natural abundance around the Straits of Detroit (Michigan) at that time.
I believe this page has some examples of rather old French verb forms,
and of course one sees the old form of lower case "s" which
facilitates the use of interesting ligatures for "st" and "ss", etc.
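If a transcription kept the old letterforms as distinct Unicode characters (long s is U+017F, and the "long s t" and "st" ligatures are U+FB05 and U+FB06), they could be folded to modern letters before indexing. The sketch below assumes such a transcription exists, which the source does not say; `fold_old_letterforms` is a name invented here. It relies on Unicode NFKC normalization from Python's standard `unicodedata` module.

```python
import unicodedata


def fold_old_letterforms(text):
    """Fold long s and Latin ligatures to plain modern letters.

    NFKC applies Unicode compatibility mappings, which turn U+017F
    (long s) into "s" and the U+FB05 / U+FB06 ligatures into "st",
    while leaving accented letters such as "é" intact.
    """
    return unicodedata.normalize("NFKC", text)


# "cha" + long s + long s + "e" folds to the modern spelling.
folded = fold_old_letterforms("cha\u017f\u017fe")
```

Folding is only for the search index; the displayed text or image would keep the original forms.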
Having a resource like Early Canadiana Online is a fantastic boon to
scholars and casual historical researchers alike. And I am glad that
the original scanned images are made available because I would be very
wary if I had access only to the OCR'ed text.
Reflecting upon this online resource, there are a number of
technological musings that one can ponder. For example, wouldn't it
be neat if it were possible to search, Google-style, on a search term
in this book --say "chasse"-- and then have the result set appear with
this original scan of the page but now with the word "chasse"
highlighted in color just as if we were searching a plain-text digital
document?
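One way such highlighting could work, assuming the OCR step records a bounding box for each recognized word (an assumption on my part; Canadiana's technical page is not quoted as saying so), is to look up the boxes of the matching words and draw colored rectangles over the scan at those coordinates. The `OcrWord` record and `highlight_boxes` function below are hypothetical names for a rough sketch of the lookup step only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrWord:
    """One recognized word and its position on the scan, in pixels."""
    text: str
    x: int
    y: int
    width: int
    height: int


def highlight_boxes(words, term, pad=2):
    """Return (left, top, right, bottom) rectangles for words matching term.

    Matching ignores case and surrounding punctuation; pad enlarges each
    rectangle slightly so the highlight does not clip the letters.
    """
    needle = term.casefold()
    boxes = []
    for w in words:
        if w.text.strip(".,;:!?\"'()").casefold() == needle:
            boxes.append(
                (w.x - pad, w.y - pad, w.x + w.width + pad, w.y + w.height + pad)
            )
    return boxes
```

The rectangles could then be painted as a translucent overlay on the GIF or PDF page image, so the reader sees the original typography with the hits marked.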
It would be interesting to contact the folks who have brought us
Canadiana.org and see what kinds of technical obstacles they have
faced, and what solutions they have in the works for the future.
-- Ed Trager
On Nov 11, 2007 3:17 PM, John Hudson <email@example.com> wrote:
> As a type designer, I'm mainly involved in the business of making text visible and trying
> to get it to display correctly. I'm going to be speaking at a conference in January* and
> as background for my talk I'm interested in cataloguing the other things that people do,
> or want to do, with computerised text. It strikes me that many of these things take place
> in the realm in which text is 'invisible', i.e. prior to or independent of display. Some
> of these things are obvious in a general sense (spelling and grammar checking, sorting,
> comparing), but I'd like to come up with some specific and interesting examples --
> particularly of a scholarly nature --, and also would like to know of any other things that
> people 'do with text' beyond displaying it. I suspect that there are things I have not
> even imagined within my narrow focus.
> John Hudson
> * http://www.bibletechconference.com/
> Tiro Typeworks www.tiro.com
> Gulf Islands, BC firstname.lastname@example.org
> A bilabial velaric ingressive stop is essentially a kiss.
> -- Pullum & Ladusaw, _Phonetic symbol guide_