Re: Things to do with text

From: Ed Trager (ed.trager@gmail.com)
Date: Mon Nov 12 2007 - 11:57:49 CST

    Hi, John,

    Canadiana ( http://www.canadiana.org/ ) is a good example of how one
    group is tackling the problem of making old and very old printed texts
    available to everyone via the internet. Early Canadiana Online (ECO)
    is a digital library providing access to 2,612,532 pages of Canada's
    printed heritage. It features works published from the time of the
    first European settlers up to the early 20th Century.

    According to Canadiana's technical information page, TIFF scans from
    old books are converted to GIF and PDF formats on the fly for delivery
    to online users. The scanned pages are also OCR'ed using a completely
    automated process. Automated OCR sounds *scary* to me, but that's
    what they do. Texts are primarily French, English, or bilingual
    French-English. The bilingual texts are apparently OCR'ed twice and
    both results stored in the database.

    I'm not sure whether the OCR'ed results are used only to enable
    context-based searching of the texts, or for something more. One has
    to assume that the OCR software will produce plain text that is full
    of errors (that's why I said it is "scary" ;-) ). Not only do old
    books use old typefaces that may not be very "clear" to the OCR
    software, but the French and English languages, and the accepted
    spelling in each, have changed significantly since many of these
    books were published.
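
    To make the searching problem concrete, here is a minimal sketch of
    error-tolerant lookup using Python's standard difflib module. This is
    not how Canadiana does it (I don't know their method); the function
    name fuzzy_find, the cutoff value, and the sample tokens are all
    assumptions made up for illustration. The token "chaffe" imitates a
    common OCR misreading of the long s as "f".

```python
import difflib
from typing import List


def fuzzy_find(term: str, ocr_words: List[str], cutoff: float = 0.6) -> List[str]:
    """Return OCR tokens that approximately match `term`, best match first.

    `cutoff` is a similarity threshold between 0 and 1; the default here is
    an arbitrary illustrative value, not a tuned one.
    """
    lowered = [w.lower() for w in ocr_words]
    return difflib.get_close_matches(term.lower(), lowered, n=10, cutoff=cutoff)


# Hypothetical OCR output: invented tokens, not transcribed from a real scan.
tokens = ["chaffe", "beftes", "pays", "chasse"]
matches = fuzzy_find("chasse", tokens)
print(matches)
```

    A lower cutoff catches more misreadings but also returns more
    unrelated words, so any real system would have to tune it against
    known-good transcriptions.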

    For example, here's a page from R.P. Louis Hennepin's "Nouvelle
    découverte d'un très grand pays situé dans l'Amérique, entre le
    Nouveau Mexique et la mer glaciale" (Chez Guillaume Broedelet,
    Utrecht, 1697) describing, with some artistic license and
    self-aggrandizement, the natural abundance around the Straits of
    Detroit (Michigan) at that time:

          http://www.canadiana.org/ECO/PageView?id=1ff0c1f9dfb54e32&display=35676+0207

    I believe this page has some examples of rather old French verb forms,
    and of course one sees the old long form of lower case "s" (ſ), which
    facilitates the use of interesting ligatures for "st" and "ss", etc.
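
    If a transcription preserves these letterforms as Unicode characters,
    they can be folded to plain letters before indexing. The long s
    (U+017F) and the long-s-t ligature (U+FB05) both carry compatibility
    decompositions, so NFKC normalization maps them to "s" and "st". A
    small sketch follows; the function name fold_old_letterforms and the
    sample words are my own inventions for illustration, not text
    transcribed from the Hennepin page.

```python
import unicodedata


def fold_old_letterforms(text: str) -> str:
    """Fold compatibility characters (long s, st/ff-style ligatures) to plain letters.

    NFKC also folds other compatibility characters (for example full-width
    forms), which may or may not be wanted in a real index.
    """
    return unicodedata.normalize("NFKC", text)


# Invented sample: "chasse" written with two long s, "est" with the long-s-t ligature.
sample = "la cha\u017f\u017fe e\ufb05 abondante"
print(fold_old_letterforms(sample))
```

    Note that this only helps when the OCR engine actually emits the long
    s; an engine that misreads it as "f" needs a different remedy.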

    Having a resource like Early Canadiana Online is a fantastic boon to
    scholars and casual historical researchers alike. And I am glad that
    the original scanned images are made available because I would be very
    wary if I had access only to the OCR'ed text.

    Reflecting upon this online resource, there are a number of
    technological questions one can ponder. For example, wouldn't it be
    neat if it were possible to search, Google-style, for a term in this
    book --say "chasse"-- and then have the results show the original
    scan of the page, but with the word "chasse" highlighted in color,
    just as if we were searching a plain-text digital resource?
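
    One way this could work, assuming the OCR step records where each
    recognized word sits on the page image: look the search term up in
    the OCR text, then draw a colored rectangle over the scan at each
    matching word's position. The sketch below covers only the lookup
    half. The OcrWord record, its pixel fields, and the highlight_boxes
    function are hypothetical; I don't know what Canadiana's database
    actually stores.

```python
from dataclasses import dataclass
from typing import List, Tuple

PUNCTUATION = ".,;:!?\"'()"


@dataclass(frozen=True)
class OcrWord:
    """One recognized word plus its bounding box on the scan (hypothetical layout)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def highlight_boxes(term: str, words: List[OcrWord]) -> List[Tuple[int, int, int, int]]:
    """Return (left, top, width, height) for every word matching `term`.

    Matching ignores case and surrounding punctuation; it is exact otherwise.
    """
    needle = term.casefold()
    return [
        (w.left, w.top, w.width, w.height)
        for w in words
        if w.text.strip(PUNCTUATION).casefold() == needle
    ]


# Invented page fragment with made-up pixel coordinates.
page = [
    OcrWord("La", 10, 20, 30, 18),
    OcrWord("chasse,", 45, 20, 80, 18),
    OcrWord("abondante", 130, 20, 120, 18),
]
print(highlight_boxes("Chasse", page))
```

    A viewer would then overlay semi-transparent rectangles at the
    returned coordinates on top of the page image.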

    It would be interesting to contact the folks who have brought us
    Canadiana.org and see what kinds of technical obstacles they have
    faced, and what solutions they have in the works for the future.

    -- Ed Trager

    On Nov 11, 2007 3:17 PM, John Hudson <john@tiro.ca> wrote:
    > As a type designer, I'm mainly involved in the business of making text visible and trying
    > to get it to display correctly. I'm going to be speaking at a conference in January* and
    > as background for my talk I'm interested in cataloguing the other things that people do,
    > or want to do, with computerised text. It strikes me that many of these things take place
    > in the realm in which text is 'invisible', i.e. prior to or independent of display. Some
    > of these things are obvious in a general sense (spelling and grammar checking, sorting,
    > comparing), but I'd like to come up with some specific and interesting examples --
    > particularly of a scholarly nature --, and also would like to know of any other things that
    > people 'do with text' beyond displaying it. I suspect that there are things I have not
    > even imagined within my narrow focus.
    >
    > John Hudson
    >
    > * http://www.bibletechconference.com/
    >
    > --
    >
    > Tiro Typeworks www.tiro.com
    > Gulf Islands, BC tiro@tiro.com
    >
    > A bilabial velaric ingressive stop is essentially a kiss.
    > -- Pullum & Ladusaw, _Phonetic symbol guide_
    >
    >
    >


