Re: identifying greek characters in an old book

From: Asmus Freytag (
Date: Wed Oct 19 2005 - 03:17:36 CST


    On 10/18/2005 9:00 PM, Christopher Fynn wrote:

    > Raymond Mercier wrote:
    >> Unicode is meant for the printed text, is it not?
    > Not really - at least not without an additional level of markup or
    > formatting. Unicode is specifically meant for *plain* text. Printed
    > text kind of implies formatting or rich text.
    Unicode is meant for unambiguously representing text content on computers.

    The vast majority of computerized texts are indeed
    computer-representations of printed material, or material that can be
    rendered using the same typography as printed material.

    The distinction that Raymond is aiming at, between texts that use the
    (typically more settled) typography of printed materials and texts that
    show the much wider variations common to manuscripts, is a valid one.

    The representation of actual printed documents does of course require
    additional formatting information. At the minimum, it would have to
    include a font style, a font size, line-spacing and margin information.
    While true, this is not what's interesting in this context.

    The question here is how to deal with the representation of variable
    appearance of what otherwise would be the 'same' text. Where these
    variations are fully regular, as in selecting language- or
    script-specific forms or punctuation, or selecting positional forms for
    Arabic shaping, or are fully defined by rules of typography, like
    ligatures in many (but not all) languages, deferring to the rendering
    or display engine (together with some overall style information) is
    clearly the right thing.
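
    As a rough illustration of the point above, the following Python sketch
    uses the standard unicodedata module to show that the positional forms
    of an Arabic letter, and the Latin "fi" ligature, all fold back to the
    ordinary character codes under compatibility normalization (NFKC). The
    choice of the letter BEH and the helper name fold_presentation_forms
    are assumptions made for this example only; the separately encoded
    presentation forms exist for compatibility, and ordinary text is
    expected to store the base letter and leave shaping to the renderer.

```python
import unicodedata

# Base letter as it would normally be stored in plain text.
BEH = "\u0628"  # ARABIC LETTER BEH

# Compatibility presentation forms of BEH (isolated, final, initial, medial).
# Chosen purely for illustration.
BEH_PRESENTATION_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def fold_presentation_forms(text: str) -> str:
    """Map compatibility presentation forms back to their base characters."""
    return unicodedata.normalize("NFKC", text)


for form in BEH_PRESENTATION_FORMS:
    folded = fold_presentation_forms(form)
    print(f"U+{ord(form):04X} {unicodedata.name(form)} -> U+{ord(folded):04X}")

# The Latin "fi" ligature (U+FB01) is likewise a matter for the renderer:
# NFKC folds it back to the two ordinary letters.
print(fold_presentation_forms("\uFB01") == "fi")
```

    Every positional form maps to the single code U+0628, which is the sense
    in which the choice of glyph is left to the display engine rather than
    recorded in the text itself.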

    For isolated variants, the UTC has consistently supported the addition
    of explicit character codes, as opposed to requiring the use of some
    generic character code with markup for variant selection. Such markup is
    not really generic and acts more like a code extension mechanism (for
    example, entity definitions in HTML). That raises portability issues and
    issues of semantic processing of text. Therefore, avoiding such markup
    is clearly the right thing.
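
    To make the idea of explicit character codes for isolated variants
    concrete, here is a small Python sketch that prints a few Greek letters
    next to variant forms that have their own code points. The particular
    pairs are picked for illustration and are not named in this message;
    the helper describe is likewise hypothetical.

```python
import unicodedata

# Illustrative pairs (an assumption of this sketch, not taken from the
# message): a regular Greek letter and a variant form that is encoded as
# a separate character rather than selected through markup.
PAIRS = [
    ("\u03B8", "\u03D1"),  # theta / theta symbol
    ("\u03C6", "\u03D5"),  # phi / phi symbol
    ("\u03BA", "\u03F0"),  # kappa / kappa symbol
    ("\u03C1", "\u03F1"),  # rho / rho symbol
]


def describe(ch: str) -> str:
    """Return the code point and Unicode character name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for base, variant in PAIRS:
    print(describe(base), "|", describe(variant))
```

    Because each variant has its own code, a plain-text search or comparison
    can tell the two forms apart without having to interpret any markup.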

    Limiting this support to forms attested in print is pragmatic: the
    number of variants is much smaller, and their use and appearance are
    much more settled than for manuscripts. Beyond the variations in
    particular forms, manuscripts may exhibit many other variations (in line
    width, line spacing, etc.) that may or may not need to be modeled
    when a particular text is computerized for a particular purpose.

    Even if it were a better solution to support such modeling directly in
    the Unicode Standard (and it isn't), it would present the problem that
    the standardization process might well not be able to cope with the pace
    at which exceptional documents are likely to be discovered, each of
    which would require additional support.

    Making the pragmatic choice that the printed tradition represents
    sufficiently generic sets of variation to match the task of
    standardization is what Raymond had in mind.

