Re: elided base character or obliterated character (was: Hebrew composition model, with cantillation marks)

From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Thu Nov 06 2003 - 08:38:40 EST

    On Wed, 5 Nov 2003 12:24:00 +0100, "Philippe Verdy" wrote:
    >
    > The obliterated character needed for paleolithic studies, or to encode any
    > text in which a character is not recognizable, already exists: isn't it
    > the REPLACEMENT CHARACTER?
    >

    The problem of how to represent missing/obliterated characters in Unicode when
    transcribing manuscript/printed texts and inscriptions, etc. has always
    perplexed me.

    U+FFFD [Replacement Character] is "used to replace an incoming character whose
    value is unknown or unrepresentable in Unicode", and is definitely not the
    correct character to use to represent a missing or obliterated character in a
    non-electronic source text.
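
    To illustrate the intended use of U+FFFD, here is a minimal Python sketch
    (the byte string is an invented example, not taken from any real text). It
    shows the character being produced by a decoder for malformed incoming
    data, which is a different situation from a transcriber marking a gap in a
    physical source:

```python
# Assumed example input: 0xFF can never occur in well-formed UTF-8,
# so the decoder cannot map it to any character.
raw = b"abc\xffdef"

# With errors="replace", Python's decoder substitutes U+FFFD
# for the byte it cannot interpret.
decoded = raw.decode("utf-8", errors="replace")

print(decoded == "abc\ufffddef")
```

    Here U+FFFD records a failure of the conversion process, not a property of
    the original text.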

    For Chinese, the standard glyph for a missing/obliterated/unclear ideograph is a
    full-width hollow square (i.e. the same size as a CJK ideograph). This glyph is
    very common in modern printed Chinese texts, from scholarly editions of ancient
    texts unearthed from 2,000-year-old tombs to popular typeset reprints of
    19th-century novels. Several examples of the usage of this glyph in modern
    printed texts from the PRC can be found at
    http://uk.geocities.com/babelstone1357/CJK/missing.html

    The problem is how to represent this glyph in electronic texts. From browsing
    the internet, there seem to be two ways of representing this "missing
    ideograph" glyph, both unsatisfactory:

    1. Using U+25A1 [WHITE SQUARE] (although I suppose any of the other white
    square graphic symbols encoded in Unicode, such as U+25A2, U+25FB or U+25FD,
    could also be used). The problems with this character are:
    a) it has the wrong character properties for use within running CJK text.
    b) with CJK fonts such as SimSun, U+25A1 is rendered at the same height and
    width as a CJK ideograph, but with non-Chinese fonts such as Arial Unicode MS,
    U+25A1 may be rendered much smaller than a CJK ideograph, which looks totally
    wrong.

    2. Using U+56D7 [a CJK ideograph, rarely used other than as a radical =
    U+2F1E], which has the right character properties, and renders at the correct
    size; but the glyph shape may not be completely square depending upon the font
    style, and basically it is just the wrong character for the job.
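
    The difference in character properties between these candidates can be
    inspected with Python's standard unicodedata module. This is only a sketch:
    the names CANDIDATES and describe are made up for illustration, and it
    assumes that general category and East Asian width are the properties of
    interest. The module does not expose line-breaking properties, and the
    exact values depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Candidate characters discussed above (the dict itself is illustrative).
CANDIDATES = {
    "U+FFFD": "\uFFFD",  # REPLACEMENT CHARACTER
    "U+25A1": "\u25A1",  # WHITE SQUARE
    "U+56D7": "\u56D7",  # CJK ideograph used as a stand-in
}


def describe(ch):
    """Return a few Unicode properties of a single character."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),
        "east_asian_width": unicodedata.east_asian_width(ch),
    }


for label, ch in CANDIDATES.items():
    print(label, describe(ch))

# The Kangxi radical U+2F1E is a compatibility equivalent of U+56D7,
# so NFKC normalization folds it into the ideograph.
radical_folds = unicodedata.normalize("NFKC", "\u2F1E") == "\u56D7"
print("U+2F1E folds to U+56D7 under NFKC:", radical_folds)
```

    U+25A1 is reported as a symbol (category So) with ambiguous East Asian
    width, whereas U+56D7 is reported as a letter (category Lo) with wide East
    Asian width, which is consistent with the two problems described in the
    list above.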

    It would be extremely useful to have a dedicated Unicode character for "missing
    CJK ideograph" with the right character properties, and I have considered making
    a proposal for such a character. I personally have web pages which transcribe
    texts with missing/obliterated ideographs, where such a character is desperately
    needed. But I have hesitated: if there really is such a great need for it, then
    why does it not already exist in Unicode or in pre-existing Chinese encoding
    standards?

    Andrew
