Re: visible glyphs for U+2062 and similar characters

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat May 17 2003 - 21:54:54 EDT


    From: "Jungshik Shin" <jshin@mailaps.org>
    To: "Philippe Verdy" <verdy_p@wanadoo.fr>
    Sent: Sunday, May 18, 2003 2:50 AM
    Subject: Re: visible glyphs for U+2062 and similar characters

    >
    > On Mon, 12 May 2003, Philippe Verdy wrote:
    >
    > > Couldn't those invisible characters be allowed to be rendered with
    > > a glyph directly from a Unicode string, by using a variant selector
    > > character after the normally invisible character?
    >
    > That's one possibility for _plain_ text. For 'rich text',
    > we can have a kind of 'markup' to indicate that we want invisible to be
    > rendered with visible glyph. For instance, a new CSS property
    > can be proposed to turn the invisible to visible.

    There is another possibility that keeps the semantics of canonical strings for interchange: editors could use an assigned non-character (which can safely be removed at interchange interfaces such as file saving) as a prefix base for the following diacritic or invisible character that needs to be rendered specially. For example, an invisible character could be shown as a dotted box containing a symbol, and a combining mark could be shown on a dotted circle when it appears isolated in a degenerate form (i.e. at the beginning of a string, or after a control or format character, as specified in Unicode).

    If interchange of glyphs for combining marks must be preserved, there already exists a solution: put the combining mark after a space, so that it can (most often) form a canonical composition. However, the typical glyph displayed in this case will lack the dotted circle and will just create a spacing mark. That is where a variant selector after the space and the non-spacing combining mark would be better, because assigned variant selectors are kept by canonical forms and interchanges, unlike assigned non-characters.

    So this could create the following framework, where NC is a privately used but assigned non-character, recognized by the renderer as a rendering behavior hint that alters the following character:

    - to display invisible characters temporarily (such as ZWNJ here):
      generate NC+ZWNJ; the renderer selects the alternate set of visible glyphs, displays the alternate dotted-square glyph associated with ZWNJ, then returns to its normal state. This will be saved/interchanged canonically as just ZWNJ because NC is discarded.

    - to display a glyph temporarily for combining marks (such as a combining ACUTE accent here):
      generate NC+ACUTE; the renderer selects the alternate set of visible glyphs, displays the alternate glyph of an acute accent on a dotted circle for the following ACUTE character, then returns to its normal state. This will be saved/interchanged canonically as just the combining ACUTE because NC is discarded. In a "Show decomposed" mode, all characters would be canonically decomposed to their NFD form before an NC character is generated in front of every combining character. On save/interchange, a sequence of base+NC+combining character would be recomposed canonically, as NC characters would be ignored and deleted before canonical recomposition (if needed).
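
    The two cases above could be sketched in an editor roughly as follows. This is only an illustration under stated assumptions: U+FDD0 is picked arbitrarily as the NC (no specific code point is proposed here), the helper names are hypothetical, and every format or combining character is marked rather than only the ones a user asks to see.

```python
import unicodedata

# Assumption: U+FDD0 stands in for "NC"; the message names no code point.
NC = "\uFDD0"

def needs_marker(ch: str) -> bool:
    """True for format characters (Cf) and combining marks (Mn, Mc, Me)."""
    return unicodedata.category(ch) in {"Cf", "Mn", "Mc", "Me"}

def mark_for_display(text: str, decompose: bool = False) -> str:
    """Editor-side view: prefix each invisible or combining character with NC.

    With decompose=True the text is first put in NFD ("Show decomposed" mode).
    """
    if decompose:
        text = unicodedata.normalize("NFD", text)
    return "".join(NC + ch if needs_marker(ch) else ch for ch in text)

def strip_for_interchange(text: str) -> str:
    """Drop every NC, then recompose canonically (NFC) before saving."""
    return unicodedata.normalize("NFC", text.replace(NC, ""))
```

    For "caf\u00e9\u200c" (e-acute followed by ZWNJ), mark_for_display(..., decompose=True) yields "cafe", NC, U+0301, NC, U+200C, and strip_for_interchange returns the original string. Note that the sketch always recomposes to NFC on save, which would alter text that was not in NFC to begin with; a real editor would recompose only "if needed".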

    > > For me the invisibility of the character is a property of the glyph,
    > > but not of the Unicode character properties itself, which does not
    > > mandate any glyph, should it be only an invisible one with zero width.
    >
    > I'm less sure of this. Why don't you write to the list and see what
    > others think.

    Can this be demonstrated? Is there a semantic in the character properties, or in some other standardized part of Unicode or its annexes, that says that zero-width characters must not have visible glyphs, given that Unicode only shows "representative glyphs" which are not normative but informative (the only prohibition seems to be against showing a glyph that could legitimately be associated with a distinct coded character and dissociated from the original one)?

    > > So tailoring the character with explicit variant selector to select
    > > a font variant with a known semantic would avoid requiring font GSUB
    > > lookup, which is a specific feature of OpenType fonts.
    >
    > Perhaps, I was not clear. Using GSUB feature of opentype font is just
    > one way of *implementing* either what you proposed for plain text (with
    > VS) or a new CSS property I wrote about above. As such, it's completely
    > left up to implementors as to which font technology/rendering methods
    > to use.

    I do agree, but clearly the rich text case is always handled internally by generating runs of Unicode characters that will be rendered as a string using a single interface to the font renderer. So the font renderer needs to be prepared to receive Unicode strings that match the expected glyphs for the input code units. CSS can therefore be a convenient way to interchange such information if needed, but it does not solve the internal problem of rendering it.

    The main problem with solutions based on NC is that the font file cannot be prebuilt and interfaced to contain the associated glyphs, because it will need a way to specify how these glyphs can be mapped from an input Unicode string. As the file is interchanged, the mapping between sequences of code units and glyphs must also be interchangeable, a property not shared by NC characters.

    So unless Unicode is modified to include "control pictures" codepoints for all defined invisible characters, it will be difficult to create such an interchangeable font.
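
    To make the gap concrete, here is a small sketch of what the existing Control Pictures block (U+2400..U+2421) covers. The function name is hypothetical; the mapping itself follows the block layout, where the symbols for the C0 controls and SPACE sit at U+2400 plus the code point, and the symbol for DELETE is U+2421.

```python
def control_picture(ch: str):
    """Return the Control Pictures symbol for ch, or None if there is none."""
    cp = ord(ch)
    if cp <= 0x20:          # C0 controls and SPACE map to U+2400..U+2420
        return chr(0x2400 + cp)
    if cp == 0x7F:          # DELETE maps to U+2421 SYMBOL FOR DELETE
        return "\u2421"
    return None             # e.g. ZWNJ (U+200C) or INVISIBLE TIMES (U+2062)
```

    A TAB maps to U+2409 SYMBOL FOR HORIZONTAL TABULATION, but ZWNJ and U+2062 have no counterpart, so a font has no coded character on which to hang a visible glyph for them.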

    The only way to solve this problem would be to assign in Unicode a format control character that means "Show Hidden" and changes the behavior of the codepoint that follows it. It could be followed by any codepoint, not only invisible characters, but it would have a rendering impact only on invisible characters or on some visible characters like whitespace. The "Show Hidden" character would have category "Cf", but would be a "non-character" that can safely be dropped on interchange or during canonical or compatibility (de/re)composition. This non-character, always used temporarily, would act as an escape from the normal non-spacing/invisible/combining nature of a codepoint, so that the codepoint becomes spacing, visible and not combining, and is displayed with a representative glyph. Its effect on itself would be void (this character will not need any representative glyph even in editors, as it would always be lost during interchange, so it would only be generated internally for the application-level rendering object, only for the purpose of interfacing with fonts or layout engines that would know how to interpret NC+C sequences into a visible glyph).
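
    On the layout-engine side, interpreting NC+C sequences might look like the sketch below. It is only an illustration: U+FDD0 is an arbitrary stand-in for the proposed "Show Hidden" character, and the (mode, character) pairs are a hypothetical interface, not something an existing font technology defines.

```python
from typing import List, Tuple

# Assumption: U+FDD0 stands in for the proposed "Show Hidden" character.
SHOW_HIDDEN = "\uFDD0"

def layout_items(text: str) -> List[Tuple[str, str]]:
    """Split text into (mode, character) pairs for a hypothetical font interface.

    mode is "visible" when the character was preceded by SHOW_HIDDEN and
    "normal" otherwise. SHOW_HIDDEN itself never produces an item.
    """
    items = []
    show_next = False
    for ch in text:
        if ch == SHOW_HIDDEN:
            show_next = True      # applies to the next real character only
            continue
        items.append(("visible" if show_next else "normal", ch))
        show_next = False
    return items
```

    The marker never yields an item of its own, so a repeated or trailing marker is harmless, which is one way to read "its effect on itself would be void".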


