RE: Control picture glyphs (was Re: Apostrophes at www.unicode.org)

From: Peter Constable (petercon@microsoft.com)
Date: Sun Aug 26 2007 - 20:45:42 CDT


    Another correction: I wasn’t paying close attention to “+” and “-”. White_Space and annotation characters are *excluded* from default ignorables. (I understand White_Space; I don’t quite understand why the annotation characters are excluded. Anyway…) So, I said there are 6116, but that’s off because I was adding those extra characters.

    From: unicore-bounce@unicode.org [mailto:unicore-bounce@unicode.org] On Behalf Of Peter Constable
    Sent: Sunday, August 26, 2007 12:12 PM
    To: Unicode Mailing List; UTC
    Subject: RE: Control picture glyphs (was Re: Apostrophes at www.unicode.org)

    I wrote: “The eleven 2000..200A spaces are only ten…” Please read as “The 2000..200A spaces are only eleven…”

    From: unicore-bounce@unicode.org [mailto:unicore-bounce@unicode.org] On Behalf Of Peter Constable
    Sent: Sunday, August 26, 2007 11:42 AM
    To: Unicode Mailing List; UTC
    Subject: Re: Control picture glyphs (was Re: Apostrophes at www.unicode.org)

    [I was planning to ignore a thread about apostrophes. *This* is a rather different topic. It would help if the subject field had been changed when the topic forked like this.]

    There are various reasons why a font might not be created as Mark suggests.

    First, various default ignorable code points have attained that status at various points over the past several years. Many fonts pre-date when a code point became ignorable.

    Second, if we consider default ignorables, there’s a very large number that would bloat fonts with little utility. Let’s consider them: Other_Default_Ignorable_Code_Point + Cf + Cc + Cs + Noncharacters - White_Space - annotation characters.

    Three of these groups are things that should never get painted, or at least that a font developer certainly doesn’t assume they’re responsible for:


    - 66 non-character code points

    - 65 control characters (Cc)

    - 2048 surrogate code points (Cs)

    Do you seriously expect font developers to add 2179 entries into their cmap tables for this stuff?
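
    The three counts above can be checked mechanically. The following is a minimal Python sketch, not part of the original message: it assumes the standard `unicodedata` module for the Cc and Cs general categories, and the helper name `is_noncharacter` is hypothetical. Noncharacters are taken to be U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.

```python
import unicodedata

def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

ALL_CODE_POINTS = range(0x110000)  # U+0000..U+10FFFF

noncharacters = sum(1 for cp in ALL_CODE_POINTS if is_noncharacter(cp))
controls = sum(1 for cp in ALL_CODE_POINTS
               if unicodedata.category(chr(cp)) == "Cc")
surrogates = sum(1 for cp in ALL_CODE_POINTS
                 if unicodedata.category(chr(cp)) == "Cs")

# Total number of cmap entries the three groups would require.
never_painted = noncharacters + controls + surrogates
print(noncharacters, controls, surrogates, never_painted)
```

    Cc, Cs and the noncharacter set are fixed, so the result should not depend on which Unicode version the local Python ships with; it should agree with the 66, 65, 2048 and 2179 given above.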

    There are 26 white space characters. For some, I’ll agree that I’d expect them to be generally supported in fonts, but not all:


    - Six are Cc, discussed above.

    - Four of these are script or domain specific. How many vendors creating (e.g.) a Latin font are going to feel a particular need to add support in their font for MONGOLIAN VOWEL SEPARATOR, for instance?

    - The SEPARATORs are two more things that should never get painted.

    - The eleven 2000..200A spaces are only ten, and not hard to deal with (once you decide how much advance to give each). Of course, none should be given zero advance as you suggested.

    - The remaining three are SPACE, NBSP, and NNBSP. Again, of course these shouldn’t have zero-advance glyphs. The first two are usually supported in fonts; I’d be inclined always to support NNBSP as well.

    Other_Default_Ignorable_Code_Point includes 3779 code points:


    - 3774 of these are code points with the general category of Cn. (I never noticed what an odd collection this is!) Surely you don’t expect font developers to add cmap entries for all of these?

    - Four are Hangul filler characters. What font developer making a font for anything but Hangul is going to feel a particular need to support these?

    - One is CGJ. (I’ll return to this below.)

    There are 138 format control characters (Cf).


    - Eight are script-specific graphic characters. Is (e.g.) a Latin font developer expected to add the END OF AYAH to their font?

    - Six are the deprecated controls 206A..206F, which should never be painted.

    - Eleven are the annotation, beam and slur characters, which are intended for process-internal use and should never get painted.

    - 97 of them are the plane-14 tag characters. These may not be deprecated, but they are strongly discouraged. A font developer is supposed to support these?

    - There are 16 others. (I’ll return to these.)

    By the way, the code points to which variation selectors are assigned are not default ignorable code points.

    And while I’m digressing on VSs, there’s another reason a font vendor might not do what Mark suggests – this pertains to variation selectors as well: not all font vendors creating OpenType fonts *do* have the freedom to add glyphs for these to their fonts. Adobe’s CID-keyed fonts cannot support additional glyphs for variation selectors – which is a key reason why MS and Adobe have agreed on an extension to OpenType for supporting variation-selector sequences that does not require that these be mapped to glyphs. (A font developer can still map VS characters to glyphs if they want, though.)

    In total, there are (by my count) 6116 default ignorable code points. From a font developer’s perspective, the vast majority, over 99%, are junk that should not be bothered with in a font. I certainly wouldn’t want fonts to be bloated with that stuff.

    Here’s how I’d break down the 6116 default ignorable code points from a font-development perspective:

    - Reasonable to ignore in fonts: 6069
    = non-characters (66) + Cc (65) + Cs (2048) + 2028..2029 (2) + 2064..206F (12) + FFF0..FFFB (12) + 1D173..1D17A (8) + E0000..E00FF (256) + E01F0..E0FFF (3600)

    - Generic Cf and Zs that it probably makes sense to support in all fonts (only one is zero-width invisible): 16
                    = 0020, 00A0, 00AD, 2000..200B, 202F

    - Script- / domain-specific characters I’d only expect to be supported in relevant fonts: 16
                    = 0600..0603, 06DD, 070F, 115F..1160, 1680, 17B4..17B5, 180E, 205F, 3000, 3164, FFA0

    - Other format controls: 15
                    = 034F, 200C..200F, 202A..202E, 2060..2063, FEFF
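
    The breakdown above can be re-added with a short script. This is a minimal Python sketch under stated assumptions, not part of the original message: the helper `expand` and the bucket labels are hypothetical, the 66 + 65 + 2048 figure is carried over from the earlier counts rather than recomputed, and the run listed with 12 code points is taken to be 2064..206F, since that is the span starting at 2064 that contains exactly 12.

```python
def expand(spec):
    """Expand a list such as '0020, 2000..200B' into a set of code points."""
    result = set()
    for item in spec.split(","):
        first, _, last = item.strip().partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # a single code point has no '..'
        result.update(range(start, end + 1))
    return result

# Noncharacters + Cc + Cs, taken from the counts given earlier in the message.
NEVER_PAINTED = 66 + 65 + 2048

buckets = {
    "reasonable to ignore": expand(
        "2028..2029, 2064..206F, FFF0..FFFB, 1D173..1D17A, "
        "E0000..E00FF, E01F0..E0FFF"),
    "generic": expand("0020, 00A0, 00AD, 2000..200B, 202F"),
    "script/domain-specific": expand(
        "0600..0603, 06DD, 070F, 115F..1160, 1680, 17B4..17B5, "
        "180E, 205F, 3000, 3164, FFA0"),
    "other format controls": expand(
        "034F, 200C..200F, 202A..202E, 2060..2063, FEFF"),
}

# The four buckets should not share any code point.
listed = sum(len(s) for s in buckets.values())
disjoint = len(set().union(*buckets.values())) == listed

counts = {name: len(s) for name, s in buckets.items()}
counts["reasonable to ignore"] += NEVER_PAINTED
total = sum(counts.values())
ignorable_share = counts["reasonable to ignore"] / total

print(counts, total, f"{ignorable_share:.1%}")
```

    Under those assumptions the buckets come to 6069, 16, 16 and 15, for a total of 6116, and the first bucket is a little over 99% of the whole.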


    Now, that last set consists of the following:

    034F COMBINING GRAPHEME JOINER
    200C ZERO WIDTH NON-JOINER
    200D ZERO WIDTH JOINER
    200E LEFT-TO-RIGHT MARK
    200F RIGHT-TO-LEFT MARK
    202A LEFT-TO-RIGHT EMBEDDING
    202B RIGHT-TO-LEFT EMBEDDING
    202C POP DIRECTIONAL FORMATTING
    202D LEFT-TO-RIGHT OVERRIDE
    202E RIGHT-TO-LEFT OVERRIDE
    2060 WORD JOINER
    2061 FUNCTION APPLICATION
    2062 INVISIBLE TIMES
    2063 INVISIBLE SEPARATOR
    FEFF ZERO WIDTH NO-BREAK SPACE
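
    The same list can be regenerated from the character database. The following is a minimal Python sketch, not part of the original message; it assumes the standard `unicodedata` module, and the names `OTHER_FORMAT_CONTROLS` and `rows` are hypothetical.

```python
import unicodedata

# The fifteen code points of the "other format controls" set.
OTHER_FORMAT_CONTROLS = [
    0x034F,
    *range(0x200C, 0x2010),  # 200C..200F
    *range(0x202A, 0x202F),  # 202A..202E
    *range(0x2060, 0x2064),  # 2060..2063
    0xFEFF,
]

# One (code point, general category, character name) row per entry.
rows = [(cp, unicodedata.category(chr(cp)), unicodedata.name(chr(cp)))
        for cp in OTHER_FORMAT_CONTROLS]

for cp, category, name in rows:
    print(f"{cp:04X} {category} {name}")
```

    Every entry should report general category Cf except 034F COMBINING GRAPHEME JOINER, which is Mn; that is consistent with CGJ being counted under Other_Default_Ignorable_Code_Point rather than among the format controls.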

    It is *this* set plus the variation selectors that might have been relevant for Mark’s suggestion, not default ignorable code points. Perhaps a discussion of this set in relation to font implementations would be useful.


    Peter

    From: unicore-bounce@unicode.org [mailto:unicore-bounce@unicode.org] On Behalf Of Mark Davis
    Sent: Friday, August 24, 2007 8:29 AM
    To: Doug Ewell
    Cc: Unicode Mailing List; UTC
    Subject: Re: Apostrophes at www.unicode.org

    A similar annoyance is the fact that so many fonts don't map the default-ignorable code points (like variation selectors) to a zero-width invisible glyph by default. Especially since with True/OpenType, it is essentially free to add support for a character that has the same glyph as one you already have in the font.

    Maybe what would help would be a document aimed at font developers, which contained a list of the default mappings that they should supply.
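
    The idea of pointing several code points at one existing glyph can be illustrated with a toy model. This is a minimal Python sketch, not part of the original message and not a real font API: the cmap is modelled as a plain dictionary, and the glyph name `zerowidth` and the helper `map_to_existing_glyph` are hypothetical. It does not model the encoding cost of real cmap subtables.

```python
# Toy font: a set of glyph names plus a cmap from code point to glyph name.
glyphs = {".notdef", "space", "hyphen", "zerowidth"}
cmap = {0x0020: "space", 0x002D: "hyphen", 0x200B: "zerowidth"}

def map_to_existing_glyph(cmap, glyphs, code_points, glyph_name):
    """Point each unmapped code point at a glyph the font already contains."""
    if glyph_name not in glyphs:
        raise KeyError(f"font has no glyph named {glyph_name!r}")
    for cp in code_points:
        cmap.setdefault(cp, glyph_name)  # never overwrite an existing mapping

glyph_count_before = len(glyphs)
entries_before = len(cmap)

# ZWNJ, ZWJ, LRM, RLM and WORD JOINER all reuse the one invisible glyph.
map_to_existing_glyph(cmap, glyphs, [0x200C, 0x200D, 0x200E, 0x200F, 0x2060],
                      "zerowidth")

print(len(glyphs) - glyph_count_before, len(cmap) - entries_before)
```

    In this model the glyph set does not grow at all; only the mapping table gains entries, which is the sense in which the extra support is cheap.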

    Mark
    On 8/23/07, Doug Ewell <dewell@roadrunner.com> wrote:
    Eric Muller <emuller at adobe dot com> wrote:

    > ... Most of the time, it should be U+2010 ‐ HYPHEN. However, the
    > support in fonts for U+2010 is less than perfect, and some users
    > will get a .notdef glyph.

    Speaking only for myself, the poor level of support for U+2010 in many
    mainstream fonts with otherwise decent Unicode coverage is a frequent
    annoyance, and a puzzlement.

    --
    Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
    http://users.adelphia.net/~dewell/
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages



    --
    Mark



    This archive was generated by hypermail 2.1.5 : Sun Aug 26 2007 - 20:50:25 CDT