Adobe-Japan1 IVS collection: single sequence for a given character

From: Eric Muller
Date: Thu Mar 22 2007 - 12:10:07 CST

    Why does Adobe-Japan1 contain a single sequence for a given character?

    The fundamental reason is that 1) CIDs are more restrictive than
    characters, and 2) our CID collection is open-ended.

    1) if you look at what is in the Adobe-Japan1 CID collection today, you
    will notice that it distinguishes shapes that are not distinguished at
    the character level. Whereas Unicode unifies two shapes that differ only
    in roof-top modification or in rotated strokes, the AJ1 CID collection
    retains those distinctions. If you think in terms of glyph sets, a
    character is a certain set of glyphs; a CID is a subset of one such set;
    the glyphs for that CID in a given AJ1 font family are a subset of that
    subset; the glyph for that CID in a particular face of that family is a
    member of that last subset. Of course, this nesting is not always very clean,
    because of duplicate encoding, and various other historical accidents,
    but it's still a useful view.

    The fact that we identify a single subset of a given character (i.e.
    have a single CID for a given character) does not mean that that subset
    contains all the glyph shapes for the character. More concretely,
    consider U+4FD8, for which we only have CID 4147: there are shapes which
    are acceptable for U+4FD8 which are not acceptable for CID 4147. In
    other words, these two things are not equivalent, so <U+4FD8> and
    <U+4FD8, U+E0100> = CID 4147 express different things. Granted, this is
    not explicitly stated in the definition of AJ1, but it is there.

    It is true that if I display today <U+4FD8> with an AJ1 font, then I
    will always get a shape that satisfies CID 4147, because that is the
    only kind of shape that can get in an AJ1 font today. But if I display
    with any Unicode font, not just an AJ1 font, <U+4FD8, U+E0100> and
    <U+4FD8> can produce different results, and I have "more guarantees"
    about what <U+4FD8, U+E0100> will look like than I have about what
    <U+4FD8> will look like.
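
    To make the distinction concrete at the plain-text level, here is a
    minimal sketch in Python. The helper names (is_variation_selector,
    strip_variation_selectors, code_points) are made up for illustration;
    the only facts taken from the discussion above are the code points
    U+4FD8 and U+E0100. It shows that the bare character and the sequence
    are different strings, and that dropping the selector gives back the
    bare character.

```python
BASE = "\u4FD8"       # the bare character discussed above
VS17 = "\U000E0100"   # VARIATION SELECTOR-17

plain = BASE          # <U+4FD8>
ivs = BASE + VS17     # <U+4FD8, U+E0100>, which the text equates with CID 4147


def is_variation_selector(ch):
    """True for U+FE00..U+FE0F and U+E0100..U+E01EF (the variation selectors)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def strip_variation_selectors(text):
    """Drop every variation selector, keeping only the base characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


def code_points(text):
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    print(code_points(plain), code_points(ivs))
    print(strip_variation_selectors(ivs) == plain)
```

    Stripping the selector is lossy in exactly the sense described above:
    what remains is still a valid U+4FD8, but the extra guarantee about its
    shape is gone.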

    All this applies equally well to the cases where a character has
    multiple CIDs. The only difference in that case is that I can guarantee
    that two different occurrences of a given character will show up

    2) our AJ1 CID collection is open-ended, i.e. we can add CIDs to it over
    time, as the need arises. For example, suppose that JIS decides in a new
    edition to modify the shape of the acceptable glyphs for a given JIS
    code point, then we would add a CID for the new shape. Playing that in
    the past: consider the shape given in JIS 0208:1978 to 17-28 (aka
    U+958F): we have CID 1246 for that; then :1984 comes along and changes
    the shape of 17-28; we do not redefine the shape of CID 1246, instead we
    add CID 7641. [This reconstruction does not necessarily match what
    really happened; it's only for illustration.]

    Let's put the two together. If I want the CID 4147 guarantee but there
    is no IVS for it today, then all I can put in my document is <U+4FD8>.
    We already saw that this may or may not be displayed the way I want. I
    need to impose, by means outside my plain text, the use of an AJ1 font
    to get what I want. Not wonderful, but I could live with it. Then
    tomorrow a new CID shows up for U+4FD8, and we register two sequences,
    one for CID 4147 and one for the new CID. I can use those sequences in
    new documents, but that leaves the document I created today in the cold.
    Even if I can still enforce the use of an AJ1 font, I no longer get the
    guarantee that this lone <U+4FD8> is displayed with CID 4147. I would
    need a further guarantee that in AJ1 fonts, U+4FD8 is cmapped to CID
    4147, now *and forever*. Well, if you look at the history of font
    cmaps, that is definitely not happening. Indeed, the changes of shape
    mandated by the JIS standards make it more or less impossible to enforce
    a "never change the cmaps" rule, and that creates all sorts of very nasty
    problems for our customers. By registering today a sequence even when it
    is the only one for a given character, we can offer our customers (and
    others) a viable and robust solution.
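
    The scenario can be played out with a toy model, sketched here in
    Python. Everything in it is an assumption made for illustration: the
    ToyFont class is not a real font API, the "tomorrow" font is invented,
    the new CID number 99999 is made up, and so is the choice of U+E0101
    for its sequence. Only U+4FD8, U+E0100 and CID 4147 come from the
    discussion above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

U4FD8 = 0x4FD8
VS17 = 0xE0100
VS18 = 0xE0101                 # hypothetical selector for the hypothetical new CID
HYPOTHETICAL_NEW_CID = 99999   # made-up number, not a real AJ1 CID assignment


@dataclass
class ToyFont:
    """Toy stand-in for a font: a default cmap plus a variation-sequence map."""
    default_cmap: Dict[int, int]                                       # code point -> CID
    ivs_map: Dict[Tuple[int, int], int] = field(default_factory=dict)  # (base, selector) -> CID

    def cid_for(self, base: int, selector: Optional[int] = None) -> int:
        # A registered sequence wins; otherwise fall back to the default cmap.
        if selector is not None and (base, selector) in self.ivs_map:
            return self.ivs_map[(base, selector)]
        return self.default_cmap[base]


# Today: one CID for the character, and its sequence is already registered.
font_today = ToyFont({U4FD8: 4147}, {(U4FD8, VS17): 4147})

# Tomorrow (invented): a second CID exists and the default cmap now points at it.
font_tomorrow = ToyFont(
    {U4FD8: HYPOTHETICAL_NEW_CID},
    {(U4FD8, VS17): 4147, (U4FD8, VS18): HYPOTHETICAL_NEW_CID},
)

if __name__ == "__main__":
    for name, font in (("today", font_today), ("tomorrow", font_tomorrow)):
        print(name, font.cid_for(U4FD8), font.cid_for(U4FD8, VS17))
```

    In this toy model the lone <U+4FD8> silently changes CID when the
    default cmap changes, while <U+4FD8, U+E0100> resolves to CID 4147 in
    both fonts, which is the point of registering the sequence early.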

    As usual with variation sequences, this is not to say that every
    occurrence of a character in a document should be decorated with a
    variation selector. Whether to decorate every occurrence, none, or
    something in between depends on what guarantees you need
    for your document. I could imagine for example that an official document
    would systematically decorate people and place names, but would
    systematically not decorate the "boilerplate".
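
    Selective decoration could look roughly like the following Python
    sketch. The function decorate_spans, the PINNED table and the way
    "name" spans are marked (start/end indices supplied by the caller) are
    all hypothetical; a real workflow would need its own way of finding
    names and its own table of wanted sequences. The sketch also assumes
    the input text carries no variation selectors yet.

```python
VS17 = "\U000E0100"

# Hypothetical table: base character -> selector that pins the wanted shape.
PINNED = {"\u4FD8": VS17}


def decorate_spans(text, spans, pinned=PINNED):
    """Append a variation selector after pinned characters that fall inside
    one of the [start, end) spans; leave all other text untouched.

    Indices count Python string positions in the undecorated input.
    """
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        in_span = any(start <= i < end for start, end in spans)
        if in_span and ch in pinned:
            out.append(pinned[ch])
    return "".join(out)


if __name__ == "__main__":
    text = "\u4FD8 and \u4FD8"            # same character twice
    decorated = decorate_spans(text, [(6, 7)])  # treat only the second one as a "name"
    print([f"U+{ord(c):04X}" for c in decorated])
```

    Only the occurrence inside the marked span gains a selector; the
    "boilerplate" occurrence stays a bare character and keeps whatever
    shape the font chooses.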

