Re: EA width, Latin punctuation and fonts, Compatibility ...

From: peter_constable@sil.org
Date: Fri Dec 10 1999 - 14:39:26 EST


       Subject:
       Re: EA width, Latin punctuation and fonts, Compatibility
       Characters [CLIP AND SAVE]
       ---------------------------------
       
       
       Ken, Asmus, et al:

       KW>The term "compatibility character" seems, unfortunately to
       lead people into terminal confusion...

       Thanks for the explanation, Ken. It appears, then, that the
       punctuation characters I asked about that are in the range FF01
       - FF5E are "compatibility characters" in both senses of the
       word. The ideographic space is a compatibility character at
       least in sense 2 (has a compatibility decomposition); I'm not
       sure about sense 1.
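
For readers who want to check "sense 2" directly, here is a minimal sketch using Python's standard `unicodedata` module. It is not part of the original discussion, and the helper name `compat_info` is made up for illustration. It reports the decomposition field for a character and what NFKC normalization folds it to. The values come from whatever Unicode version the Python build ships with, not necessarily the 3.0 data discussed here.

```python
import unicodedata

def compat_info(cp: int) -> tuple[str, str]:
    """Return (decomposition field, NFKC-folded text) for a code point."""
    ch = chr(cp)
    # A compatibility decomposition carries a tag such as <wide>;
    # an empty string means the character has no decomposition at all.
    decomp = unicodedata.decomposition(ch)
    folded = unicodedata.normalize("NFKC", ch)
    return decomp, folded

# Ideographic space and the two ends of the fullwidth ASCII range.
for cp in (0x3000, 0xFF01, 0xFF5E):
    decomp, folded = compat_info(cp)
    targets = " ".join(f"U+{ord(c):04X}" for c in folded)
    print(f"U+{cp:04X}: decomposition={decomp!r}, NFKC -> {targets}")
```

A tagged decomposition such as `<wide>` only establishes sense 2; it says nothing by itself about sense 1.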

       Equipped with this more complete understanding, though, I'm not
       sure that I'm closer to figuring out what the answers to my
       questions are.

       AF>Since Unicode 3.0 is now final, and the main text a few
       short weeks from being sent to the printers, the text in TR#11
       (part of 3.0) now supersedes anything written about this stuff
       in version 2.0

       I've read TR11, and tried to figure out what it tells me that's
       relevant to my question.

       <quote>
       East Asian Width... is a categorization of character. It can
       take on two abstract values, narrow and wide. In legacy
       implementations, there is often... a difference in displayed
       width. However, the actual display width of a glyph is given by
       the font and may be further adjusted by layout. An important
       class of fixed width legacy fonts contains glyphs of just two
       widths with the wider glyphs twice as wide as the narrower
       glyphs...

       East Asian FullWidth (F) - characters that are defined as FULL
       WIDTH and therefore are compatibility equivalents of implicitly
       narrow but unmarked characters elsewhere in the Unicode
       Standard...

       East Asian Wide (W) - characters that are implicitly wide (such
       as the Unified Han Ideographs or Squared Katakana Symbols)
       because they occur only in the context of East Asian typography
       where they are wide characters.

       East Asian Ambiguous (A) - characters that occur in East Asian
       legacy character sets as wide characters, but are displayed as
       narrow (i.e. normal-width) characters in their own local or
       non-East Asian usage (Examples are the Greek and Cyrillic
       alphabet found in East Asian character sets, but also some of
       the mathematical symbols). Ambiguous characters require context
       to resolve their width. Private Use characters are considered
       ambiguous, since additional information is required to know
       whether they should be treated as wide or narrow.
       </quote>

       First of all, it's confusing to read that this categorization
       "can take on two abstract values, narrow and wide" when there
       are six categories that are subsequently defined. Nonetheless,
       if I look at the six categories and consult the data files, I
       learn that

       - U+3000 is wide
       - U+FF01 - FF5E are fullwidth
       - some of the other punctuation that's relevant, e.g. quotation
       marks, is ambiguous
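
The same kind of lookup can be sketched with Python's standard `unicodedata.east_asian_width`, which returns the East Asian Width property value as a short code (`F`, `W`, `A`, `Na`, `H`, `N`). This is an illustration only, and the helper name `eaw` is hypothetical. The result reflects the Unicode data bundled with the Python build, which is later than the 3.0 data files consulted above, so individual assignments (U+3000 in particular) may not match what is listed here.

```python
import unicodedata

def eaw(cp: int) -> str:
    """Return the East Asian Width property code for a code point."""
    return unicodedata.east_asian_width(chr(cp))

samples = {
    0x3000: "IDEOGRAPHIC SPACE",
    0xFF01: "FULLWIDTH EXCLAMATION MARK",
    0xFF5E: "FULLWIDTH TILDE",
    0x201C: "LEFT DOUBLE QUOTATION MARK",
    0x4E00: "a Unified Han ideograph",
}

for cp, label in samples.items():
    # Print the property code next to a human-readable label.
    print(f"U+{cp:04X} {eaw(cp):>2}  {label}")
```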

       The definition for East Asian Wide doesn't tell us anything
       about whether U+3000 is a compatibility equivalent of U+0020,
       but we already know that it is so in sense 2 (and possibly
       sense 1). (Note: TR11 came up in relation to questions about
       U+3000, not U+FF01 - FF5E.)

       By applying the definition for East Asian FullWidth, I know
       that FF01 - FF5E are "compatibility equivalents of implicitly
       narrow but unmarked characters". This TR doesn't say which
       sense of "compatibility" is meant here, but we already know
       that these characters are compatibility characters in both
       senses of the word anyway (see above).

       The definition for East Asian Ambiguous suggests to me that,
       for quotation marks, the preferred solution is option 3 or 4
       (from my original posting), and that I shouldn't encode using
       U+301D and U+301E to access glyphs for wide quotation marks.

       **Is that conclusion right?**

       Except for quotation marks, it doesn't seem to me that
       consulting the definitions in TR11 has gotten me any closer to
       knowing how to implement this.

       Reading on in section 6, "Recommendation" (in the second half -
       the stuff on mapping to legacy encodings isn't relevant), most
       of what is said tells me that, if certain characters are to be
       used, then glyphs in fixed pitch fonts used to display the
       characters should have certain widths. The point about ambiguous
       characters

       <quote>
       Ambiguous characters behave like wide or narrow characters
       depending on context (language tag, associated font, source of
       data, or explicit markup; all can provide the context)
       </quote>

       seems to require option 3 or 4. Otherwise, none of this seems
       to tell me much about the answers to my questions.
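
One way to picture the quoted recommendation is a width resolver for a fixed-pitch display that takes the context as an explicit input. The sketch below is an assumption-laden illustration, not anything TR11 specifies: the function name `display_cells` and the boolean `east_asian_context` flag are invented here, and the flag stands in for whatever a real system derives from a language tag, font, data source, or markup.

```python
import unicodedata

def display_cells(ch: str, east_asian_context: bool) -> int:
    """Return 1 or 2 cells for ch in a two-width fixed-pitch layout."""
    width_class = unicodedata.east_asian_width(ch)
    if width_class in ("F", "W"):
        # Fullwidth and Wide characters always take the wider cell.
        return 2
    if width_class == "A":
        # Ambiguous characters follow the surrounding context.
        return 2 if east_asian_context else 1
    # Na, H and N are treated as narrow in this sketch.
    return 1

quote = "\u201c"  # LEFT DOUBLE QUOTATION MARK, class A
print(display_cells(quote, east_asian_context=True))
print(display_cells(quote, east_asian_context=False))
```

With this arrangement the text keeps the ordinary quotation mark code point, and only the out-of-band context decides which glyph width is used, which is what options 3 and 4 rely on.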

       AF>If the Chinese are creating Yi legacy style character sets
       with the punctuation mapped to the FullWidth characters, I see
       no reason why you should not follow their lead.

       As far as I know, there is one proprietary system in existence
       (a DTP app) created by some developer in China for working with
       Yi text. I don't know any of the details of this, but it
       certainly is not my impression that this has been adopted in
       any official sense as a character set standard, or even as a de
       facto standard. Based on what I'm aware of, I'm assuming that
       legacy character sets are not relevant.

       AF>On the more general question about what to do when you have
       two scripts sharing Latin punctuation, but needing more or less
       subtle adjustments in shape depending on context. For these I
       am in favor of using meta data (font tags or language tags) to
       select the correct glyphs...

       This makes it sound to me like you favour my option 4.

       AF>I'm personally in favor of encoding compatibility characters
       when you want to make distinctions that are maintained in
       equivalent documents using legacy character sets.

       As mentioned above, I'm working with the assumption that legacy
       character sets aren't relevant. (Maybe that's not a valid
       assumption if a lot of people are using the DTP app I mentioned
       - however "a lot" is to be defined.)

       AF>That is, these characters, in my opinion, are not solely
       there for transparent use in character interchange with a
       Unicode pivot, but to provide for a stable collection of
       mutually unique code positions for a given market, whether or
       not one happens to work in Unicode from start to finish, or
       uses legacy character codes for part of the process. **I'm not
       a strong believer in sometimes using meta data and sometimes
       using character codes for expressing the same distinction
       between conceptually same pairs of characters. That is error
       prone and will lead to confusion sooner or later.**

       (emphasis mine)

       The last two sentences of this suggest to me that you might
       favour option 1 in this case, or at least that, if we adopt
       option 1 (include both wide and narrow glyphs in a single font
       and encode text using compatibility - there's that word again -
       characters) as an immediate solution, then it's probably best
       for people to stick with option 1.

       That seems to me to raise a question: If SIL ships a font (it'll
       be freeware, by the way) that assumes people encode using this
       option, will we be doing anybody a disservice?

       So, Ken's and Asmus' comments have been helpful, but I'm still
       not sure how to answer the questions I raised. And I've raised
       at least one new question in this message. And I'm still
       waiting for any responses regarding the "fat period" and
       vertically-centered ellipsis.

       Peter


