Re: EA width, Latin punctuation and fonts, Compatibility
Characters [CLIP AND SAVE]
Ken, Asmus, et al:
KW>The term "compatibility character" seems, unfortunately, to
lead people into terminal confusion...
Thanks for the explanation, Ken. It appears, then, that the
punctuation characters I asked about that are in the range FF01
- FF5E are "compatibility characters" in both senses of the
word. The ideographic space is a compatibility character at
least in sense 2 (has a compatibility decomposition); I'm not
sure about sense 1.
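For anyone who wants to check "sense 2" directly, here is a minimal sketch using Python's standard unicodedata module. The helper name compat_info is made up for illustration. The module reports whatever version of the Unicode Character Database ships with the interpreter, which is an assumption to keep in mind when comparing against the 3.0 data files. A decomposition field that begins with a tag such as <wide> indicates a compatibility decomposition.

```python
import unicodedata


def compat_info(ch):
    """Return (code point, decomposition field, NFKC form) for one character.

    Hypothetical helper for illustration only; it just wraps two
    standard-library lookups.
    """
    return (
        f"U+{ord(ch):04X}",
        unicodedata.decomposition(ch),      # e.g. "<wide> 0020" for U+3000
        unicodedata.normalize("NFKC", ch),  # compatibility-normalized form
    )


if __name__ == "__main__":
    # The ideographic space and the two ends of the FF01 - FF5E range.
    for ch in ("\u3000", "\uFF01", "\uFF5E"):
        print(compat_info(ch))
```

This only tells us about sense 2. Whether a character is a compatibility character in sense 1 is not something the decomposition field records.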
Equipped with this more complete understanding, though, I'm not
sure that I'm closer to figuring out what the answer to my
AF>Since Unicode 3.0 is now final, and the main text a few
from being sent to the printers, the text in TR#11 (part of
now supersedes anything written about this stuff in version 2.0
I've read TR11, and tried to figure out what it tells me that's
relevant to my question.
East Asian Width... is a categorization of character. It can
take on two abstract values, narrow and wide. In legacy
implementations, there is often... a difference in displayed
width. However, the actual display width of a glyph is given by
the font and may be further adjusted by layout. An important
class of fixed width legacy fonts contains glyphs of just two
widths with the wider glyphs twice as wide as the narrower
East Asian FullWidth (F) - characters that are defined as FULL
WIDTH and therefore are compatibility equivalents of implicitly
narrow but unmarked characters elsewhere in the Unicode
East Asian Wide (W) - characters that are implicitly wide (such
as the Unified Han Ideographs or Squared Katakana Symbols)
because they occur only in the context of East Asian typography
where they are wide characters.
East Asian Ambiguous (A) - characters that occur in East Asian
legacy character sets as wide characters, but are displayed as
narrow (i.e. normal-width) characters in their own local or
non-East Asian usage (Examples are the Greek and Cyrillic
alphabet found in East Asian character sets, but also some of
the mathematical symbols). Ambiguous characters require context
to resolve their width. Private Use characters are considered
ambiguous, since additional information is required to know
whether they should be treated as wide or narrow.
First of all, it's confusing to read that this categorization
"can take on two abstract values, narrow and wide" when there
are six categories that are subsequently defined. Nonetheless,
if I look at the six categories and consult the data files, I
- U+3000 is wide
- U+FF01 - FF5E are fullwidth
- some of the other punctuation that's relevant, e.g. quotation
marks, are ambiguous
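The lookup described in the list above can be sketched with Python's standard unicodedata module, as a rough illustration rather than a replacement for reading the data files. The function name width_class and the choice of sample characters are assumptions made here. The property values come from the Unicode Character Database bundled with the interpreter, which is newer than the files consulted for this message, so the class reported for an individual character (U+3000, for instance) may not match what is listed above.

```python
import unicodedata


def width_class(ch):
    """Return the East Asian Width class of ch ('F', 'W', 'A', 'H', 'Na' or 'N')."""
    return unicodedata.east_asian_width(ch)


# Sample code points taken from this discussion: the ideographic space,
# the ends of the fullwidth range, and the two kinds of quotation marks.
SAMPLES = (0x3000, 0xFF01, 0xFF5E, 0x201C, 0x201D, 0x301D, 0x301E)

if __name__ == "__main__":
    for cp in SAMPLES:
        ch = chr(cp)
        print(f"U+{cp:04X} {unicodedata.name(ch, '?')}: {width_class(ch)}")
```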
The definition for East Asian Wide doesn't tell us anything
about whether U+3000 is a compatibility equivalent of U+0020,
but we already know that it is so in sense 2 (and possibly
sense 1). (Note: TR11 came up in relation to questions about
U+3000, not U+FF01 - FF5E.)
By applying the definition for East Asian FullWidth, I know
that FF01 - FF5E are "compatibility equivalents of implicitly
narrow but unmarked characters". This TR doesn't say which
sense of "compatibility" is meant here, but we already know
that these characters were compatibility characters in both
senses of the word anyway (see above).
The definition for East Asian Ambiguous suggests to me that,
for quotation marks, the preferred solution is option 3 or 4
(from my original posting), and that I shouldn't encode using
U+301D and U+301E to access glyphs for wide quotation marks.
**Is that conclusion right?**
Except for quotation marks, it doesn't seem to me that
consulting the definitions in TR11 has gotten me any closer to
knowing how to implement.
Reading on in section 6, "Recommendation" (in the second half -
the stuff on mapping to legacy encodings isn't relevant), most
of what is said tells me that, if certain characters are to be
used, then glyphs in fixed pitch fonts used to display the
characters should have certain widths. The point about ambiguous
Ambiguous characters behave like wide or narrow characters
depending on context (language tag, associated font, source of
data, or explicit markup; all can provide the context)
seems to require option 3 or 4. Otherwise, none of this seems
to tell me much about the answers to my questions.
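To make the "depending on context" point concrete, here is a minimal sketch of how a fixed-pitch layout might resolve widths. Everything about it is an assumption for illustration: the function name resolve_width, the boolean east_asian_context parameter (standing in for whatever the language tag, font, data source, or markup would supply), and the decision to treat every class other than F, W, and A as one column wide. It says nothing about which context should apply to Yi text, which is the open question.

```python
import unicodedata


def resolve_width(ch, east_asian_context):
    """Return the display width of ch in fixed-pitch columns (1 or 2).

    Hypothetical helper: east_asian_context is an assumed flag that the
    caller derives from a language tag, font, data source, or markup.
    """
    cls = unicodedata.east_asian_width(ch)
    if cls in ("F", "W"):
        return 2                      # fullwidth and wide are always two columns
    if cls == "A":
        # Ambiguous characters take their width from the surrounding context.
        return 2 if east_asian_context else 1
    return 1                          # H, Na and N are treated as narrow here


if __name__ == "__main__":
    quote = "\u201C"  # LEFT DOUBLE QUOTATION MARK, an ambiguous character
    print(resolve_width(quote, east_asian_context=True))
    print(resolve_width(quote, east_asian_context=False))
```

The design choice worth noting is that the character code stays the same in both calls. Only the context changes, which is what distinguishes options 3 and 4 from encoding the distinction with separate characters.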
AF>If the Chinese are creating Yi legacy style character sets
with the punctuation mapped to the FullWidth characters, I see
no reason why you should not follow their lead.
As far as I know, there is one proprietary system in existence
(a DTP app) created by some developer in China for working with
Yi text. I don't know any of the details of this, but it
certainly is not my impression that this has been adopted in
any official sense as a character set standard, or even as a de
facto standard. Based on what I'm aware of, I'm assuming that
legacy character sets are not relevant.
AF>On the more general question about what to do when you have
two scripts sharing Latin punctuation, but needing more or less
subtle adjustments in shape depending on context. For these I
am in favor of using meta data (font tags or language tags) to
select the correct glyphs...
This makes it sound to me like you favour my option 4.
AF>I'm personally in favor of encoding compatibility characters
when you want to make distinctions that are maintained in
equivalent documents using legacy character sets.
As mentioned above, I'm working with the assumption that legacy
character sets aren't relevant. (Maybe that's not a valid
assumption if a lot of people are using the DTP app I mentioned
- however "a lot" is to be defined.)
AF>That is, these characters, in my opinion, are not solely
there for transparent use in character interchange with a
Unicode pivot, but to provide for a stable collection of
mutually unique code positions for a given market, whether or
not one happens to work in Unicode from start to finish, or
uses legacy character codes for part of the process. **I'm not
a strong believer in sometimes using meta data and sometimes
using character codes for expressing the same distinction
between conceptually same pairs of characters. That is error
prone and will lead to confusion sooner or later.**
The last two sentences of this suggest to me that you might
favour option 1 in this case, or at least that, if we adopt
option 1 (include both wide and narrow glyphs in a single font
and encode text using compatibility - there's that word again -
characters) as an immediate solution, then it's probably best
for people to stick with option 1.
That seems to me to beg a question: If SIL ships a font (it'll
be freeware, by the way) that assumes people encode using this
option, will we be doing anybody a disservice?
So, Ken's and Asmus' comments have been helpful, but I'm still
not sure how to answer the questions I raised. And I've raised
at least one new question in this message. And I'm still
waiting for any responses regarding the "fat period" and