L2/01-223

From: "Michel Suignard" <michelsu@microsoft.com>
23 May 2001

New revised text:
------------------

The use of character sets in East Asia has built a strong legacy, which is enshrined in the way characters are used based on their original character code. Characters used in that context can be divided into the following categories:

1. script letters that are unequivocally narrow and Western and do not participate in any CJK logic (e.g. ASCII letters)
2. symbols that are unequivocally narrow (e.g. ASCII symbols)
3. symbols that always participate in CJK logic in various subtle ways

There is a fourth category:
4. script letters that may participate in CJK logic (my take, based on a personal survey, is that this usage has been largely deprecated; more on this in the text that follows)

The problem area is obviously distinguishing between 2 and 3. In trying to find a dramatic way to illustrate how these two categories differ, I found a fairly decent one: invoke the vertical layout flow on East Asian fonts (with Asmus' Unibook tool, you just have to select a font with the '@' prefix to see the effect). There is a good correlation between a symbol belonging to category 3 and its being upright in vertical flow mode (it gets a bit more complicated for brackets and parentheses, but those typically belong to symbol groups whose other characters are drawn according to a vertical flow layout).

The characters in category 3, along with standard CJK wide characters (ideographs, Hangul, Jamo, etc...) participate in CJK typography rules in the following ways:

- algorithmic kerning, that is, blank space within their advance width will be removed in certain precise situations (this is also known as Character Space Control or CSC by some East Asian experts)
- start of line removal of the same blank space
- end of line removal (hanging punctuation)
- removal or addition of advance width on those 'blank' portions of the advance width when doing line justification
- various glyph adjustments within the bounding box when going from horizontal to vertical layout flow
- baseline alignment (ideographic instead of roman baseline in horizontal flow, and center in vertical flow), note that the difference in baseline alignment strategy typically implies a specific glyph

None of the characters in category 2 (narrow symbols) are subject to these effects. For example, in vertical flow they would be on their side, and no algorithmic kerning would ever remove blank space from them (the white space characters are an exception, but this is beside the point here). This makes the distinction between cat 2 and cat 3 crucial, and as I said it is based mostly on East Asian typography experience. I can categorize these cat3 characters as follows:

Any symbols encoded in the following blocks by East Asian fonts:

2000-206F General punctuation
2100-214F Letterlike symbols
2460-24FF Enclosed alpha
25A0-25FF Geometric shapes
2600-267F Miscellaneous symbols
3000-303F CJK Symbols
3200-33FF Enclosed, CJK Compat
FE30-FE6F CJK Compat forms, Small form variants
FF00-FFEF Half and Full Width Forms

I have found a slight deviation in Taiwanese fonts where they also
categorized the following as cat3:
2500-257F Box drawing
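
As a rough illustration, the block list above can be expressed as a lookup table. The Python sketch below is only that: the names CAT3_BLOCKS, TAIWANESE_EXTRA_BLOCKS and is_cat3_candidate are made up for this example, and the function tests block membership only. It does not check that the character is a symbol or that a given East Asian font actually encodes it, both of which the categorization above also requires.

```python
# Inclusive code point ranges, copied from the block list above.
CAT3_BLOCKS = [
    (0x2000, 0x206F),  # General punctuation
    (0x2100, 0x214F),  # Letterlike symbols
    (0x2460, 0x24FF),  # Enclosed alpha
    (0x25A0, 0x25FF),  # Geometric shapes
    (0x2600, 0x267F),  # Miscellaneous symbols
    (0x3000, 0x303F),  # CJK Symbols
    (0x3200, 0x33FF),  # Enclosed, CJK Compat
    (0xFE30, 0xFE6F),  # CJK Compat forms, Small form variants
    (0xFF00, 0xFFEF),  # Half and Full Width Forms
]

# Deviation observed in Taiwanese fonts.
TAIWANESE_EXTRA_BLOCKS = [
    (0x2500, 0x257F),  # Box drawing
]


def is_cat3_candidate(ch, taiwanese_font=False):
    """Return True if ch falls in one of the blocks listed above.

    Hypothetical helper: block membership only, no font or symbol check.
    """
    cp = ord(ch)
    blocks = list(CAT3_BLOCKS)
    if taiwanese_font:
        blocks += TAIWANESE_EXTRA_BLOCKS
    return any(lo <= cp <= hi for lo, hi in blocks)


if __name__ == "__main__":
    for ch in ["+", "\u3001", "\u2500", "\u2329", "\u3008"]:
        print(f"U+{ord(ch):04X}", is_cat3_candidate(ch),
              is_cat3_candidate(ch, taiwanese_font=True))
```

Keeping the Taiwanese deviation in a separate table makes it an explicit option rather than part of the common list.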

As you can see, the situation is already not pretty, as a mixed flow
layout containing a math expression could come out horrendous.
However, the impact of this interpretation by East Asian typography
processes is limited by several factors:

- the bulk of the math symbols are not among these cat3 characters
- the categorization splits go by block
- space adjustments concern only specific characters, and those characters are typically only used in a CJK context. The glaring exceptions, however, are the bracket/parenthesis characters

From this you can see that the bracket unification between 2329-232A and 3008-3009 is devastating: suddenly math characters from a non-cat3 block are transformed into some of the characters that are most influenced by East Asian typography, that is, they become sensitive to:

- special rules about algorithmic kerning
- different shapes depending on flow layout
- line breaking rules, etc...
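
The block jump caused by the unification can be observed with Python's standard unicodedata module. This is a hedged sketch, assuming the Unicode data shipped with the interpreter records 2329-232A as canonically equivalent to 3008-3009, so that normalization replaces one with the other. The constant CJK_SYMBOLS_BLOCK and the helper in_cjk_symbols are names invented for this example.

```python
import unicodedata

# 3000-303F CJK Symbols, one of the cat3 blocks discussed in this text.
CJK_SYMBOLS_BLOCK = (0x3000, 0x303F)


def in_cjk_symbols(ch):
    """Hypothetical helper: True if ch lies in the 3000-303F block."""
    lo, hi = CJK_SYMBOLS_BLOCK
    return lo <= ord(ch) <= hi


if __name__ == "__main__":
    for math_bracket in ("\u2329", "\u232A"):
        # Canonical normalization maps the bracket to its equivalent.
        normalized = unicodedata.normalize("NFC", math_bracket)
        print(
            f"U+{ord(math_bracket):04X} (cat3 block: {in_cjk_symbols(math_bracket)})"
            f" -> U+{ord(normalized):04X} (cat3 block: {in_cjk_symbols(normalized)})"
        )
```

Any process that normalizes text would thus move a bracket typed in a math context into a block that East Asian typography treats as cat3.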

Concerning the script letters that in the past participated in CJK logic (these were mostly incomplete subsets of Greek and Cyrillic), their usage didn't survive the full screen terminal mode of yesterday. If you look at modern East Asian fonts, all these letters are variable width, do not get upright in vertical flow and do not get involved in East Asian typography as 'Wide' characters. So, although they may appear in these fonts, they really do not behave differently than if they were included in a Western font. And typically, because their hinting is not as good as in true Western fonts, they very often get swapped in favor of the latter at rendering time. So despite the fact that more and more Latin and possibly other 'narrow' scripts are showing up in East Asian repertoires, it doesn't really mean that they are ever treated as 'wide'. So there is no need to treat them as ambiguous.

This really means that, where width ambiguity is concerned, we should concentrate on symbols, not letters.

The last point I would like to mention is the complexity of the disambiguation. In a tightly controlled document environment (Microsoft Office is a good example), the language and other locale information are typically well known within the context of the text and can be used to successfully determine the Narrow or Wide nature of ambiguous symbols. But in the Web context, the locale information is often missing or, even worse, incorrect. So ambiguity is really bad in that context.
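
A minimal sketch of such locale-based disambiguation follows, with loudly stated assumptions: it uses the East Asian Width property exposed by Python's unicodedata module (whose 'A' value may not match the categories in this text exactly), the set EAST_ASIAN_LANGUAGES is an arbitrary choice for the example, and resolve_width is a hypothetical helper, not an existing API.

```python
import unicodedata

# Assumption for this sketch: language tags treated as East Asian context.
EAST_ASIAN_LANGUAGES = {"ja", "ko", "zh"}


def resolve_width(ch, language=None):
    """Return 'wide', 'narrow' or 'ambiguous' for a single character.

    language is an optional tag such as 'ja' or 'en-US'; None models
    the Web case where locale information is missing.
    """
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return "wide"
    if eaw != "A":
        return "narrow"
    # Ambiguous character: only the context can decide.
    if language is None:
        return "ambiguous"
    primary = language.split("-")[0].lower()
    return "wide" if primary in EAST_ASIAN_LANGUAGES else "narrow"


if __name__ == "__main__":
    for lang in ("ja", "en-US", None):
        print(lang, resolve_width("\u2460", lang))
```

Note that when the language tag is present but wrong, the function returns a confident and incorrect answer, which is the worse of the two Web cases.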

What I am getting at is that the last thing we want to see is more ambiguity. The current symbol ambiguity is already bad, but it is constrained to some Unicode blocks. Opening it to the whole math bracket repertoire would make a bad situation much worse.

Michel