L2/06-399

Subject: Script property for U+3200..U+33FF
Source: Eric Muller, Adobe Systems
Date: November 24, 2006


The script property values for the characters U+3200..U+33FF, i.e. the
blocks "Enclosed CJK Letters and Months" and "CJK Compatibility", seem
to be inconsistent. The purpose of this document is to propose a fix
for that inconsistency.

First, consider:

3251..325F  Common  Circled numbers
32B1..32BF  Common  Circled numbers

Since the circled numbers are fundamentally based on the Latin digits, it
is appropriate to give them the same script as the Latin digits, i.e.
Common.

327F        Common  Symbol (KOREAN STANDARD SYMBOL)

This is really just a symbol (without a compatibility decomposition)
so the script Common is appropriate.

The inconsistency is really among the remaining characters. They all
share the common pattern of being fundamentally formed from
script-specific characters (Latin, Katakana, Han, or Hangul), arranged
or decorated somehow. Furthermore, they all have a compatibility
decomposition involving these script-specific characters. The
inconsistency is that some are given the same script as those
constituent characters, while others are given the script Common:

Hangul parts:

3200..320D  Hangul  Parenthesized Hangul elements
320E..321C  Hangul  Parenthesized Hangul syllables
321D..321E  Hangul  Parenthesized Korean words
3260..326D  Hangul  Circled Hangul elements
326E..327B  Hangul  Circled Hangul syllables
327C..327D  Hangul  Circled Korean words

327E        Common  Circled Hangul syllable

Han parts:

3220..3243  Common  Parenthesized ideographs
3280..32B0  Common  Circled ideographs
337B..337E  Common  Japanese era names
337F        Common  Japanese corporation
32C0..32CB  Common  Telegraph symbols for months
3358..3370  Common  Telegraph symbols for hours
33E0..33FE  Common  Telegraph symbols for days

Katakana parts:

32D0..32FE  Common  Circled Katakana
3300..3357  Common  Squared Katakana words

Latin parts:

3250        Common  Squared Latin abbreviation
32CC..32CF  Common  Squared Latin abbreviations
3371..337A  Common  Squared Latin abbreviations
3380..33DF  Common  Squared Latin abbreviations
33FF        Common  Squared Latin abbreviation

The only distinction which could be made among these characters is
that the Telegraph symbols also incorporate Latin digits, in addition
to Han characters, but I do not view this as significant.

One can view those characters primarily as symbols, or primarily as
ordinary text with stylistic constraints. Accordingly, this leads to
two ways of resolving the inconsistency.

  Proposal A: the characters which currently have the Hangul script 
  should be changed to have the Common script.

  Proposal B: the characters which currently have the Common script
  and "contain" script-specific parts should be changed to have
  the script of their parts. (This excludes the circled numbers and
  the KOREAN STANDARD SYMBOL.)
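
As an illustration only, the two proposals can be written as
transformations of the listing above. The following Python sketch is
not taken from any Unicode data file: the names ROWS, proposal_a,
proposal_b and changed are hypothetical, the "parts" column is one
reading of the listing, and adjacent Hangul rows have been collapsed.

```python
# Rows from the listing above: (first, last, current script, script of parts).
# "parts" is None for the circled numbers and the KOREAN STANDARD SYMBOL,
# which both proposals leave as Common. Adjacent Hangul rows are collapsed.
ROWS = [
    (0x3200, 0x321E, "Hangul", "Hangul"),
    (0x3220, 0x3243, "Common", "Han"),
    (0x3250, 0x3250, "Common", "Latin"),
    (0x3251, 0x325F, "Common", None),
    (0x3260, 0x327D, "Hangul", "Hangul"),
    (0x327E, 0x327E, "Common", "Hangul"),
    (0x327F, 0x327F, "Common", None),
    (0x3280, 0x32B0, "Common", "Han"),
    (0x32B1, 0x32BF, "Common", None),
    (0x32C0, 0x32CB, "Common", "Han"),
    (0x32CC, 0x32CF, "Common", "Latin"),
    (0x32D0, 0x32FE, "Common", "Katakana"),
    (0x3300, 0x3357, "Common", "Katakana"),
    (0x3358, 0x3370, "Common", "Han"),
    (0x3371, 0x337A, "Common", "Latin"),
    (0x337B, 0x337F, "Common", "Han"),
    (0x3380, 0x33DF, "Common", "Latin"),
    (0x33E0, 0x33FE, "Common", "Han"),
    (0x33FF, 0x33FF, "Common", "Latin"),
]


def proposal_a(rows):
    """Proposal A: every row ends up with the Common script."""
    return [(first, last, "Common") for first, last, _cur, _parts in rows]


def proposal_b(rows):
    """Proposal B: rows with script-specific parts take the script of the parts."""
    return [(first, last, parts or "Common") for first, last, _cur, parts in rows]


def changed(rows, proposed):
    """Return (first, last, current, new) for rows whose script value changes."""
    return [
        (first, last, cur, new)
        for (first, last, cur, _parts), (_f, _l, new) in zip(rows, proposed)
        if cur != new
    ]


for first, last, cur, new in changed(ROWS, proposal_b(ROWS)):
    print(f"{first:04X}..{last:04X}  {cur} -> {new}")
```

Passing proposal_a(ROWS) to changed instead lists the rows that
Proposal A would alter.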

I personally think that both points of view are equally valid, and
that we need to bring in implementation considerations to make the
call:

- in rendering systems that process runs of different scripts
  separately (with "Common" resolved to some "ambient script", much
  like bidi resolves neutral characters), there is virtually no
  possibility of typographic interaction at the run boundaries, e.g.
  no possibility of ligatures or kerning. Thus there would be no
  possibility of kerning between, say, a squared Latin abbreviation
  and a following non-Latin, script-specific character.

- the representation of the Unicode data can be more compact if there
  are large runs of successive code points that share the same
  property value. This tends to be particularly important in small
  devices like mobile phones.
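
The two considerations above can be sketched in Python. This is a
minimal illustration under stated assumptions, not a description of
any particular rendering system or data format: the functions
resolve_common, script_runs and merge_ranges are hypothetical, the
ambient script is assumed to be the nearest preceding specific script
(falling back to the first one that follows), and the two small script
tables are toy data covering only the example text.

```python
from itertools import groupby


def resolve_common(scripts):
    """Replace each "Common" with the ambient script (assumed rule: the
    nearest preceding specific script, else the first one that follows)."""
    resolved = list(scripts)
    ambient = next((s for s in scripts if s != "Common"), "Common")
    for i, script in enumerate(resolved):
        if script == "Common":
            resolved[i] = ambient
        else:
            ambient = script
    return resolved


def script_runs(text, script_of):
    """Split text into (script, substring) runs after resolving Common."""
    resolved = resolve_common([script_of(ch) for ch in text])
    return [
        (script, "".join(ch for _s, ch in group))
        for script, group in groupby(zip(resolved, text), key=lambda p: p[0])
    ]


def merge_ranges(ranges):
    """Merge contiguous (first, last, value) ranges that share a value."""
    merged = []
    for first, last, value in sorted(ranges):
        if merged and merged[-1][2] == value and merged[-1][1] + 1 == first:
            merged[-1] = (merged[-1][0], last, value)
        else:
            merged.append((first, last, value))
    return merged


# U+3392 lies in 3380..33DF (squared Latin abbreviations); U+6771 is a
# Han ideograph. Both tables are toy data for this example only.
TEXT = "\u3392\u6771"
CURRENT = {"\u3392": "Common", "\u6771": "Han"}
PROPOSAL_B = {"\u3392": "Latin", "\u6771": "Han"}

print(script_runs(TEXT, CURRENT.get))
print(script_runs(TEXT, PROPOSAL_B.get))

# Three contiguous rows of the listing with their current values.
print(merge_ranges([
    (0x3371, 0x337A, "Common"),
    (0x337B, 0x337F, "Common"),
    (0x3380, 0x33DF, "Common"),
]))
```

The same functions can be run on any candidate assignment to see where
run boundaries fall and how many table entries the assignment needs
within these blocks.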

Neither consideration is very strong, but they are enough to tip my choice
toward proposal B.

---