Unicode 3.2 comments - part 1

From: David Hopwood (david.hopwood@zetnet.co.uk)
Date: Fri Jan 25 2002 - 03:51:34 EST


-----BEGIN PGP SIGNED MESSAGE-----

Grapheme breaking

  There are two errors in the grapheme breaking rules:

  1. In the rule to prevent breaking CRLF, 'not CR' is used instead
     of 'CR'.
  2. There are no rules for "Extend × Extend" and "Extend × Link".
     That would cause breaks within combining sequences, and between a
     combining sequence and GRAPHEME JOINER, for example.

  Here is a minimally corrected version of the existing rules:

                CR × LF

              Base × Extend
            Extend × Extend
              Link × Base
              Link × Join_Control Base
              Base × Link
            Extend × Link

                 L × (L / V / LV / LVT)
          (LV / V) × (V / T)
         (LVT / T) × T

               Any ÷

  [6 of these rules can alternatively be written as:

   (Base / Extend) × (Extend / Link)
              Link × [Join_Control] Base
  ]

  However, I suggest using the following rules instead, where
  Precede = Join_Control:

                CR × LF

           Precede × Precede
           Precede × Base
              Base × Extend
            Extend × Extend
              Link × Precede
              Link × Base
              Base × Link
            Extend × Link

                 L × (L / V / LV / LVT)
          (LV / V) × (V / T)
         (LVT / T) × T

               Any ÷

  [8 of these rules can alternatively be written as:

   (Base / Extend) × (Extend / Link)
  (Link / Precede) × (Precede / Base)

  Note that the "Link × Join_Control Base" rule is implemented instead
  by "Link × Precede" and "Precede × Base". This has the side effect
  that more than one Join_Control can appear between the Link and Base,
  but that should make no practical difference.]
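
  As a minimal sketch of how the suggested rules could be applied, the
  Python below builds the no-break pairs from the compact two-line form
  and splits a string at every other boundary. The classifier
  char_class is hypothetical and deliberately tiny (a real one would
  read the classes from the property data files), the name clusters is
  my own, and the Hangul rules are omitted for brevity.

```python
from itertools import product


def char_class(ch):
    """Toy classifier; the class assignments here are illustrative only."""
    cp = ord(ch)
    if ch == "\r":
        return "CR"
    if ch == "\n":
        return "LF"
    if cp in (0x200C, 0x200D):      # ZWNJ, ZWJ: Precede = Join_Control
        return "Precede"
    if cp == 0x094D:                # DEVANAGARI SIGN VIRAMA, assumed Link
        return "Link"
    if 0x0300 <= cp <= 0x034E:      # a few combining marks, assumed Extend
        return "Extend"
    return "Base"


# Pairs (left, right) between which no break is allowed.
NO_BREAK = {("CR", "LF")}
NO_BREAK |= set(product(("Base", "Extend"), ("Extend", "Link")))
NO_BREAK |= set(product(("Link", "Precede"), ("Precede", "Base")))


def clusters(text):
    """Split text into grapheme clusters; "Any ÷" applies everywhere else."""
    out = []
    for ch in text:
        if out and (char_class(out[-1][-1]), char_class(ch)) in NO_BREAK:
            out[-1] += ch
        else:
            out.append(ch)
    return out
```

  With this sketch, clusters("a\u200db") keeps the join control with
  the "b" that follows it rather than making it a cluster of its own.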

  There are two differences in behaviour as a result of the modified rules:

   a) a sequence of join controls is considered to belong to the grapheme
      cluster that follows them.
   b) scripts that have characters that combine with the following
      base character, like logical-order encoding of Tengwar, can be
      supported by adding those characters to 'Precede'.

  a) means that there are no 'invisible' grapheme clusters as a result
  of join controls. This means that additional arrow keystrokes are not
  needed to step over join controls, and that join controls are
  deleted when the grapheme that follows them is deleted.

  (Of course, an editor could have a mode that makes normally invisible
  controls visible; in that case they would be treated like base characters
  for grapheme breaking.)

  There can still be invisible grapheme clusters as a result of other
  characters in the set 'Default_Ignorable_Code_Point'; those should
  all be looked at more closely, to see whether it would be better to
  put some of them in the Extend or Precede categories.

  For example, when language tagging is used, it makes no sense to treat
  each tag character as an invisible grapheme, so the tag characters
  should probably be in 'Precede' (since they apply to, and should be
  deleted with, the text that follows them).

Grapheme breaking for variation selectors

  The Mongolian and generic variation selectors are not listed in
  Grapheme_Extend, in DerivedCoreProperties-3.2.0d4.txt.

  They should be listed: they are category Mn and are not in
  Grapheme_Link, and Grapheme_Extend is supposed to have been generated
  as 'Me + Mn + Mc + Other_Grapheme_Extend - Grapheme_Link'. Possibly
  the derived properties have not been properly regenerated?
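
  The derivation can be sketched in Python as below. This is only an
  illustration: the function name and parameters are my own, the
  Other_Grapheme_Extend and Grapheme_Link sets have to be supplied by
  the caller, and Python's unicodedata module reflects whatever Unicode
  version the interpreter ships, not the 3.2 draft.

```python
import unicodedata


def derive_grapheme_extend(code_points, other_grapheme_extend, grapheme_link):
    """Me + Mn + Mc + Other_Grapheme_Extend - Grapheme_Link, over code_points."""
    marks = {
        cp for cp in code_points
        if unicodedata.category(chr(cp)) in ("Me", "Mn", "Mc")
    }
    return (marks | set(other_grapheme_extend)) - set(grapheme_link)


# Mongolian free variation selectors and the generic variation selectors.
VARIATION_SELECTORS = list(range(0x180B, 0x180E)) + list(range(0xFE00, 0xFE10))
```

  Since the variation selectors are Mn, they fall into the derived set
  unless they are explicitly listed in Grapheme_Link.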

Hangul Jamo ranges

  For grapheme breaking and in the definition of a standard Hangul
  syllable, the definitions of Hangul Jamo types are given as:

    L = U+1100..115F
    V = U+1160..11A2
    T = U+11A8..11F9

  I think V should be defined as U+1160..11A7, and T as U+11A8..11FF,
  i.e. including unassigned characters (note that the range for L
  already includes some unassigned characters). That would allow a
  small number of new conjoining Jamo of any type to be assigned,
  without changes to implementations of grapheme breaking or Hangul
  filler "standardisation".
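
  To make the difference concrete, here is a small Python sketch that
  classifies a code point under the current and the proposed ranges.
  The table and function names are my own, not from the draft.

```python
# Inclusive code point ranges for each conjoining Jamo type.
ORIGINAL = {"L": (0x1100, 0x115F), "V": (0x1160, 0x11A2), "T": (0x11A8, 0x11F9)}
PROPOSED = {"L": (0x1100, 0x115F), "V": (0x1160, 0x11A7), "T": (0x11A8, 0x11FF)}


def jamo_type(cp, ranges):
    """Return 'L', 'V' or 'T' for a code point, or None if it is not a Jamo."""
    for name, (lo, hi) in ranges.items():
        if lo <= cp <= hi:
            return name
    return None
```

  Under ORIGINAL, jamo_type(0x11A5, ORIGINAL) gives None; under
  PROPOSED the same code point is already classified as V, so a vowel
  assigned there later would need no change to the implementation.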

Variation Selectors

  Suppose a character C has variants, but is encoded without any
  variation selector; call this "plain C". It is not clear whether
  plain C is required to be shown using the reference glyph.

  If plain C may be shown as any variant, then there is no way to
  explicitly specify use of the reference glyph. That is undesirable
  for both the math variants and the Mongolian variants.

  Therefore the standard should say that a plain character that has
  variants should always be displayed with the reference glyph
  (up to style differences). This is not unnecessarily strict, and
  mathematical Unicode fonts will need to be changed anyway to
  support the variation selector mechanism properly.

  An alternative, for the mathematical variants, would be to make
  VS1 specify what is now the reference glyph, and VS2 specify the
  alternate glyph. The plain character would mean either. However,
  that has several disadvantages:

   - "C" and "C + VS1" would be almost always indistinguishable, but
     not canonically equivalent.
   - users who want the reference glyphs would need to make sure that
     VS1 was added, but would have no obvious way of verifying whether
     it actually has been added, short of displaying control characters.

Names of properties

  Change the name of the "Radical" property to "CJK_Radical", since it
  does not include Yi radicals, and presumably would not include any
  future non-Han/CJK radicals.

  "Default_Ignorable_Code_Point" is quite awkward; just "Default_Ignorable"
  would be better. Unassigned characters are still characters, not code
  points.

U+03F6 GREEK REVERSED LUNATE EPSILON SYMBOL

  This character should have category Ll, BiDi class L, and
  script GREEK, for consistency with the other Greek math/technical
  symbols.

New math characters

  Consider the new math characters with names that match:

    * WITH DOT ABOVE|BELOW (U+0307,0323)
    * WITH TILDE ABOVE|BELOW (U+0303,0330)
    * WITH CIRCUMFLEX ACCENT [ABOVE] (U+0302)
    * WITH COMMA ABOVE (U+0313)
    * WITH LEFT|RIGHT ARROW ABOVE (U+20D6,20D7)
    * WITH FOUR DOTS ABOVE (U+20DC)
    * WITH PLUS SIGN BELOW (U+031F)

  All of these can be represented by combining sequences using the
  combining characters listed in parentheses. (There are more if you
  count stroke overlays, or unify overbar/underbar with macron; I only
  included the cases that are definitely equivalent.)

  However, none of these decompositions are given in the UnicodeData
  file - why not?

  I realise that they would need to have composition exclusions, and
  so would end up being decomposed in normalised contexts. That is
  a good argument for not encoding these characters at all (mathematical
  Unicode fonts are supposed to support high quality composition anyway).
  However, if it is too late to stop them being encoded, they should
  definitely have canonical equivalences, since there is neither a
  semantic nor a visual distinction between the composed and decomposed
  forms.
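
  Whether two strings are canonically equivalent can be checked by
  comparing their NFD forms. A minimal Python sketch follows; the
  function name is my own, and the result depends on the Unicode
  version shipped with the interpreter.

```python
import unicodedata


def canonically_equivalent(a, b):
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
```

  For instance, canonically_equivalent("\u2a6a", "\u223c\u0307") asks
  whether TILDE OPERATOR WITH DOT ABOVE is equivalent to TILDE OPERATOR
  followed by COMBINING DOT ABOVE, which is the kind of equivalence
  requested above.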

I've not finished yet... More tomorrow about problems with the
conformance chapter, Hangul, and case folding.

- --
David Hopwood <david.hopwood@zetnet.co.uk>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip

-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPFEbazkCAxeYt5gVAQE0Twf+J4HW19NzchdQAAaiZqsviD0BQAMlm6O4
mseuSPUu5xGyTfEKDEGLrJhnzwlVj+HiXoJnBwksbHpaxamR0izPKeG4bke/X4pT
z9+lzimU8CYpI+JIn+aw83GtQOu5do1GgZbeSVA8LPGga7PMct3bvhkFcCJg2j/R
+Y1Jb80v0BkXRv4j7Fl8sOuilvqFHhllZ53TwOPBHwmV4QaI5n+ZXCI6LrI8ZeMF
XPNrWQCQuCb12WVlX7FRiuJBKXDtXIQaCmTbwXFoveyViq8NfaOO/2t9ePYvieit
9R+JFr9hFlc0b+cwrKDRSi3j8MiwmfU4jaouCkWIeUZ8FxIrxdwZpA==
=BeTj
-----END PGP SIGNATURE-----



This archive was generated by hypermail 2.1.2 : Sat Jan 26 2002 - 05:29:26 EST