-----BEGIN PGP SIGNED MESSAGE-----
Grapheme breaking
There are two errors in the grapheme breaking rules:
1. In the rule to prevent breaking CRLF, 'not CR' is used instead
of 'CR'.
2. There are no rules for "Extend × Extend" and "Extend × Link".
Without them, breaks are allowed within combining sequences, and
between a combining sequence and GRAPHEME JOINER, for example.
Here is a minimally corrected version of the existing rules:
CR × LF
Base × Extend
Extend × Extend
Link × Base
Link × Join_Control Base
Base × Link
Extend × Link
L × (L / V / LV / LVT)
(LV / V) × (V / T)
(LVT / T) × T
Any ÷
[6 of these rules can alternatively be written as:
(Base / Extend) × (Extend / Link)
Link × [Join_Control] Base
]
However, I suggest using the following rules instead, where
Precede = Join_Control:
CR × LF
Precede × Precede
Precede × Base
Base × Extend
Extend × Extend
Link × Precede
Link × Base
Base × Link
Extend × Link
L × (L / V / LV / LVT)
(LV / V) × (V / T)
(LVT / T) × T
Any ÷
[8 of these rules can alternatively be written as:
(Base / Extend) × (Extend / Link)
(Link / Precede) × (Precede / Base)
Note that the "Link × Join_Control Base" rule is implemented instead
by "Link × Precede" and "Precede × Base". This has the side effect
that more than one Join_Control can appear between the Link and Base,
but that should make no practical difference.]
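As a rough illustration of the suggested rules, here is a minimal Python
sketch. It works on sequences of class labels rather than real characters,
because mapping characters to classes depends on property data that is not
reproduced here. The names no_break and cluster_lengths are made up for this
example, and it assumes each character has exactly one class.

```python
# Hangul pairs between which the suggested rules forbid a break:
#   L x (L / V / LV / LVT), (LV / V) x (V / T), (LVT / T) x T
HANGUL_NO_BREAK = {
    ("L", "L"), ("L", "V"), ("L", "LV"), ("L", "LVT"),
    ("LV", "V"), ("LV", "T"), ("V", "V"), ("V", "T"),
    ("LVT", "T"), ("T", "T"),
}


def no_break(left, right):
    """Return True if the suggested rules forbid a break between two classes."""
    if (left, right) == ("CR", "LF"):
        return True
    # (Base / Extend) x (Extend / Link)
    if left in ("Base", "Extend") and right in ("Extend", "Link"):
        return True
    # (Link / Precede) x (Precede / Base)
    if left in ("Link", "Precede") and right in ("Precede", "Base"):
        return True
    # Hangul rules; anything not matched falls through to "Any ÷" (break).
    return (left, right) in HANGUL_NO_BREAK


def cluster_lengths(classes):
    """Split a sequence of class labels into grapheme cluster lengths."""
    lengths = []
    for i, cls in enumerate(classes):
        if i > 0 and no_break(classes[i - 1], cls):
            lengths[-1] += 1  # extend the current cluster
        else:
            lengths.append(1)  # start a new cluster
    return lengths
```

For instance, cluster_lengths(["Base", "Precede", "Precede", "Base"]) puts
the two Precede characters in the same cluster as the Base that follows
them, not the one before.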
There are two differences in behaviour as a result of the modified rules:
a) a sequence of join controls is considered to belong to the grapheme
cluster that follows them.
b) scripts that have characters that combine with the following
base character, like logical-order encoding of Tengwar, can be
supported by adding those characters to 'Precede'.
Difference a) means that there are no 'invisible' grapheme clusters as a
result of join controls. Additional arrow keystrokes are therefore not
needed to step over join controls, and join controls are deleted when
the grapheme that follows them is deleted.
(Of course, an editor could have a mode that makes normally invisible
controls visible; in that case they would be treated like base characters
for grapheme breaking.)
There can still be invisible grapheme clusters as a result of other
characters in the set 'Default_Ignorable_Code_Point'; those should
all be looked at more closely, to see whether it would be better to
put some of them in the Extend or Precede categories.
For example, when language tagging is used, it makes no sense to treat
each tag character as an invisible grapheme, so the tag characters
should probably be in 'Precede' (since they apply to, and should be
deleted with, the text that follows them).
Grapheme breaking for variation selectors
The Mongolian and generic variation selectors are not listed in
Grapheme_Extend, in DerivedCoreProperties-3.2.0d4.txt.
They should be listed: they are category Mn and not in Grapheme_Link,
and Grapheme_Extend is supposed to have been generated as
'Me + Mn + Mc + Other_Grapheme_Extend - Grapheme_Link'. Possibly
the derived properties have not been properly regenerated?
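The derivation can be checked mechanically. The sketch below is only an
illustration: it uses Python's unicodedata module, which reflects the
interpreter's Unicode version and not the 3.2.0d4 draft, and the
Other_Grapheme_Extend and Grapheme_Link sets are left as empty placeholders
because their contents are not reproduced here. The function name
in_derived_extend is made up for this example.

```python
import unicodedata

# Mongolian free variation selectors (U+180B..180D) and the generic
# variation selectors (U+FE00..FE0F).
VARIATION_SELECTORS = [*range(0x180B, 0x180E), *range(0xFE00, 0xFE10)]


def in_derived_extend(cp, other_grapheme_extend=frozenset(),
                      grapheme_link=frozenset()):
    """Apply 'Me + Mn + Mc + Other_Grapheme_Extend - Grapheme_Link' to cp.

    The two set arguments are placeholders (assumptions); pass the real
    property sets to get a meaningful answer for those characters.
    """
    if cp in grapheme_link:
        return False
    is_mark = unicodedata.category(chr(cp)) in ("Me", "Mn", "Mc")
    return is_mark or cp in other_grapheme_extend


# Variation selectors that the formula would NOT place in Grapheme_Extend.
missing = [hex(cp) for cp in VARIATION_SELECTORS if not in_derived_extend(cp)]
```

If the variation selectors are Mn and not in Grapheme_Link, missing comes
out empty, i.e. the formula puts all of them in Grapheme_Extend.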
Hangul Jamo ranges
For grapheme breaking and in the definition of a standard Hangul
syllable, the definitions of Hangul Jamo types are given as:
L = U+1100..115F
V = U+1160..11A2
T = U+11A8..11F9
I think V should be defined as U+1160..11A7, and T as U+11A8..11FF,
i.e. including unassigned characters (note that the range for L
already includes some unassigned characters). That would allow a
small number of new conjoining Jamo of any type to be assigned,
without changes to implementations of grapheme breaking or Hangul
filler "standardisation".
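To make the effect of the wider ranges concrete, here is a small Python
sketch comparing the two sets of definitions. The names ORIGINAL, PROPOSED
and jamo_type are made up for this example.

```python
# Inclusive code point ranges for each conjoining Jamo type.
ORIGINAL = {"L": (0x1100, 0x115F), "V": (0x1160, 0x11A2), "T": (0x11A8, 0x11F9)}
PROPOSED = {"L": (0x1100, 0x115F), "V": (0x1160, 0x11A7), "T": (0x11A8, 0x11FF)}


def jamo_type(cp, ranges):
    """Return 'L', 'V' or 'T' for a code point, or None if it is in no range."""
    for name, (lo, hi) in ranges.items():
        if lo <= cp <= hi:
            return name
    return None


# Code points in the Hangul Jamo block that only the proposed ranges classify.
gained = [cp for cp in range(0x1100, 0x1200)
          if jamo_type(cp, ORIGINAL) is None and jamo_type(cp, PROPOSED)]
```

With the proposed ranges every code point in U+1100..11FF has a type, so a
Jamo later assigned at, say, U+11A3 would already be treated as V.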
Variation Selectors
Suppose a character C has variants, but is encoded without any
variation selector; call this "plain C". It is not clear whether
plain C is required to be shown using the reference glyph.
If plain C may be shown as any variant, then there is no way to
explicitly specify use of the reference glyph. That is undesirable
for both the math variants and the Mongolian variants.
Therefore the standard should say that a plain character that has
variants should always be displayed with the reference glyph
(up to style differences). This is not unnecessarily strict, and
mathematical Unicode fonts will need to be changed anyway to
support the variation selector mechanism properly.
An alternative, for the mathematical variants, would be to make
VS1 specify what is now the reference glyph, and VS2 specify the
alternate glyph. The plain character would mean either. However,
that has several disadvantages:
- "C" and "C + VS1" would be almost always indistinguishable, but
not canonically equivalent.
- users who want the reference glyphs would need to make sure that
VS1 was added, but would have no obvious way of verifying whether
it actually has been added, short of displaying control characters.
Names of properties
Change the name of the "Radical" property to "CJK_Radical", since it
does not include Yi radicals, and presumably would not include any
future non-Han/CJK radicals.
"Default_Ignorable_Code_Point" is quite awkward; just "Default_Ignorable"
would be better. Unassigned characters are still characters, not code
points.
U+03F6 GREEK REVERSED LUNATE EPSILON SYMBOL
This character should have category Ll, BiDi class L, and
script GREEK, for consistency with the other Greek math/technical
symbols.
New math characters
Consider the new math characters with names that match:
* WITH DOT ABOVE|BELOW (U+0307,0323)
* WITH TILDE ABOVE|BELOW (U+0303,0330)
* WITH CIRCUMFLEX ACCENT [ABOVE] (U+0302)
* WITH COMMA ABOVE (U+0313)
* WITH LEFT|RIGHT ARROW ABOVE (U+20D6,20D7)
* WITH FOUR DOTS ABOVE (U+20DC)
* WITH PLUS SIGN BELOW (U+031F)
All of these can be represented by combining sequences using the
characters on the right. (There are more if you count stroke
overlays, or unify overbar/underbar with macron; I only included
the cases that are definitely equivalent.)
However, none of these decompositions are given in the UnicodeData
file - why not?
I realise that they would need to have composition exclusions, and
so would end up being decomposed in normalised contexts. That is
a good argument for not encoding these characters at all (mathematical
Unicode fonts are supposed to support high quality composition anyway).
However, if it is too late to stop them being encoded, they should
definitely have canonical equivalences, since there is neither a
semantic nor a visual distinction between the composed and decomposed
forms.
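Whether a given character has a canonical decomposition, and whether two
strings are canonically equivalent, can be tested with a few lines of
Python. This is only a sketch: it uses the Unicode data shipped with the
Python interpreter, not the draft data files discussed here, and both
function names are made up for this example.

```python
import unicodedata


def has_canonical_decomposition(ch):
    """True if the character data gives ch a canonical decomposition.

    Compatibility decompositions carry a <tag> prefix and do not count.
    """
    decomp = unicodedata.decomposition(ch)
    return bool(decomp) and not decomp.startswith("<")


def canonically_equivalent(a, b):
    """True if two strings have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
```

To test one of the characters above, pass the precomposed math character to
has_canonical_decomposition, or compare it against its base character
followed by the combining mark using canonically_equivalent.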
I've not finished yet... More tomorrow about problems with the
conformance chapter, Hangul, and case folding.
- --
David Hopwood <david.hopwood@zetnet.co.uk>
Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip
-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv
iQEVAwUBPFEbazkCAxeYt5gVAQE0Twf+J4HW19NzchdQAAaiZqsviD0BQAMlm6O4
mseuSPUu5xGyTfEKDEGLrJhnzwlVj+HiXoJnBwksbHpaxamR0izPKeG4bke/X4pT
z9+lzimU8CYpI+JIn+aw83GtQOu5do1GgZbeSVA8LPGga7PMct3bvhkFcCJg2j/R
+Y1Jb80v0BkXRv4j7Fl8sOuilvqFHhllZ53TwOPBHwmV4QaI5n+ZXCI6LrI8ZeMF
XPNrWQCQuCb12WVlX7FRiuJBKXDtXIQaCmTbwXFoveyViq8NfaOO/2t9ePYvieit
9R+JFr9hFlc0b+cwrKDRSi3j8MiwmfU4jaouCkWIeUZ8FxIrxdwZpA==
=BeTj
-----END PGP SIGNATURE-----
This archive was generated by hypermail 2.1.2 : Sat Jan 26 2002 - 05:29:26 EST