From: Kenneth Whistler (firstname.lastname@example.org)
Date: Thu Jul 17 2003 - 20:18:45 EDT
> > 282 MES-2 is specified by the following ranges of code positions as
> > indicated for each row...
Philippe Verdy asked:
> As most of these characters are canonically decomposable, shouldn't this
> list include also the decomposed characters?
> Why is row 03 so restricted? Shouldn't it include those accents and
> diacritics that are used by other characters once canonically
> decomposed? Or does it imply that MES-2 is only supposed to use
> strings in NFC form?
MES-2 (and all the rest of the Multilingual European Subsets) are
a CEN construct. See the CEN Workshop Agreement, CWA 13873:2000
posted at Michael Everson's site:
Among other things, that CWA states:
"This CWA does *not* specify any encoding of the European Subsets."
so conceptually it is more like a repertoire listing.
MES-2 is formally listed in 10646 as one of the normative subsets
there, but since 10646 has no concepts of decomposition, normalization,
or equivalence, the fact that MES-2 contains precomposed characters
but not their decompositions or the relevant combining accents
is formally irrelevant.
The Unicode Standard does not make subsets a normative construct
for that standard and doesn't even mention MES-2. Conformance to
10646 doesn't require you to make use of its subsets, but if anyone
is worried about the articulation of the standards, the Unicode
Standard itself formally consists of Subset 305 of 10646:2003,
namely the "UNICODE 4.0" subset -- the subset which contains *all*
of the encoded characters of 10646:2003.
Think of the Multilingual European Subsets as a
way for people in Europe associated with standards organizations
and governments to try to communicate with software vendors
regarding which "user characters" they want to ensure are
supported by their software. The CWA 13873 contains some
questionable presuppositions about how software vendors are
actually proceeding to roll out their Unicode support, but
the intent of the CWA is clear:
"It is estimated that implementing the full character set of the
UCS may be costly in the first stages of UCS use, and that many
manufacturers will implement in subset-stages. To ensure that a
common subset usable to the vast majority of European users be
available for a reasonable price, and as a guide to manufacturers,
it will be helpful to specify, to users and procurers of systems,
European subsets of the UCS encompassing the characters for use
in European languages as well as other frequently used and
> Also, is this list under full closure with existing character properties, like
> NFKD decompositions, and case mappings?
MES-2 is clearly *not* closed under NFD, NFKD, or NFKC normalizations.
Although less obvious, it is also not closed under NFC
normalization. For example, it includes the angle brackets
U+2329, U+232A, but not their canonical equivalents,
U+3008, U+3009. There are also some characters outside the MES-2
repertoire where NFC(x) *is* in the MES-2 repertoire. Singleton canonical
equivalences like U+212B ANGSTROM SIGN come to mind, for example.
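For anyone who wants to check this kind of closure mechanically, here is a
minimal sketch in Python using the standard unicodedata module. The
repertoire below is a tiny hypothetical sample (the two angle brackets plus
U+00C5), not the actual MES-2 list, and the names SAMPLE and
normalization_leaks are made up for illustration. Substitute the real code
position ranges from the CWA to test MES-2 itself.

```python
import unicodedata

# Hypothetical three-character sample, NOT the real MES-2 repertoire.
SAMPLE = {"\u2329", "\u232A", "\u00C5"}


def normalization_leaks(repertoire, form):
    """Return {char: normalized} for each member of `repertoire` whose
    normalization under `form` contains a character outside `repertoire`."""
    leaks = {}
    for ch in sorted(repertoire):
        normalized = unicodedata.normalize(form, ch)
        if any(c not in repertoire for c in normalized):
            leaks[ch] = normalized
    return leaks


# The angle brackets are singleton canonical equivalents, so even NFC
# maps them out of the sample (to U+3008 and U+3009).
nfc_leaks = normalization_leaks(SAMPLE, "NFC")

# Under NFD, U+00C5 also leaves the sample, since it decomposes to
# U+0041 followed by U+030A COMBINING RING ABOVE.
nfd_leaks = normalization_leaks(SAMPLE, "NFD")

# The converse case: U+212B is outside the sample, but NFC(U+212B) is
# U+00C5, which is inside it.
angstrom_nfc = unicodedata.normalize("NFC", "\u212B")
```

An empty result from normalization_leaks for a given form would mean the
repertoire is closed under that form. The converse case would need a scan
over characters outside the repertoire instead.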
I haven't checked on case mappings and case foldings, but would
not be too surprised to find an anomaly or two there, as well.
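A check for case mappings could be sketched along the same lines. This is
only an illustration: it relies on Python's built-in str.upper, str.lower,
and str.casefold rather than on the Unicode data files directly, the
repertoire is again a hypothetical sample and not MES-2, and the names
CASE_SAMPLE and case_leaks are invented here.

```python
# Hypothetical sample, NOT the real MES-2 repertoire.
CASE_SAMPLE = {"a", "A", "\u00DF"}


def case_leaks(repertoire):
    """Return {char: {operation: result}} for each member whose upper,
    lower, or casefold result contains a character outside `repertoire`."""
    leaks = {}
    for ch in sorted(repertoire):
        results = {
            "upper": ch.upper(),
            "lower": ch.lower(),
            "casefold": ch.casefold(),
        }
        for name, mapped in results.items():
            if any(c not in repertoire for c in mapped):
                leaks.setdefault(ch, {})[name] = mapped
    return leaks


# U+00DF uppercases to "SS" and casefolds to "ss"; neither "S" nor "s"
# is in the sample, so both operations are reported for it.
found = case_leaks(CASE_SAMPLE)
```

Run against the real repertoire, any non-empty result would be one of the
anomalies in question.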
MES-2 was not designed by the UTC, nor did it take any of
these considerations into account. It is not really an
appropriate construct for the Unicode Standard. A more
meaningful way to think of it is: if you want to sell software
in Europe, you had better be able to input and display all the
characters we Europeans have in this list.