Re: About the European MES-2 subset (was: PUA Audio Description, Subtitle, Signing)

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jul 17 2003 - 20:18:45 EDT


    > > 282 MES-2 is specified by the following ranges of code positions as
    > > indicated for each row...

    Philippe Verdy asked:

    > As most of these characters are canonically decomposable, shouldn't this
    > list include also the decomposed characters?
    >
    > Why is row 03 so restricted? Shouldn't it include those accents and
    > diacritics that are used by other characters once canonically
    > decomposed? Or does it imply that MES-2 is only supposed to use
    > strings in NFC form?

    MES-2 (and all the rest of the Multilingual European Subsets) are
    a CEN construct. See the CEN Workshop Agreement, CWA 13873:2000
    posted at Michael Everson's site:

    http://www.evertype.com/standards/iso10646/pdf/cwa13873.pdf

    Among other things, that CWA states:

    "This CWA does *not* specify any encoding of the European Subsets."

    so conceptually it is more like a repertoire listing.

    MES-2 is formally listed in 10646 as one of the normative subsets
    there, but since 10646 has no concepts of decomposition, normalization,
    or equivalence, the fact that MES-2 contains precomposed characters
    but not their decompositions or the relevant combining accents
    is formally irrelevant.

    The Unicode Standard does not make subsets a normative construct
    for that standard and doesn't even mention MES-2. Conformance to
    10646 doesn't require you to make use of its subsets, but if anyone
    is worried about the articulation of the standards, the Unicode
    Standard itself formally consists of Subset 305 of 10646:2003,
    namely the "UNICODE 4.0" subset -- the subset which contains *all*
    of the encoded characters of 10646:2003.

    Think of the Multilingual European Subsets as a way for people
    in Europe associated with standards organizations and governments
    to tell software vendors which "user characters" they want to be
    sure that software supports. CWA 13873 contains some
    questionable presuppositions about how software vendors are
    actually proceeding to roll out their Unicode support, but
    the intent of the CWA is clear:

    "It is estimated that implementing the full character set of the
    UCS may be costly in the first stages of UCS use, and that many
    manufacturers will implement in subset-stages. To ensure that a
    common subset usable to the vast majority of European users be
    available for a reasonable price, and as a guide to manufacturers,
    it will be helpful to specify, to users and procurers of systems,
    European subsets of the UCS encompassing the characters for use
    in European languages as well as other frequently used and
    specialist characters."

    > Also, is this list under full closure with existing character properties, like
    > NFKD decompositions, and case mappings?

    MES-2 is clearly *not* closed under NFD, NFKD, or NFKC normalizations.

    Although less obvious, it is also not closed under NFC
    normalization. For example, it includes the angle brackets
    U+2329, U+232A, but not their canonical equivalents,
    U+3008, U+3009. There are also some characters outside the MES-2
    repertoire where NFC(x) *is* in the MES-2 repertoire. Singleton canonical
    equivalences like U+212B ANGSTROM SIGN come to mind, for example.
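
    The two NFC observations above can be checked with a minimal sketch
    along the following lines. It assumes Python's standard unicodedata
    module; the helper name nfc_escapes and the three-character toy
    repertoire are hypothetical, chosen only for illustration, and are
    not the MES-2 list itself.

```python
import unicodedata

def nfc_escapes(repertoire):
    """Map each member whose NFC form leaves the repertoire to that NFC form."""
    escapes = {}
    for ch in sorted(repertoire):
        normalized = unicodedata.normalize("NFC", ch)
        # A member "escapes" if any character of its NFC form is not a member.
        if any(c not in repertoire for c in normalized):
            escapes[ch] = normalized
    return escapes

# Hypothetical toy repertoire (NOT the MES-2 list): the two angle brackets
# plus A WITH RING ABOVE, without U+3008, U+3009, or U+212B.
toy = {"\u2329", "\u232A", "\u00C5"}

# The angle brackets normalize to U+3008 and U+3009, which are not members.
print(nfc_escapes(toy))

# Conversely, U+212B is outside the toy set, but its NFC form is inside it.
print(unicodedata.normalize("NFC", "\u212B") in toy)
```

    Running the same helper over the real repertoire would require
    building the set from the ranges given in the subset definition.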

    I haven't checked on case mappings and case foldings, but would
    not be too surprised to find an anomaly or two there, as well.
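
    One way such a check on case mappings could be done is sketched
    below. It relies only on Python's built-in str.upper and str.lower;
    the helper name case_escapes and the toy repertoire are hypothetical,
    and nothing here asserts which anomalies MES-2 actually has.

```python
def case_escapes(repertoire):
    """Map each member to the characters its upper/lower mappings add from outside."""
    escapes = {}
    for ch in sorted(repertoire):
        for mapped in (ch.upper(), ch.lower()):
            # A mapping may expand to several characters, so test each one.
            missing = {c for c in mapped if c not in repertoire}
            if missing:
                escapes.setdefault(ch, set()).update(missing)
    return escapes

# Hypothetical toy repertoire (NOT the MES-2 list): it has "a", "A" and
# U+00DF LATIN SMALL LETTER SHARP S, but no capital "S".
toy = {"a", "A", "\u00DF"}

# The uppercase mapping of U+00DF is "SS", and "S" is not a member.
print(case_escapes(toy))
```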

    MES-2 was not designed by the UTC, nor did it take any of
    these considerations into account. It is not really an
    appropriate construct for the Unicode Standard. A more
    meaningful way to think of it is: if you want to sell software
    in Europe, you had better be able to input and display all the
    characters we Europeans have in this list.

    --Ken


