RE: Level of Unicode support required for various languages

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Oct 31 2007 - 09:17:53 CST


    David Starner wrote:
    > On 10/30/07, vunzndi@vfemail.net <vunzndi@vfemail.net> wrote:
    > > Quoting "Mark E. Shoulson" <mark@kli.org>:
    > > > vunzndi@vfemail.net wrote:
    > > >> The minimum is likely to be about 30 thousand, to be honest nobody
    > > >> knows what the upper limit is, but 100k would not be inconceivable.
    > > > I knew that dropping planes 17-65535 was a bad idea!

    A bad idea for what? If this means dropping planes that would have been
    needed for private use, there is ample space in the PUA blocks to create
    private surrogates and to extend that limit at will, supporting billions of
    private-use characters.

    > > Yes, this has always struck me as a strange decision, but it would
    > > seem to be one that can be reversed without any stability issues.
    > Except for breaking all the code out there that uses UTF-16, SCSU, or
    > otherwise depends on the limit.

    Yes: allocating private surrogates within the PUA blocks will not affect
    UTF-16 or SCSU requirements. The code points will be treated as if they were
    isolated PUA characters, even when they are actually used in sequences as
    private surrogates; these UTFs will not detect the sequence boundaries, but
    that does not matter here, since the interpretation of PUA characters is
    left to private agreement.

    The only important requirement for applications is to ensure that sequences
    of PUA characters are not reordered or truncated without knowledge of their
    semantics. At best, one could suggest that ANY sequence of PUA characters be
    treated as a transparent, unbreakable binary object, until one of the PUA
    characters is recognized under a locally agreed PUA convention.

    So a sequence like <E400 E800 EC00> can still be privately interpreted as
    the encoded form of a single private entity (each code point here serves as
    a private surrogate: a leading high "private surrogate", a middle one, and a
    low one, each carrying 10 bits of information), even though Unicode-compliant
    applications see the sequence as three distinct PUA characters. The sequence
    contains none of the standard surrogates in D800..DBFF and DC00..DFFF, so it
    does not break UTF-16 rules.
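    The private-surrogate scheme sketched above could look like the following
    (a minimal sketch; the three base code points E400/E800/EC00 are taken from
    the example in this post, and the scheme itself is a private convention, not
    anything defined by Unicode):

    ```python
    # Hypothetical private-surrogate convention: three disjoint BMP PUA ranges,
    # each carrying 10 bits of a 30-bit private entity identifier.
    HIGH_BASE, MID_BASE, LOW_BASE = 0xE400, 0xE800, 0xEC00

    def encode_private(value: int) -> str:
        """Pack a 30-bit private entity id into three BMP PUA code points."""
        assert 0 <= value < 1 << 30
        return "".join(chr(base + ((value >> shift) & 0x3FF))
                       for base, shift in ((HIGH_BASE, 20), (MID_BASE, 10), (LOW_BASE, 0)))

    def decode_private(s: str) -> int:
        """Recover the 30-bit id from a three-code-point PUA sequence."""
        high, mid, low = (ord(c) for c in s)
        return ((high - HIGH_BASE) << 20) | ((mid - MID_BASE) << 10) | (low - LOW_BASE)

    entity = encode_private(123456789)
    # Every code point stays inside the BMP PUA (E000..F8FF), and none falls in
    # the standard surrogate range D800..DFFF, so UTF-16 is unaffected.
    assert all(0xE000 <= ord(c) <= 0xF8FF for c in entity)
    assert decode_private(entity) == 123456789
    ```
    
    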

    Now, if more than one billion private-use entities is not enough, you could
    use longer sequences like <F0000..F3FFF, F4000..F7FFF, F8000..FBFFF> (using
    PUA code points in plane 15): each code point here encodes 14 bits of
    information, the whole sequence representing 42 bits (more than four
    trillion encodable entities...). In UTF-32, it will be stored as 3 plane-15
    code points (total = 96 bits of storage, or 12 bytes); in UTF-16, as three
    surrogate pairs, i.e. six 16-bit code units (total = 96 bits, or 12 bytes);
    in UTF-8, as 3 sequences of 4 bytes (total = 12 bytes): there is no
    difference in space requirements here between UTF-8, UTF-16 and UTF-32.
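    The equal-storage claim is easy to check (the three sample code points below
    are arbitrary picks from the hypothetical plane-15 ranges):

    ```python
    # Any three plane-15 code points take 12 bytes in every UTF: supplementary
    # code points cost 4 bytes each in UTF-8, UTF-16 (one surrogate pair) and
    # UTF-32 alike.
    seq = "".join(chr(cp) for cp in (0xF1234, 0xF5678, 0xF9ABC))

    print(len(seq.encode("utf-8")))     # 3 x 4 bytes = 12
    print(len(seq.encode("utf-16-be")))  # 3 surrogate pairs x 4 bytes = 12
    print(len(seq.encode("utf-32-be")))  # 3 x 4 bytes = 12
    ```
    
    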

    This sequence will still be compatible with SCSU and other Unicode
    compression schemes without difficulty, as well as with any other encoding
    (GB18030?) that has a round-trip mapping with the existing Unicode PUA code
    points, does not change their relative encoding order, and does not truncate
    long sequences of PUA characters.

    In other words, there is no need to support more than 17 planes for the
    encoding of standard characters, and there is ample space for encoding any
    number of private entities (I won't call them characters, because that would
    conflict with Unicode's definition of a character) with the PUA code points
    that have already been allocated.

    The difficulty, then, is not in private-use characters, but in resisting the
    encoding of many new characters within the set of standard characters
    (within one of the first 15 planes) when this is not justified. A defensive
    way to protect the UCS from being filled with many ideographs would be to
    create a standard compositional encoding for the personal-name ideographs
    used in China and Taiwan.

    Imagine what would happen if each of the billion Chinese citizens wanted his
    own ideograph for his signature, and these signatures were made legal in the
    PRC or Taiwan! A stronger compositional model, based on the initial
    principles of the existing IDS as defined in TUS (but *with extensions* such
    as those used at the IRG), would help make the scheme workable for the very
    long term.
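    For readers unfamiliar with IDS: an Ideographic Description Sequence
    describes a CJK character's layout in terms of its components, using the
    description operators U+2FF0..U+2FFB defined in TUS. A small illustration
    (the decomposition of 好 shown here is a standard textbook example, not part
    of this post):

    ```python
    # An IDS is itself plain Unicode text: an operator followed by components.
    IDC_LEFT_RIGHT = "\u2FF0"  # U+2FF0 ⿰ : left-to-right composition

    # 好 (U+597D) can be described as ⿰女子: 女 (woman) beside 子 (child).
    ids = IDC_LEFT_RIGHT + "\u5973" + "\u5B50"
    print(ids)        # three code points describing one composed character
    print(len(ids))   # 3
    ```
    
    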

    ********

    Another way to handle this would be to define a core subset of the SVG
    graphics format for creating custom glyphs that remain comparable and
    assignable to a URI (such as a URI to a record in a national public registry
    containing the definition of the SVG file), and then define a way to safely
    transport this URI in documents: this requires no encoding in Unicode, and
    it would permit any number of personal names.

    One way to encode this in plain text would be to register characters in
    plane 14 for the encoding of these special glyph references (we already have
    language tags starting with E0001; we could reuse the rest of these special
    tag characters, using another prefix tag to indicate that what follows is a
    glyph reference).
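    A sketch of what such a tagged glyph reference might look like. The tag
    characters U+E0020..U+E007E (which mirror ASCII 0x20..0x7E) and U+E007F
    CANCEL TAG are real; the glyph-reference prefix U+E0002 and the registry URI
    are purely hypothetical assumptions for illustration:

    ```python
    # Hypothetical plane-14 glyph-reference scheme (NOT a standard mechanism).
    # U+E0001 is the real language-tag prefix; U+E0002 is assumed here as a
    # glyph-reference prefix, and U+E007F CANCEL TAG ends the tag run.
    GLYPH_TAG_PREFIX = "\U000E0002"  # assumed, not assigned by Unicode
    CANCEL_TAG = "\U000E007F"

    def tag_encode(uri: str) -> str:
        """Shift a printable-ASCII URI into the plane-14 tag-character range."""
        assert all(0x20 <= ord(c) <= 0x7E for c in uri)
        return GLYPH_TAG_PREFIX + "".join(chr(0xE0000 + ord(c)) for c in uri) + CANCEL_TAG

    def tag_decode(s: str) -> str:
        """Recover the ASCII URI from a tagged run."""
        body = s[1:-1]  # strip prefix tag and CANCEL TAG
        return "".join(chr(ord(c) - 0xE0000) for c in body)

    ref = tag_encode("https://registry.example/glyph/12345")  # example URI
    assert tag_decode(ref) == "https://registry.example/glyph/12345"
    # The whole run is invisible default-ignorable plane-14 text to any
    # application that does not know the convention.
    assert all(0xE0000 <= ord(c) <= 0xE007F for c in ref)
    ```
    
    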

    (A security risk is the possible creation of confusables, but this could be
    avoided by using only URIs hosted by certified national registries, where
    even the stored SVG may be tagged, and possibly redirected for unification
    purposes.)

    The national registry could also store properties in the XML entry pointed
    to by this URI, such as an IDS description string (to help in recognizing
    confusables) and the expected usage (the bearer's lifetime, frequency, dates
    of creation and expected end of usage, linguistic and semantic information,
    possible transcriptions and vocal spelling...).

    Philippe.



    This archive was generated by hypermail 2.1.5 : Wed Oct 31 2007 - 09:22:32 CST