Re: Planck's constant U+210E

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Apr 21 2006 - 16:56:52 CST


    From: "Richard Wordingham" <richard.wordingham@ntlworld.com>
    > Jukka K. Korpela wrote on Friday, April 21, 2006 at 7:02 PM
    >
    >> There's a potential future problem. Mathematicians keep inventing new
    >> symbols as they need them, using, say, Latin or Greek letters in some
    >> particular style (say, bold italic underlined and overlined - there are
    >> infinite possibilities).
    >
    > Do you mean seriffed, like <U+1D482, U+0331, U+0304>, or sans-serif, like
    > <U+1D656, U+0331, U+0304>? (I assume you didn't mean with connecting
    > underline and overline.)
    >
    > A few may be overlooked.
    >
    > But don't forget that the set of CJK ideographs isn't closed either. Very
    > little is actually closed. I think it would be prudent to reserve a
    > surrogate plane for fifty years if using sequences of three surrogates
    > (high-high-low and high-low-low) to extend UTF-16 is unacceptable. If
    > reserving it hurts, extension is obviously necessary. (Actually, only about
    > four thousand 'middle surrogate' points need be reserved if we use high-high
    > surrogate for middle surrogate-low surrogate for middle surrogate-low
    > surrogate for the points not needed during the lifetime of current UTC
    > members.)

    I already proposed to keep the few code points that lie just between the hangul syllables and the existing surrogates (U+D7B0..U+D7FF) unassigned until further notice, so that new types of surrogates could be created should they ever become necessary in some distant future. They could be used as high hyper-surrogates, in conjunction with the existing surrogate types, to map extra planes (but this would cause stability problems with existing or future normalizations of code points allocated in the supplementary planes, which could break those code points). This won't happen, however, if the code points taken from the supplementary planes to map the low hyper-surrogates are never assigned to anything else, and they would behave correctly with existing UTF-16 implementations without generating conflicts.

    For example, suppose that U+D7C0..U+D7DF are those high "hyper-surrogates": they create an extension of 32 "hyper-planes". The size of a hyper-plane should be at least a few tens of planes, so we would need to map positions within a hyper-plane using a middle hyper-surrogate and a low hyper-surrogate. There should be at least 8192 middle hyper-surrogates and 8192 low hyper-surrogates; both sets would be easily allocated within special plane 14.

    So we would have an encoding like this:
    * D7C0..D7DF: high hyper-surrogates (32 distinct values)
    * E8400..EA3FF: middle hyper-surrogates (8192 distinct values), mapped in UTF-16 as:
       · DB61..DB68: standard high 16-bit surrogates (8 values)
       · DC00..DFFF: standard low 16-bit surrogates (1024 values)
    * EA400..EC3FF: low hyper-surrogates (8192 distinct values), mapped in UTF-16 as:
       · DB69..DB70: standard high 16-bit surrogates (8 values)
       · DC00..DFFF: standard low 16-bit surrogates (1024 values)
    This would give 0x80000000 additional codepoints (more than 2 billion!), or 32768 more planes. But this would come at the price of length in UTF-16, because each such character would require five 16-bit code units (one BMP high hyper-surrogate plus two surrogate pairs).
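
    To make the arithmetic concrete, here is a small C sketch of what a decoder might do for this five-unit scheme, working at the code point level (i.e. after normal UTF-16 surrogate pairing of the middle and low hyper-surrogates); the function name and the error convention are only illustrative, and I assume the extended code points would be allocated contiguously starting at U+110000.

        #include <stdint.h>

        /* Sketch: combine one high, one middle and one low hyper-surrogate
         * (as already-decoded code points) into an extended scalar value.
         *   h   : U+D7C0..U+D7DF   (32 values,   5 bits)
         *   mid : U+E8400..U+EA3FF (8192 values, 13 bits)
         *   low : U+EA400..U+EC3FF (8192 values, 13 bits)
         * Extended values are assumed to start at U+110000; returns
         * UINT64_MAX if any argument is out of range. */
        static uint64_t decode_hyper5(uint32_t h, uint32_t mid, uint32_t low)
        {
            if (h < 0xD7C0 || h > 0xD7DF)       return UINT64_MAX;
            if (mid < 0xE8400 || mid > 0xEA3FF) return UINT64_MAX;
            if (low < 0xEA400 || low > 0xEC3FF) return UINT64_MAX;
            uint64_t v = ((uint64_t)(h - 0xD7C0) << 26)
                       | ((uint64_t)(mid - 0xE8400) << 13)
                       | (uint64_t)(low - 0xEA400);
            return 0x110000u + v;   /* 5+13+13 = 31 bits -> 0x80000000 values */
        }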

    We could have a shorter representation in UTF-16, using only three 16-bit code units, but this would require keeping only the low hyper-surrogates above (no middle ones), possibly from a single but larger range:
    * D7C0..D7DF: high hyper-surrogates (32 distinct values)
    * E8400..F03FF: low hyper-surrogates (32768 distinct values), mapped in UTF-16 as:
       · DB61..DB80: standard high 16-bit surrogates (32 values)
       · DC00..DFFF: standard low 16-bit surrogates (1024 values, minus some exceptions)
    This gives a total of 0x100000 codepoints (more than 1 million), filling only 16 more planes.
    Given that such a use would not be needed for a very long time, I wonder whether the smaller space requirement in UTF-16 (three 16-bit code units) is worth considering.
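
    For comparison, a similar C sketch of this three-unit variant, this time working directly on UTF-16 code units with the ranges proposed just above (again the name and error value are only illustrative, and the extra code points are assumed to start at U+110000):

        #include <stdint.h>

        /* Sketch: u0 is the BMP high hyper-surrogate, and <u1,u2> is the
         * ordinary surrogate pair of a low hyper-surrogate code point in
         * U+E8400..U+F03FF.  Returns UINT32_MAX if the triple is invalid. */
        static uint32_t decode_hyper3(uint16_t u0, uint16_t u1, uint16_t u2)
        {
            if (u0 < 0xD7C0 || u0 > 0xD7DF) return UINT32_MAX;
            if (u1 < 0xDB61 || u1 > 0xDB80) return UINT32_MAX;
            if (u2 < 0xDC00 || u2 > 0xDFFF) return UINT32_MAX;
            /* Reassemble the low hyper-surrogate code point (standard UTF-16). */
            uint32_t low_hs = 0x10000u + ((uint32_t)(u1 - 0xD800) << 10)
                                       + (u2 - 0xDC00);
            /* 32 * 32768 = 0x100000 extra code points, i.e. 16 planes. */
            return 0x110000u + ((uint32_t)(u0 - 0xD7C0) << 15)
                             + (low_hs - 0xE8400u);
        }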

    My favorite would be the representation on four 16-bit code units, also because each hyper-surrogate then fits in 32 bits and can be tested in a single load operation in a processor. In that case, we don't need to keep space in the BMP for new surrogate types, but we can consider keeping 16384 codepoints in special plane 14 to create two spaces for hyper-surrogates, so we would finally have:
    * E8400..EA3FF: high hyper-surrogates (8192 distinct values), mapped in UTF-16 as:
       · DB61..DB68: standard high 16-bit surrogates (8 values)
       · DC00..DFFF: standard low 16-bit surrogates (1024 values)
    * EA400..EC3FF: low hyper-surrogates (8192 distinct values), mapped in UTF-16 as:
       · DB69..DB70: standard high 16-bit surrogates (8 values)
       · DC00..DFFF: standard low 16-bit surrogates (1024 values)
    This would give 0x4000000 additional codepoints (more than 67 million), or 1024 more planes. Such ranges are easy to test, either with existing UTF-16 or with existing UTF-32, and do not require changing them (however they would remain too relaxed to verify the strict validity of the extended encoding).
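
    As a rough illustration of how simple the test can be, here is a possible C sketch of the encoding side of this four-unit scheme at the code point level (each resulting hyper-surrogate then becomes an ordinary surrogate pair in UTF-16); the function name is only illustrative, and I assume, as below, that the extended code points start at U+110000.

        #include <stdint.h>

        /* Sketch: split one extended scalar value (U+110000..U+410FFFF,
         * i.e. the 1024 extra planes) into a high hyper-surrogate in
         * U+E8400..U+EA3FF and a low hyper-surrogate in U+EA400..U+EC3FF.
         * Returns 0 on success, -1 if cp is outside the extended range. */
        static int encode_hyper4(uint32_t cp, uint32_t *high_hs, uint32_t *low_hs)
        {
            if (cp < 0x110000u || cp > 0x410FFFFu) return -1;
            uint32_t v = cp - 0x110000u;          /* 26 significant bits */
            *high_hs = 0xE8400u + (v >> 13);      /* upper 13 bits */
            *low_hs  = 0xEA400u + (v & 0x1FFFu);  /* lower 13 bits */
            return 0;
        }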

    In existing UTF-32, those code points would be represented as:
    * E8400..EA3FF: high hyper-surrogates (8192 distinct values)
    * EA400..EC3FF: low hyper-surrogates (8192 distinct values)
    They would represent all "hyper" codepoints in the 1024 "hyper" planes starting at plane 17 (so from U+110000 to U+410FFFF, a range that fits entirely within 32-bit positive values, with no plane needing to be dropped).

    The conversion to existing UTF-8 would be based on the two 32-bit code units above, so it would be two sequences of 4 bytes each.
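
    Nothing new is needed there: each hyper-surrogate is an ordinary supplementary-plane code point, so it takes the standard 4-byte UTF-8 form, sketched below for reference (two such sequences, 8 bytes in total, would carry one extended code point).

        #include <stdint.h>

        /* Standard 4-byte UTF-8 encoding of one supplementary-plane code
         * point (U+10000..U+10FFFF). */
        static void utf8_encode4(uint32_t cp, unsigned char out[4])
        {
            out[0] = (unsigned char)(0xF0 |  (cp >> 18));
            out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[2] = (unsigned char)(0x80 | ((cp >>  6) & 0x3F));
            out[3] = (unsigned char)(0x80 | ( cp        & 0x3F));
        }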

    In existing UTF-16, those code points would be represented as:
    * DB61..DB68: standard high 16-bit surrogates (8 values) for the high hyper-surrogates
    * DC00..DFFF: standard low 16-bit surrogates (1024 values)
    * DB69..DB70: standard high 16-bit surrogates (8 values) for the low hyper-surrogates
    * DC00..DFFF: standard low 16-bit surrogates (1024 values)
    (This is compatible with existing conforming conversions from UTF-32 to UTF-16.)
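
    Putting the two pairs together, a decoder for this four-unit UTF-16 form could look like the following sketch (names illustrative, extended code points again assumed to start at U+110000):

        #include <stdint.h>

        /* Sketch: decode four UTF-16 code units <hi1,lo1,hi2,lo2>, where
         * <hi1,lo1> is the surrogate pair of a high hyper-surrogate
         * (U+E8400..U+EA3FF) and <hi2,lo2> that of a low hyper-surrogate
         * (U+EA400..U+EC3FF).  Returns UINT32_MAX on any range error. */
        static uint32_t decode_hyper4_utf16(uint16_t hi1, uint16_t lo1,
                                            uint16_t hi2, uint16_t lo2)
        {
            if (hi1 < 0xDB61 || hi1 > 0xDB68 || lo1 < 0xDC00 || lo1 > 0xDFFF)
                return UINT32_MAX;
            if (hi2 < 0xDB69 || hi2 > 0xDB70 || lo2 < 0xDC00 || lo2 > 0xDFFF)
                return UINT32_MAX;
            uint32_t high_hs = 0x10000u + ((uint32_t)(hi1 - 0xD800) << 10) + (lo1 - 0xDC00);
            uint32_t low_hs  = 0x10000u + ((uint32_t)(hi2 - 0xD800) << 10) + (lo2 - 0xDC00);
            return 0x110000u + ((high_hs - 0xE8400u) << 13) + (low_hs - 0xEA400u);
        }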

    All of this would remain fully predictable from any random code unit position, with a small bounded number of backward lookups (up to one extra lookup for UTF-32, up to two extra lookups for UTF-16, and up to three extra lookups for UTF-8 to find the first lead byte, which, once decoded, may require one extra lookup to get the other lead byte).
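
    To illustrate the bounded backward lookup in UTF-16, a rough sketch (full validation deliberately omitted, names illustrative) of stepping back to the first code unit of the character containing an arbitrary position:

        #include <stddef.h>
        #include <stdint.h>

        /* Sketch: step back from position i to the first UTF-16 code unit of
         * the (possibly extended) character containing it.  At most two
         * backward steps beyond what plain UTF-16 already needs. */
        static size_t resync_utf16(const uint16_t *s, size_t i)
        {
            /* Standard UTF-16: back up from a low surrogate to its high surrogate. */
            if (i > 0 && s[i] >= 0xDC00 && s[i] <= 0xDFFF
                      && s[i-1] >= 0xD800 && s[i-1] <= 0xDBFF)
                i--;
            /* If this pair encodes a low hyper-surrogate (U+EA400..U+EC3FF),
             * back up over the preceding high hyper-surrogate pair as well. */
            if (i >= 2 && s[i] >= 0xDB69 && s[i] <= 0xDB70
                       && s[i-2] >= 0xDB61 && s[i-2] <= 0xDB68)
                i -= 2;
            return i;
        }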

    But of course, we could consider creating a new UTF that accepts a representation in a single 32-bit code unit for the codepoints that fall in the new planes above the 17 existing ones. UTF-8, UTF-16 and UTF-32 would need to be extended to cover the new range with stricter rules, but this would require renaming them.

    We would create UXF-8, UXF-16 and UXF-32, which would be fully backward compatible with UTF-8, UTF-16 and UTF-32 (this means that any document valid in a UXF form would be valid when decoded with the corresponding UTF form, but the reverse would not be true, given that there's no enforcement in the existing UTFs to check that the new "hyper-surrogates" are encoded in valid pairs).

    Note that UXF-8 (backward compatible with existing UTF-8, just extended to provide stricter checking of hyper-surrogate pairs) would "waste" bytes, as it would require 8 bytes instead of the 5 or 6 bytes of the old RFC 2279 UTF-8 format. I don't think this is a problem, as we are speaking about the long term.
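
    As a rough check of those byte counts, the length that the old, unrestricted RFC 2279 style of UTF-8 would need for a given scalar value can be computed as follows (sketch only, for comparison with the fixed 4 + 4 = 8 bytes of the hyper-surrogate representation):

        #include <stdint.h>

        /* Byte count of one code point under the old RFC 2279 definition of
         * UTF-8 (sequences of up to 6 bytes, covering up to 0x7FFFFFFF). */
        static int old_utf8_len(uint32_t cp)
        {
            if (cp < 0x80)       return 1;
            if (cp < 0x800)      return 2;
            if (cp < 0x10000)    return 3;
            if (cp < 0x200000)   return 4;
            if (cp < 0x4000000)  return 5;
            return 6;
        }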


