Hypersurrogates

From: William_J_G Overington (wjgo_10009@btinternet.com)
Date: Mon Jun 15 2009 - 05:44:38 CDT


    At the moment there are regular Unicode characters and there are Private Use Area characters.

    I write to suggest that there could be added to Unicode a new category of Registerable Private Use Area characters.

    The new Registerable Private Use Areas would be located beyond U+10FFFF.

    In a regular Unicode plane there could be defined hypersurrogates, such that a sequence of two hypersurrogate characters, namely a high hypersurrogate character followed by a low hypersurrogate character, could be used in UTF-16 to access a registerable private use area character codepoint.

    This would mean that eight bytes would be used to access a registerable private use area character codepoint using UTF-16: a high surrogate and a low surrogate to produce the high hypersurrogate, and then a high surrogate and a low surrogate to produce the low hypersurrogate. The high hypersurrogate and the low hypersurrogate together then produce the registerable private use area character codepoint.
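    The byte count can be checked with a small Python sketch. No hypersurrogate codepoints have been assigned, so the two codepoints used here (U+F0000 and U+F0001, from the existing Supplementary Private Use Area-A) are only stand-ins chosen for illustration, and the helper name utf16_units is likewise hypothetical. Any two characters outside the Basic Multilingual Plane would behave the same way.

```python
def utf16_units(text: str) -> list[int]:
    """Return the 16-bit UTF-16 code units of text (big-endian, no BOM)."""
    encoded = text.encode("utf-16-be")
    return [int.from_bytes(encoded[i:i + 2], "big")
            for i in range(0, len(encoded), 2)]


# ASSUMPTION: stand-in codepoints only; the suggestion does not say where
# in regular Unicode the hypersurrogates would be encoded.
stand_in_high_hypersurrogate = chr(0xF0000)
stand_in_low_hypersurrogate = chr(0xF0001)

pair = stand_in_high_hypersurrogate + stand_in_low_hypersurrogate
units = utf16_units(pair)

# Each stand-in needs a surrogate pair, so the sequence takes
# four 16-bit code units, which is eight bytes.
print([hex(u) for u in units])
print(len(pair.encode("utf-16-be")), "bytes")
```

    Each stand-in is outside the Basic Multilingual Plane, so each one needs an ordinary surrogate pair, giving four 16-bit code units in total.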

    This method would allow the codepoints to be reached from within regular Unicode.

    The calculation of the registerable private use area character codepoint would be by extracting n bits from the high hypersurrogate, multiplying by 2 to the power of n, then extracting n bits from the low hypersurrogate and adding that value, followed by adding hexadecimal 110000. The value of n herein mentioned would be fixed during the encoding process of the Unicode Consortium: maybe a value such as 14 or maybe 15 or even 16.
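    The calculation above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the n-bit values are taken as already extracted from each hypersurrogate (how they would be extracted depends on where the hypersurrogates are encoded, which is not yet decided), and n is passed in as a parameter because its value is still open.

```python
REGISTERABLE_BASE = 0x110000  # first value beyond U+10FFFF


def combine_hypersurrogates(high_bits: int, low_bits: int, n: int) -> int:
    """Combine two n-bit payloads into a registerable PUA codepoint.

    ASSUMPTION: high_bits and low_bits are the n bits already extracted
    from the high and low hypersurrogate respectively.
    """
    limit = 1 << n  # 2 to the power of n
    if not (0 <= high_bits < limit and 0 <= low_bits < limit):
        raise ValueError("payloads must each fit in n bits")
    return high_bits * limit + low_bits + REGISTERABLE_BASE


def split_codepoint(codepoint: int, n: int) -> tuple[int, int]:
    """Inverse of combine_hypersurrogates: recover the two n-bit payloads."""
    offset = codepoint - REGISTERABLE_BASE
    if not (0 <= offset < (1 << (2 * n))):
        raise ValueError("codepoint outside the range reachable with n bits")
    return offset >> n, offset & ((1 << n) - 1)


# With n = 14, two payloads give 2**28 registerable codepoints.
first = combine_hypersurrogates(0, 0, 14)
last = combine_hypersurrogates(0x3FFF, 0x3FFF, 14)
print(hex(first), hex(last))
```

    With n = 14 the reachable range would run from hexadecimal 110000 to hexadecimal 1010FFFF; larger values of n widen the range at the cost of needing more hypersurrogate codepoints in regular Unicode.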

    The registration of codepoints in the Registerable Private Use Areas would not be done by the Unicode Consortium or ISO. Some readers will know that there are a few initiatives that publish lists of Private Use Area mappings. These lists are often useful in various specialist fields, yet do have the disadvantage of the codepoint mappings not being unique, which can lead to problems for archiving.

    Quite how registration would work is a matter that could be discussed in this thread. There could be a number of administratively separate Registerable Private Use Areas. Registerable Private Use Area codepoints need not necessarily all be registerable by the same registrar; there could be a number of independent registrars.

    I remember that, in the days before the world wide web, much activity took place in Usenet discussion groups. Starting a new Usenet discussion group was a long process needing a lot of support. However, there was also the alt.* hierarchy, with its convention of discussing a suggestion in the alt.config newsgroup for at least seven days; if there was no great objection, the new group could then be added. The system seemed to work well. A system manager at a site could decide not to accept alt.* groups at all, or to carry only those specially requested by end users at his or her site.

    Registration of registerable private use area character codepoints would perhaps need a more structured approach, as the registration would be permanent so that archives could rely on the registrations.

    Yet there do seem to be situations where a way to archive unambiguously a character that is not, or perhaps could not be, encoded in regular Unicode would be useful.

    So here is the basic suggestion and hopefully a discussion can take place to explore whether this should be done and, if so, to discuss the way of organizing the registration system.

    I was preparing this post and thought that I would look up the word hypersurrogate in the Unicode mail list archive, in case the word had been used before.

    I found the following post.

    http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML020/0103.html

    However, my suggestion is somewhat different from the earlier suggestion in various ways, in particular that the hypersurrogates that I am suggesting would be encoded in regular Unicode.

    William Overington

    15 June 2009


