G-Strings

From: Arcane Jill (arcanejill@ramonsky.com)
Date: Tue Dec 16 2003 - 11:06:13 EST

    There was talk recently on this list of mapping grapheme clusters to the
    PUA (for application internal use only, obviously, not for export to the
    real world). I actually did this recently, though my app ended up in an
    incomplete state since I got bored and moved on to something else. The
    algorithm worked though, so I present it here and place it in the public
    domain, licence free, for anyone to use who wants to do so. Such an
    encoded string I called a "grapheme string", or "gstring" for short. Of
    course, that was before "grapheme" was renamed as "default grapheme
    cluster", so the name doesn't work quite as well now.

    The range of characters I reserved for my private use actually
    consisted of the surrogate codepoints, not the PUA codepoints. I
    reasoned that the PUA might actually be in use for something else,
    but the surrogate codepoints are illegal in decoded text and therefore
    available. Despite the fact that the number of possible graphemes is
    infinite, I never actually ran out of codepoints.
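
    To make the size of that reserved range concrete, here is a small
    Python sketch. The names RESERVED_FIRST, RESERVED_LAST and is_reserved
    are illustrative assumptions, not part of the original app; the only
    fact relied on is that the surrogate block spans U+D800 to U+DFFF.

```python
# Assumption: the reserved range is the surrogate block, as described above.
RESERVED_FIRST = 0xD800
RESERVED_LAST = 0xDFFF

# Number of distinct multi-codepoint graphemes that can be mapped.
CAPACITY = RESERVED_LAST - RESERVED_FIRST + 1


def is_reserved(word: int) -> bool:
    """Return True if a 16-bit word falls inside the reserved range."""
    return RESERVED_FIRST <= word <= RESERVED_LAST
```

    With these bounds CAPACITY is 2048, so up to 2048 distinct graphemes
    can be mapped before the range is exhausted.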

    Here's the algorithm in pseudo-code:

    // The following are static and global
    // max_word: a 16-bit unsigned integer, initially the lowest codepoint
    //     you reserve (e.g. the start of the PUA) minus one
    // map_grapheme_to_word[]: a mapping from grapheme (= array of
    //     codepoints) to 16-bit word, initially empty
    // map_word_to_grapheme[]: a mapping from 16-bit word to grapheme,
    //     initially empty

    // Convert unicode text to internal representation with one 16-bit word
    // per grapheme
    // -- input (text_unicode) is an array of codepoints (i.e. it has already
    //    been decoded from UTF-whatever)
    // -- output (text_internal) is an array of 16-bit words, each
    //    representing one grapheme. THIS STRING MAY NEVER BE EXPORTED.

    text_internal = ""
    // each grapheme is a substring of one or more codepoints
    for (each grapheme in text_unicode)
    {
        grapheme = convert_to_NFC(grapheme);
        if (num_codepoints(grapheme) == 1 && codepoint_of(grapheme) < 0x10000)
        {
            text_internal += codepoint_of(grapheme);
        }
        else
        {
            if (!exists(map_grapheme_to_word[grapheme]))
            {
                if (max_word < highest reserved codepoint)
                {
                    map_grapheme_to_word[grapheme] = ++max_word;
                    map_word_to_grapheme[max_word] = grapheme;
                }
                else
                {
                    // Whoa!! Ran out of reserved characters! Could add
                    // error handling here.
                    text_internal += U+FFFD;
                    continue; // nothing was mapped, so skip the lookup below
                }
            }
            text_internal += map_grapheme_to_word[grapheme];
        }
    }
    return text_internal;
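
    The encoding loop above could be sketched in Python roughly as follows.
    This is only an illustration under stated assumptions: the class name
    GStringEncoder and the helper simple_clusters are hypothetical, the
    reserved range is taken to be the surrogate block, and the segmentation
    (a base character plus any following combining marks) is a crude
    stand-in for real grapheme cluster rules. Input is assumed to contain
    no lone surrogates.

```python
import unicodedata

RESERVED_FIRST = 0xD800  # assumption: surrogate block is the reserved range
RESERVED_LAST = 0xDFFF
REPLACEMENT = 0xFFFD


def simple_clusters(text):
    """Crude segmentation: a base character plus following combining marks.

    This is NOT full grapheme cluster segmentation, only a placeholder.
    """
    cluster = ""
    for ch in text:
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster


class GStringEncoder:
    def __init__(self):
        self.max_word = RESERVED_FIRST - 1
        self.grapheme_to_word = {}
        self.word_to_grapheme = {}

    def encode(self, text):
        """Return a list of 16-bit words, one per grapheme."""
        words = []
        for cluster in simple_clusters(text):
            grapheme = unicodedata.normalize("NFC", cluster)
            # Single BMP codepoints represent themselves.
            if len(grapheme) == 1 and ord(grapheme) < 0x10000:
                words.append(ord(grapheme))
                continue
            word = self.grapheme_to_word.get(grapheme)
            if word is None:
                if self.max_word >= RESERVED_LAST:
                    # Out of reserved words: emit U+FFFD and move on.
                    words.append(REPLACEMENT)
                    continue
                self.max_word += 1
                word = self.max_word
                self.grapheme_to_word[grapheme] = word
                self.word_to_grapheme[word] = grapheme
            words.append(word)
        return words
```

    For example, GStringEncoder().encode("q\u0307\u0323x") yields one
    reserved word for the three-codepoint grapheme followed by 0x78 for "x",
    and encoding the same grapheme again reuses the same word.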

    // The converse process
    text_unicode = "";
    for (each word in text_internal)
    {
        if (word in correct range) // e.g. PUA but doesn't have to be
        {
            if (exists(map_word_to_grapheme[word]))
            {
                text_unicode += map_word_to_grapheme[word];
            }
            else
            {
                // error - should never happen
                text_unicode += U+FFFD;
            }
        }
        else
        {
            text_unicode += word;
        }
    }
    return text_unicode;
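
    A matching Python sketch of the converse process, again hedged: the
    function name decode and its word_to_grapheme parameter are
    illustrative assumptions, and the reserved range is taken to be the
    surrogate block. The mapping is passed in explicitly so the example
    stands alone.

```python
RESERVED_FIRST = 0xD800  # assumption: surrogate block is the reserved range
RESERVED_LAST = 0xDFFF


def decode(words, word_to_grapheme):
    """Expand a list of 16-bit words back into ordinary Unicode text."""
    parts = []
    for word in words:
        if RESERVED_FIRST <= word <= RESERVED_LAST:
            # Reserved word: look up the grapheme it stands for. An unknown
            # word should never happen; fall back to U+FFFD if it does.
            parts.append(word_to_grapheme.get(word, "\uFFFD"))
        else:
            # Anything else is an ordinary BMP codepoint.
            parts.append(chr(word))
    return "".join(parts)
```

    For example, decode([0x61, 0xD800], {0xD800: "q\u0323\u0307"}) gives
    "a" followed by the three-codepoint grapheme.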

    Jill


