Re: Unicode characters List instead of hexadecimal equivalent

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 30 2006 - 10:00:16 CDT


    From: "Dean Harding" <dean.harding@dload.com.au>
    >> (Microsoft.VisualBasic.ChrW(System.Convert.ToInt32(hex, 6))
    >
    > If you want to do the .NET way, just do:
    >
    > string value = "3df";
    > char ch = (char) Convert.ToInt32(value, 16);
    >
    > Now, because in .NET a char is actually a UTF-16 codepoint, you won't get
    > anything > U+FFFF. And also, this doesn't take into account unassigned or
    > invalid codepoints and stuff (unless you're getting your input data from the
    > Unicode database already)

    A more complete code snippet:

    string value = "03DF"; // Sample hexadecimal code point, extracted from the Unicode database

    string ch; // The result of the conversion to text will be placed here
    int codepoint = Convert.ToInt32(value, 16); // Temporary scalar value
    if (codepoint > 0xFFFF) {
        // Supplementary plane: encode the code point as a surrogate pair
        codepoint -= 0x10000;
        int highsurrogate = 0xD800 + (codepoint >> 10),
            lowsurrogate = 0xDC00 + (codepoint & 0x3FF);
        ch = "" + (char)highsurrogate + (char)lowsurrogate;
    } else {
        // Basic Multilingual Plane: a single code unit suffices
        ch = ((char)codepoint).ToString();
    }
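
    As an aside, if you target .Net 2.0 or later, the framework already provides an equivalent conversion that also rejects surrogate and out-of-range values (a brief sketch; the sample value is arbitrary):

    string value = "1D11E"; // An arbitrary supplementary-plane code point (MUSICAL SYMBOL G CLEF)
    int codepoint = Convert.ToInt32(value, 16);
    // char.ConvertFromUtf32 returns a string of one or two code units, and throws
    // ArgumentOutOfRangeException for 0xD800-0xDFFF or values above 0x10FFFF.
    string ch = char.ConvertFromUtf32(codepoint);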

    Note: Don't look for any Unicode character in the special range 0xD800-0xDFFF; there is none. These code points are permanently reserved for the two consecutive sets of surrogates, which are not characters, and no Unicode-compliant font will contain a glyph for them because they are not assigned to any character. .Net places no restriction on these code units in the "char" and "string" datatypes, but if you ignore the restriction, the result is text that is not Unicode-conformant: either the surrogates are not correctly paired as above, or a "string" has been broken in the middle of a surrogate pair.
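
    If the hexadecimal input may come from an arbitrary source rather than from the Unicode database, one way to honor this restriction is to check the value before converting it (a sketch; the helper name is mine):

    // True only for values that may legally stand alone in Unicode text,
    // i.e. inside 0x0..0x10FFFF but outside the surrogate range.
    static bool IsUnicodeScalar(int codepoint) {
        return codepoint >= 0 && codepoint <= 0x10FFFF
            && !(codepoint >= 0xD800 && codepoint <= 0xDFFF);
    }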

    The result in ch is a "string" instance, not a "char", because it may contain one or two "char" values even though it represents only one Unicode character. The "char" datatype in .Net is not a "character" but a 16-bit "code unit" in Unicode terms, and the "string" datatype is a stream of such code units, which is the right model for storing any UTF-16 sequence.
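
    To make the one-character/two-"char" distinction concrete (assuming .Net 2.0 for the helper methods used):

    string gclef = char.ConvertFromUtf32(0x1D11E); // One character: MUSICAL SYMBOL G CLEF
    Console.WriteLine(gclef.Length);               // Prints 2: two 16-bit code units
    Console.WriteLine(char.IsHighSurrogate(gclef[0])); // True
    Console.WriteLine(char.IsLowSurrogate(gclef[1]));  // True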

    The "string" datatype is permissive andallows storing text which is not conforming to Unicode, because it can store forbidden or unallocated codepoints, or can store invalid UTF-16 sequences where surrogates are not paired as built above. (However this "string" datatype is compatible for storing any Unicode-compliant text, and .Net offers various functions or methods to work on string objects according to Unicode algorithms, but you are not limited to those methods and can use the datatype to store any thing, including for storing random vectors of 16-bit integers).

    The same is true for many other languages and platforms besides .Net (including Java, ECMAScript, and the "Windows Unicode API"), which use the same encoding paradigm based on 16-bit code units, with strings built as streams of such code units.

    Text and string encoding has no bearing on how the characters will finally be rendered on your screen or printer, simply because plain text does not contain the glyphs (letter forms, etc.); those are subject to transformations unrelated to the character encoding (font style, boldness, slanting/italicizing, shadowing, 3D transformation, coloring, texturing, and so on), and the text may even be rendered without the glyph paradigm at all (for example as Braille patterns on text-to-Braille devices, or orally through text-to-speech renderers).


