Re: Decimal separator with more than one character?

From: Philippe Verdy (
Date: Thu May 15 2003 - 18:15:08 EDT


    From: "Deborah Goldsmith" <>
    > On Wednesday, May 14, 2003, at 10:37 PM,
    > wrote:
    > > Define "one character" :-)
    > One Unicode character, i.e., one Unicode Scalar Value.

    More exactly, one 21-bit code point (excluding the surrogate range 0xD800-0xDFFF).

    It can be encoded as a single 32-bit code unit (UTF-32), or as a fixed-length ordered sequence of 3 bytes (in LSB-first or MSB-first byte order).
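
    As a rough illustration of the fixed-length forms, the sketch below encodes one arbitrarily chosen code point (U+1D11E, not taken from this thread) as UTF-32 and as a 3-byte sequence. The 3-byte packing is only an assumption about what such a form would look like; it is not a standard Unicode encoding form.

```python
# Illustrative code point (not from the original message): U+1D11E.
cp = 0x1D11E
ch = chr(cp)

# UTF-32: exactly one 32-bit code unit, in either byte order.
utf32_be = ch.encode("utf-32-be")
utf32_le = ch.encode("utf-32-le")

# Hypothetical 3-byte packing: 21 bits fit in 24 bits. This is not a
# standard Unicode encoding form, only a sketch of the idea in the text.
packed_be = cp.to_bytes(3, "big")
packed_le = cp.to_bytes(3, "little")

print(utf32_be.hex(), utf32_le.hex(), packed_be.hex(), packed_le.hex())
```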

    It can be encoded with variable length in UTF-16: as one code unit equal to the code point value if it lies in 0x0000-0xD7FF or in 0xE000-0xFFFF, or as two surrogate code units for code points in 0x10000-0x10FFFF (the first one in 0xD800-0xDBFF, the second in 0xDC00-0xDFFF).
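
    The UTF-16 rule can be sketched as a small function. The name utf16_code_units is made up for this example, and the ranges follow the Unicode definition of UTF-16 (one code unit for the BMP, a high/low surrogate pair above it).

```python
def utf16_code_units(cp):
    """Return the UTF-16 code units (as ints) for one Unicode scalar value."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not scalar values")
    if cp < 0x10000:
        # BMP: the single code unit equals the code point value.
        return [cp]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # 0xD800-0xDBFF
    low = 0xDC00 + (offset & 0x3FF)   # 0xDC00-0xDFFF
    return [high, low]


print([hex(u) for u in utf16_code_units(0x002C)])
print([hex(u) for u in utf16_code_units(0x1D11E)])
```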

    It can also be encoded in UTF-8: as 1 ASCII byte in 0x00 to 0x7F equal to the Unicode code point value, or else as a UTF-8 lead byte in 0xC2-0xF4 followed by one to three UTF-8 trail bytes in 0x80-0xBF.
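
    A similar sketch for UTF-8, assuming the current definition of UTF-8 that is limited to U+0000..U+10FFFF and therefore never needs more than four bytes. The function name utf8_bytes is hypothetical.

```python
def utf8_bytes(cp):
    """Return the UTF-8 byte sequence for one Unicode scalar value."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        # One ASCII byte, equal to the code point value.
        return bytes([cp])
    if cp < 0x800:
        # Lead byte 0xC2-0xDF, then one trail byte.
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # Lead byte 0xE0-0xEF, then two trail bytes.
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # Lead byte 0xF0-0xF4, then three trail bytes.
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for sample in (0x2C, 0x66B, 0x20AC, 0x1D11E):
    print(hex(sample), utf8_bytes(sample).hex())
```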

    Other encoding lengths are possible if you use other encodings.

    --- Note:

    There are other possible definitions of a character: a sequence consisting of a "starter" character followed by combining characters or diacritics, or one using form-variant prefixes (such as in Brahmic scripts) or suffixes (Variation Selectors).

    You could also view this character as being canonically equivalent to a decomposed string.
    It's true that the term "character" is ambiguous in Unicode, which already encodes many single characters that have possibly equivalent sequences of one or more code points.
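
    Canonical equivalence can be checked with Python's standard unicodedata module. The letter used here (e with acute accent) is only an example chosen for this sketch.

```python
import unicodedata

composed = "\u00E9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" + COMBINING ACUTE ACCENT

# The strings differ code point by code point...
same_code_points = composed == decomposed

# ...but normalize to the same string, so they are canonically equivalent.
nfd = unicodedata.normalize("NFD", composed)
nfc = unicodedata.normalize("NFC", decomposed)

print(same_code_points, len(composed), len(decomposed))
print(nfd == decomposed, nfc == composed)
```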

    What is not ambiguous in Unicode is the term "code point". So your question for the decimal separator translates as:
        Do I need to allow several code points to encode a decimal separator?

    The answer is:
        For now, no, because none of the standardized scripts, for the languages studied so far, needs characters spanning several code points to represent the decimal separator.
        But in the future, there MAY exist some scripts that commonly use "terms" to designate the decimal separator, and which require multiple Unicode encoded characters or multiple code points for it (unless Unicode standardizes a compatibility equivalent character for that purpose, to avoid using combining sequences).
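
    If you want to stay safe against that possibility, one option is to treat the separator as a string rather than a single code point. The sketch below is only an assumption about how a parser might do that; parse_decimal and the two-code-point separator in the usage lines are hypothetical, and the function handles unsigned numbers only.

```python
from decimal import Decimal


def parse_decimal(text, separator):
    """Parse an unsigned number whose decimal separator may span several code points."""
    if not separator:
        raise ValueError("separator must not be empty")
    # partition() splits on the first occurrence of the whole separator string.
    whole, found, fraction = text.partition(separator)
    if not whole.isdecimal() or (found and not fraction.isdecimal()):
        raise ValueError("malformed number: %r" % text)
    return Decimal(whole + "." + fraction) if found else Decimal(whole)


print(parse_decimal("3,14", ","))
# Made-up two-code-point separator: "." followed by COMBINING DOT BELOW.
print(parse_decimal("3.\u032314", ".\u0323"))
```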

    If you look at the scripts that have been encoded, Unicode took care to assign a single code point to each digit and punctuation mark used in all scripts. Unicode could have defined the Brahmic or Old Greek digits with a variation selector after the base decimal digit, but it did not, even for Roman numerals, which were all encoded up to twelve instead of using a decomposition into Roman digits.
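
    This can be observed with Python's unicodedata module: each digit below is a single code point carrying its own numeric value. The sample characters are picked for this sketch and do not come from the thread.

```python
import unicodedata

# One code point per digit, in three different scripts.
samples = ["5", "\u0665", "\u096B"]
for ch in samples:
    print("U+%04X" % ord(ch), unicodedata.name(ch), unicodedata.decimal(ch))

# Roman numerals above one are also precomposed single code points.
roman_nine = "\u2168"
print(unicodedata.name(roman_nine), unicodedata.numeric(roman_nine))

digit_values = [unicodedata.decimal(ch) for ch in samples]
```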

