Re: Newbie Question - what are all those duplicated characters FOR?

From: John Cowan (cowan@mercury.ccil.org)
Date: Mon Aug 11 2003 - 07:25:48 EDT

    Jill.Ramonsky@Aculab.com scripsit:

    > Stefan has effectively dealt with SOME of my confusion, but questions
    > remain. For example: between 1D49C (mathematical script capital A) and
    > 1D49E (mathematical script capital C) we find 1D49D (<reserved>). What is it
    > reserved for? I am aware that codepoint 212C is script capital B, but why
    > does that justify leaving a "hole" in the codepoint space? Why not just omit
    > "mathematical script capital B" without leaving a hole? (i.e. why not just
    > go straight from A to C?).

    Primarily for implementation simplicity. It's possible to convert between
    any of the mathematical "fonts" and any other, or the corresponding "normal"
    ones, with a simple offset plus a short table of exceptions.
    Code space on plane 1 just isn't that precious. Similar things have
    been done throughout Unicode: for example, in the main Greek block,
    there is a hole (at 03A2) where "capital letter final sigma" would be, since there
    is no such character: the final/non-final distinction is not made in
    capital letters.
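
    To make the "offset plus exceptions" idea concrete, here is a rough
    sketch in Python for the script capitals only. The names
    (to_script_capital and the two constants) are made up for illustration;
    the exception table lists the script capitals that already existed in
    the Letterlike Symbols block, which is why their plane 1 slots are
    reserved.

```python
# Sketch only: map ASCII capitals to their "script" counterparts using a
# fixed offset from U+1D49C plus a short table of exceptions.
# All names here are illustrative, not part of any standard API.

SCRIPT_CAPITAL_A = 0x1D49C  # MATHEMATICAL SCRIPT CAPITAL A

# Letters whose script form was already encoded in Letterlike Symbols;
# the corresponding plane 1 code points are the reserved "holes".
SCRIPT_CAPITAL_EXCEPTIONS = {
    "B": 0x212C,
    "E": 0x2130,
    "F": 0x2131,
    "H": 0x210B,
    "I": 0x2110,
    "L": 0x2112,
    "M": 0x2133,
    "R": 0x211B,
}


def to_script_capital(ch):
    """Return the script form of an ASCII capital; pass anything else through."""
    if len(ch) != 1 or not ("A" <= ch <= "Z"):
        return ch
    if ch in SCRIPT_CAPITAL_EXCEPTIONS:
        return chr(SCRIPT_CAPITAL_EXCEPTIONS[ch])
    # Because the holes were left in place, the offset is the same for
    # every letter that is not in the exception table.
    return chr(SCRIPT_CAPITAL_A + ord(ch) - ord("A"))
```

    Had the holes been closed up, every letter after B would need its own
    adjustment instead of a single constant offset.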

    > More questions. From E0020 to E007E we have "tag space" through to "tag
    > tilde". These are copies of the Basic Latin block at 0020. I still don't
    > know what they are for.

    The tag characters are used to embed tags, specifically language tags,
    in contexts where markup is too heavyweight but it seems essential to
    record the language of a text. One such application is in protocol
    design, where it is occasionally necessary to pass around human-readable
    strings within the protocol, and it is desirable to supply the correct
    string for a given language. All other uses are strongly discouraged.

    But if you have to do it, you can encode "en-us" (the language code for
    U.S. English) using <E0001, E0065, E006E, E002D, E0075, E0073>. For
    all purposes other than language identification, tag characters are
    ignored.
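
    As a rough illustration in Python (the function names are invented
    for this sketch), the encoding is just the language tag introducer
    E0001 followed by each ASCII character shifted up by E0000, and
    "ignoring" tag characters amounts to dropping everything in the
    range E0000 to E007F:

```python
# Sketch only: build the tag-character sequence for a language tag and
# strip tag characters from text. Function names are illustrative.

TAG_OFFSET = 0xE0000    # tag characters mirror ASCII 0x20..0x7E at this offset
LANGUAGE_TAG = 0xE0001  # LANGUAGE TAG, which introduces a language tag


def encode_language_tag(tag):
    """Return the list of code points that spell `tag` in tag characters."""
    if not all(0x20 <= ord(c) <= 0x7E for c in tag):
        raise ValueError("only printable ASCII can be mapped to tag characters")
    return [LANGUAGE_TAG] + [TAG_OFFSET + ord(c) for c in tag]


def strip_tag_characters(text):
    """Drop every code point in the Tags block (U+E0000..U+E007F)."""
    return "".join(c for c in text if not (0xE0000 <= ord(c) <= 0xE007F))


if __name__ == "__main__":
    print(" ".join(format(cp, "X") for cp in encode_language_tag("en-us")))
```

    Run as a script, this prints E0001 E0065 E006E E002D E0075 E0073,
    matching the sequence given above.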

    -- 
    John Cowan  jcowan@reutershealth.com  www.ccil.org/~cowan  www.reutershealth.com
    "In computer science, we stand on each other's feet."
            --Brian K. Reid
    

