RE: Newbie Question - what are all those duplicated characters FOR?

Date: Mon Aug 11 2003 - 11:07:46 EDT

    Hey thanks - I think I've got all that now.

    Of course, I'm tempted to wonder whether it would have made more sense
    simply to introduce a few new combining characters in plane 0, such as
    "make bold", "make italic", "make script", "make fraktur", "make
    double-struck", "make sans serif", "make monospace" and "make tag". This
    would have achieved the same effect (with the same space requirements
    too, at least for things like "bold uppercase A" in UTF-16), but with
    much greater flexibility: you could also make _other_ characters bold,
    and you could create combinations of the attributes not currently
    represented.
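
    The UTF-16 space claim can be checked with a short sketch. No "make
    bold" combining character exists, so U+0301 (COMBINING ACUTE ACCENT) is
    used here purely as a stand-in for any hypothetical plane 0 combining
    character, and the helper name utf16_units is made up for illustration.

```python
def utf16_units(text: str) -> int:
    """Count the 16-bit code units needed to encode text in UTF-16."""
    # "utf-16-be" emits no byte-order mark, so bytes // 2 is the unit count.
    return len(text.encode("utf-16-be")) // 2


# U+1D400 MATHEMATICAL BOLD CAPITAL A sits on plane 1, so UTF-16 needs a
# surrogate pair for it.
bold_a = "\U0001D400"

# Assumption: U+0301 only stands in for a hypothetical plane 0
# "make bold" combining character; no such character exists.
marked_a = "A" + "\u0301"

print(utf16_units(bold_a), utf16_units(marked_a))  # prints: 2 2
```

    Both forms take two code units in UTF-16. The comparison says nothing
    about other encodings.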

    I still haven't figured out what "fullwidth" means though. I don't really
    understand in what way a "fullwidth full stop" (FF0E) is different from a
    "full stop" (002E), etc. I _have_ downloaded, and read in its entirety,
    the code chart document for FF00-FFEF, and nothing in that document
    explains to me why these characters are necessary. Does anyone have any
    clues on that one?


    -----Original Message-----
    From: John Cowan
    Sent: Monday, August 11, 2003 12:26 PM
    Subject: Re: Newbie Question - what are all those duplicated characters FOR?

    scripsit:

    > Stefan has effectively dealt with SOME of my confusion, but questions
    > remain. For example: between 1D49C (mathematical script capital A) and
    > 1D49E (mathematical script capital C) we find 1D49D (<reserved>). What is it
    > reserved for? I am aware that codepoint 212C is script capital B, but why
    > does that justify leaving a "hole" in the codepoint space? Why not just
    > "mathematical script capital B" without leaving a hole? (i.e. why not just
    > go straight from A to C?).

    Primarily for implementation simplicity. It's possible to convert between
    any of the mathematical "fonts" and any other, or the corresponding "normal"
    ones, with a simple offset plus a short table of exceptions.
    Code space on plane 1 just isn't that precious. Similar things have
    been done throughout Unicode: for example, in the main Greek block,
    there is a hole where "capital letter final sigma" would be, since there
    is no such character: the final/non-final distinction is not made in
    capital letters.
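
    As a rough sketch of the "simple offset plus a short table of
    exceptions" idea, here is one way the conversion from plain capitals to
    mathematical script capitals might look. The function and constant
    names are invented for this example; the exception table lists the
    script capitals that already had code points in the Letterlike Symbols
    block, which is why their plane 1 slots are reserved.

```python
# Offset from 'A' (U+0041) to U+1D49C MATHEMATICAL SCRIPT CAPITAL A.
SCRIPT_CAPITAL_OFFSET = 0x1D49C - ord("A")

# Script capitals encoded earlier in Letterlike Symbols; the matching
# positions in the plane 1 run (1D49D for B, and so on) are holes.
SCRIPT_CAPITAL_EXCEPTIONS = {
    "B": "\u212C",
    "E": "\u2130",
    "F": "\u2131",
    "H": "\u210B",
    "I": "\u2110",
    "L": "\u2112",
    "M": "\u2133",
    "R": "\u211B",
}


def to_script_capitals(text: str) -> str:
    """Map A-Z to script capitals; leave every other character alone."""
    out = []
    for ch in text:
        if ch in SCRIPT_CAPITAL_EXCEPTIONS:
            out.append(SCRIPT_CAPITAL_EXCEPTIONS[ch])  # table lookup
        elif "A" <= ch <= "Z":
            out.append(chr(ord(ch) + SCRIPT_CAPITAL_OFFSET))  # plain offset
        else:
            out.append(ch)
    return "".join(out)


print([f"{ord(c):04X}" for c in to_script_capitals("ABC")])
# prints: ['1D49C', '212C', '1D49E']
```

    Had B been packed in directly after A with no hole, every letter after
    it would need its own adjusted offset, and the exception table alone
    would no longer be enough.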

    > More questions. From E0020 to E007E we have "tag space" through to "tag
    > tilde". These are copies of the Basic Latin block at 0020. I still don't
    > know what they are for.

    The tag characters are used to embed tags, specifically language tags,
    in contexts where markup is too heavyweight but it seems essential to
    record the language of a text. One such application is in protocol
    design, where it is occasionally necessary to pass around human-readable
    strings within the protocol, and it is desirable to supply the correct
    string for a given language. All other uses are strongly discouraged.

    But if you have to do it, you can encode "en-us" (the language code for
    U.S. English) using <E0001, E0065, E006E, E002D, E0075, E0073>. For
    all purposes other than language identification, tag characters are
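
    The "en-us" sequence above follows a fixed pattern: U+E0001 first, then
    each ASCII character of the language code moved up by 0xE0000. A
    minimal sketch of that mapping follows; the function and constant names
    are hypothetical, and nothing here is meant as an endorsement of using
    tag characters.

```python
LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG introduces the tag
TAG_OFFSET = 0xE0000         # tag characters mirror ASCII at this offset


def encode_language_tag(code: str) -> str:
    """Spell a language code such as "en-us" using tag characters."""
    # Only U+0020..U+007E have tag counterparts (E0020..E007E).
    if not all(0x20 <= ord(ch) <= 0x7E for ch in code):
        raise ValueError("tag characters only mirror U+0020..U+007E")
    return LANGUAGE_TAG + "".join(chr(ord(ch) + TAG_OFFSET) for ch in code)


tagged = encode_language_tag("en-us")
print([f"{ord(ch):X}" for ch in tagged])
# prints: ['E0001', 'E0065', 'E006E', 'E002D', 'E0075', 'E0073']
```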

    John Cowan
    "In computer science, we stand on each other's feet."
            --Brian K. Reid
