Re: Newbie Question - what are all those duplicated characters FOR?

From: John Cowan (cowan@mercury.ccil.org)
Date: Mon Aug 11 2003 - 07:25:48 EDT

Next message: Kent Karlsson: "RE: Questions on ZWNBS - for line initial holam plus alef"

Previous message: Theodore H. Smith: "[off] XML. And RAM"
In reply to: Jill.Ramonsky@Aculab.com: "RE: Newbie Question - what are all those duplicated characters FOR?"
Next in thread: Jill.Ramonsky@Aculab.com: "RE: Newbie Question - what are all those duplicated characters FOR?"

> Stefan has effectively dealt with SOME of my confusion, but questions
> remain. For example: between 1D49C (mathematical script capital A) and
> 1D49E (mathematical script capital C) we find 1D49D (<reserved>). What is it
> reserved for? I am aware that codepoint 212C is script capital B, but why
> does that justify leaving a "hole" in the codepoint space? Why not just omit
> "mathematical script capital B" without leaving a hole? (i.e. why not just
> go straight from A to C?).

Primarily for implementation simplicity. It's possible to convert between
any of the mathematical "fonts" and any other, or the corresponding "normal"
ones, with a simple offset plus a short table of exceptions.
Code space on plane 1 just isn't that precious. Similar things have
been done throughout Unicode: for example, in the main Greek block,
there is a hole where "capital letter final sigma" would be, since there
is no such character: the final/non-final distinction is not made in
capital letters.
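
As a rough illustration of the "offset plus a short table of exceptions"
idea, here is a minimal Python sketch for the mathematical script capitals.
The function and constant names are made up for this example; the code
points are the ones in the Unicode charts, where each reserved hole in
1D49C..1D4B5 corresponds to a letter that already existed in the
Letterlike Symbols block (such as 212C for script capital B).

```python
# Hypothetical helper names; only the code points come from the Unicode charts.
SCRIPT_CAPITAL_A = 0x1D49C  # MATHEMATICAL SCRIPT CAPITAL A

# Letters whose slot in the 1D49C..1D4B5 range is reserved because the
# character was already encoded in the Letterlike Symbols block.
SCRIPT_CAPITAL_EXCEPTIONS = {
    "B": 0x212C, "E": 0x2130, "F": 0x2131, "H": 0x210B,
    "I": 0x2110, "L": 0x2112, "M": 0x2133, "R": 0x211B,
}


def to_script_capital(ch):
    """Map an ASCII capital to its script form; return other input unchanged."""
    if not ("A" <= ch <= "Z"):
        return ch
    if ch in SCRIPT_CAPITAL_EXCEPTIONS:
        # The short table of exceptions.
        return chr(SCRIPT_CAPITAL_EXCEPTIONS[ch])
    # The simple offset: same position in the alphabet, different base.
    return chr(SCRIPT_CAPITAL_A + ord(ch) - ord("A"))
```

Because the holes are left in place, the offset arithmetic stays the same
for every letter that is not in the table; without the hole at 1D49D,
every letter after A would need its own adjustment.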

> More questions. From E0020 to E007E we have "tag space" through to "tag
> tilde". These are copies of the Basic Latin block at 0020. I still don't
> know what they are for.

The tag characters are used to embed tags, specifically language tags,
in contexts where markup is too heavyweight but it seems essential to
record the language of a text. One such application is in protocol
design, where it is occasionally necessary to pass around human-readable
strings within the protocol, and it is desirable to supply the correcdt
string for a given language. All other uses are strongly discouraged.

But if you have to do it, you can encode "en-us" (the language code for
U.S. English) using <E0001, E0065, E006E, E002D, E0075, E0073>. For
all purposes other than language identification, tag characters are
ignored.
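
The sequence above can be produced mechanically, since each tag character
sits at a fixed offset of E0000 from its Basic Latin counterpart. The
following is only a sketch in Python with hypothetical function names,
not a recommendation to use tag characters; it assumes the tag is plain
printable ASCII and that "ignoring" tag characters means dropping E0001
and E0020..E007F from the text.

```python
# Hypothetical helper names for illustration only.
LANGUAGE_TAG = 0xE0001  # introduces a language tag
TAG_OFFSET = 0xE0000    # tag character = ASCII code point + this offset


def encode_language_tag(tag):
    """Return the tag-character sequence for an ASCII language tag."""
    out = [chr(LANGUAGE_TAG)]
    for ch in tag:
        cp = ord(ch)
        if not (0x20 <= cp <= 0x7E):
            raise ValueError("only printable ASCII has a tag counterpart")
        out.append(chr(TAG_OFFSET + cp))
    return "".join(out)


def strip_tag_characters(text):
    """Drop tag characters, as a process that ignores them might do."""
    def is_tag(cp):
        return cp == LANGUAGE_TAG or 0xE0020 <= cp <= 0xE007F
    return "".join(ch for ch in text if not is_tag(ord(ch)))
```

For example, encode_language_tag("en-us") yields the six code points
listed above, and strip_tag_characters() applied to that sequence
followed by ordinary text returns just the ordinary text.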

-- 
John Cowan  jcowan@reutershealth.com  www.ccil.org/~cowan  www.reutershealth.com
"In computer science, we stand on each other's feet."
        --Brian K. Reid

This archive was generated by hypermail 2.1.5 : Mon Aug 11 2003 - 08:01:08 EDT