On Thu, 3 Jul 1997, Markus Kuhn wrote:
> I expect problems like this to be many orders of magnitude worse
> once Unicode starts to get widely used on the Web. The above
> problem is at least well-defined, the people using the
> 0x80-0x9f characters in HTML are clearly wrong, the HTML specification
> leaves no doubt about this. The problem is just that the authors
> of HTML export filters of one very popular word processor have been
> ignorant about the problem (I won't mention names here).
Some software makers have been ignorant in the past, but they
have caught up. If you think this one hasn't, please tell me
its name in private, and I will contact them.
> However, once you say that all of Unicode is allowed and
> every implementation just arbitrarily selects
> its subset, it will not even in theory any more be possible to
> blame whether the sender or the receiver is responsible for the
> messed-up document that has been displayed.
Quite the contrary. The sender is responsible for encoding all
characters correctly. The receiver is responsible for making
a best-effort job of display. Whatever is left over (cannot be
displayed) can be clearly marked; that cannot be helped
by defining more subsets, only by adding glyphs to fonts.
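This receiver-side policy can be sketched in a few lines. The sketch below is only an illustration, assuming the display software knows its font's repertoire as a simple set of characters; the toy `REPERTOIRE` set and the choice of U+FFFD as the marker are my assumptions, not any particular implementation:

```python
# Sketch of a receiver's best-effort display policy: render every
# character the font covers, and clearly mark the rest instead of
# refusing the whole document.  The repertoire below is a toy
# assumption, not any real font's coverage.

REPERTOIRE = set("abcdefghijklmnopqrstuvwxyz"
                 "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                 "0123456789 .,!?") | {"\u00e9", "\u00fc"}

def best_effort(text, repertoire=REPERTOIRE):
    """Return text with every unsupported character clearly marked."""
    return "".join(ch if ch in repertoire else "\ufffd"  # U+FFFD marker
                   for ch in text)

print(best_effort("r\u00e9sum\u00e9"))  # fully covered: shown unchanged
print(best_effort("\u0416uk"))          # Cyrillic Zhe is marked, rest shown
```

The point of the sketch is that the outcome depends only on the characters actually present, not on which named subset the mail was labeled with.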
> Ok, we could have
> ISO 15646-1 something like MES, WGL4 (minimum European)
> ISO 15646-2 something like EES (GUI European)
> what else? I guess for instance that also Japanese users might
> be interested in a well-defined Unicode subset (ISO 15646-3),
> that might for instance be something like MES plus Japanese
> ideographics, but no bidi characters, no Indic scripts, etc.
> ISO 15646-3 (Japanese)
> ISO 15646-4 (India)
The Japanese already have a nice collection of seven subsets:
(1) Basic Japanese (JIS 201 and JIS 208)
(2) Japanese non-ideographic supplement (from JIS 212)
(3) Japanese ideographic supplement 1 (about 1000 Kanji from JIS 212)
(4) Japanese ideographic supplement 2 (rest of Kanji from JIS 212)
(5) Japanese ideographic supplement 3 (Kanji not in JIS 208 or 212)
(6) Fullwidth alphanumeric
(7) Halfwidth katakana
Any maker can use these to define what they support.
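Such a declaration can be checked mechanically. Below is a minimal sketch using only the two subsets whose Unicode ranges are simple and contiguous (fullwidth ASCII variants in the Fullwidth Forms block, halfwidth katakana at U+FF61–U+FF9F); the subset names, the table layout, and the `covered` function are my own illustration, and the real JIS-based subsets would need full character tables:

```python
# Sketch: checking a string against a vendor-declared list of subsets.
# Only two subsets with simple contiguous Unicode ranges are modeled;
# the JIS-based subsets require complete per-character tables.

SUBSETS = {
    "fullwidth-alphanumeric": [(0xFF01, 0xFF5E)],  # fullwidth ASCII variants
    "halfwidth-katakana":     [(0xFF61, 0xFF9F)],
}

def covered(text, declared):
    """True if every character falls in one of the declared subsets."""
    ranges = [r for name in declared for r in SUBSETS[name]]
    return all(any(lo <= ord(ch) <= hi for lo, hi in ranges)
               for ch in text)

print(covered("\uff71\uff72", ["halfwidth-katakana"]))  # halfwidth A, I: True
print(covered("\uff21\uff71", ["halfwidth-katakana"]))  # fullwidth A not declared: False
```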
> In the end, ISO 15646 could become a very small family of subsets
> for specific language families, very much like ISO 8859 is
> already, with around 5 or 8 different subsets that can be easily
The problem here is "very small": there are so many different
needs, so many different vendors, and so on, that keeping the
family small will be very difficult.
> My e-mail MIME header will announce that this posting is in the
> ISO 15646-1 character set, and we do not have to talk any more
> about "the Unicode subset that Windows-NT 4.0 currently supports in
> more than a third of its fonts".
With the effect that if the recipient's software doesn't support
ISO 15646-1, it won't display anything at all, even if its fonts
completely cover the characters you actually used in your mail.
One idea of Unicode is precisely to avoid such cases.
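To make the contrast concrete, compare the two ways of labeling such a mail; the subset charset name in the first line is hypothetical (no such MIME charset is registered), while the second labels only the coded character set and leaves displayability to the characters actually used:

```
Content-Type: text/plain; charset=ISO-15646-1   (hypothetical subset label)
Content-Type: text/plain; charset=UTF-8         (coded character set only)
```

With the second label, a receiver that happens to cover the characters in the body can display the mail, whether or not it has ever heard of any subset.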
> Well, I just checked the Java spec, and they do allow combining
> characters in identifiers, but they warn about the problems. They
> also warn about problems with homoglyphs like latin letter A and
> cyrillic letter A, but I guess these are unavoidable, even in small
> subsets like MES. I hope regular expression mechanisms will
> deal with homoglyphs like the micro sign and the greek small mu, or
> the capital letter K and the Kelvin sign (never understood the
> difference between those anyway).
There is none. These duplicates exist for backwards compatibility only.
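That compatibility relationship is machine-checkable with a present-day Python and its standard unicodedata module (the Unicode normalization forms were specified after this discussion took place): compatibility normalization (NFKC) folds the Kelvin sign and the micro sign onto the ordinary letters, while true homoglyphs like Latin and Cyrillic A stay distinct.

```python
import unicodedata

# The Kelvin sign (U+212A) and micro sign (U+00B5) exist only for
# round-trip compatibility with older character sets; NFKC folds
# them onto the ordinary letters.
assert unicodedata.normalize("NFKC", "\u212a") == "K"       # Kelvin sign -> latin K
assert unicodedata.normalize("NFKC", "\u00b5") == "\u03bc"  # micro sign -> greek mu

# Latin A and Cyrillic A are NOT related this way: they remain
# distinct under every normalization form.
assert unicodedata.normalize("NFKC", "\u0410") == "\u0410"
print("all compatibility checks pass")
```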
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:35 EDT