Re: Transcriptions of "Unicode"

From: James Kass (jameskass@worldnet.att.net)
Date: Wed Dec 06 2000 - 22:54:18 EST


Erik van der Poel wrote:

> > >
> > > The font selection is indeed somewhat haphazard for CJK when there are
> > > no LANG attributes and the charset doesn't tell us anything either, but
> > > then, what do you expect in that situation anyway? I suppose we could
> > > deduce that the language is Japanese for Hiragana and Katakana, but what
> > > should we do about ideographs? Don't tell me the browser has to start
> > > guessing the language for those characters. I've had enough of the
> > > guessing game. We have been doing it for charsets for years, and it has
> > > led to trouble that we can't back out of now. I think we need to draw
> > > the line here, and tell Web page authors to mark their pages with LANG
> > > attributes or with particular fonts, preferably in style sheets.
> >
> > A Universal Character Set should not require mark-up/tags.
> >
> > If the Japanese version of a Chinese character looks different
> > than the Chinese character, it *is* different. In many cases,
> > "variant" does not mean "same".
>
> I was referring to the CJK Unified Ideographs in the range U+4E00 to
> U+9FA5. I agree that those codes do not *require* mark-up/tags, but if
> the author wishes to have them displayed with a "Japanese font", then
> they must indicate the language or specify the font directly. The latter
> may be problematic. I don't think it's reasonable to expect a browser to
> apply various heuristics to determine the language.
>

I completely agree that it is not reasonable to expect a browser
to guess the language. Since browsers primarily display
information, the browser doesn't really need to be language-aware
in most cases. Exceptions exist, of course, such as word breaking
for Thai and related scripts. Even scripts which don't use spaces
or other word breaks can be encoded with the special spacing
characters available in the Unicode Standard (ZERO WIDTH SPACE,
U+200B, for instance), though.

> > When limited to BMP code points, CJK unification kind of made
> > sense. In light of the new additional planes...
> >
> > The IRG seems to be doing a fine job.
>
> Somehow I get the impression that you have more to say, but you just
> aren't saying it. Cough it up already. :-)
>

Sorry, I'm trying to learn how to be brief (!) and hoped the
inference would be apparent. Although the IRG still
considers unification relevant, it seems to me that they
are much tighter now in their definition of 'sameness'
than was previously the case. Not all of the approx. 40000
"new" characters in Plane 2 are the names of race horses;
some of them, as far as I can tell, would have been unified
before.

Consider the "teeth" ideograph(s). (Radical number 211, in
some radical lists.) Because this is a radical, CJK encoders
can select the specific desired character:
U+2FD2 for Traditional Chinese
U+2EED for Japanese
U+2EEE for Simplified Chinese
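
For anyone who wants to inspect these code points, here is a minimal
sketch using Python's standard unicodedata module. The TEETH table and
the describe() helper are hypothetical names made up for illustration,
and the names printed depend on the Unicode data shipped with the
Python build. One caveat worth seeing for yourself: compatibility
normalization (NFKC) folds the Kangxi radical U+2FD2 back into the
unified ideograph U+9F52, so the distinction only survives if the text
is not normalized that way.

```python
import unicodedata

# The three radical code points listed above, plus the unified ideograph.
TEETH = {
    "U+2FD2": "\u2fd2",
    "U+2EED": "\u2eed",
    "U+2EEE": "\u2eee",
    "U+9F52": "\u9f52",
}


def describe(ch):
    """Return (character name, NFKC form written as U+XXXX code points)."""
    name = unicodedata.name(ch, "<unnamed>")
    folded = unicodedata.normalize("NFKC", ch)
    return name, " ".join(f"U+{ord(c):04X}" for c in folded)


for label, ch in TEETH.items():
    name, folded = describe(ch)
    print(f"{label}  {name}  NFKC -> {folded}")
```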

Since anyone encoding U+9F52 might see any of the above
three versions, my opinion is that encoders (authors) would
wish to explicitly encode their expected character and would
do so whenever they have the option. I believe that they
should have the option. The abundance of unassigned code
points offered by additional Unicode planes makes this
possible and would eliminate the need for a browser
(or any other application) to "guess" a language in order
to display material as its authors and users desire.
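
To make the "guessing" problem concrete, here is a rough sketch of
the kind of script-based heuristic discussed earlier in this thread.
It is not how any actual browser works; guess_language() is a
hypothetical helper, and the ranges are simply the Hiragana and
Katakana blocks and the U+4E00 to U+9FA5 ideograph range mentioned
above. Kana give a usable hint, but text made up only of unified
ideographs gives none.

```python
def guess_language(text):
    """Return "ja" if the text contains kana, otherwise None.

    Unified ideographs (U+4E00..U+9FA5) are shared by Chinese,
    Japanese and Korean, so they cannot settle the question.
    """
    for ch in text:
        cp = ord(ch)
        # Hiragana (U+3040..U+309F) and Katakana (U+30A0..U+30FF)
        if 0x3040 <= cp <= 0x30FF:
            return "ja"
    # Only ideographs or other scripts seen: no basis for a guess.
    return None


print(guess_language("\u3072\u3089\u304c\u306a"))  # Hiragana sample
print(guess_language("\u9f52"))                    # the "teeth" ideograph
```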

Best regards,

James Kass.


