On Wed, 04 Jun 1997 14:34:06 +0200, Kent Karlsson wrote:
I hope that Chris can comment on whether or not this is a problem, and what
remedial action needs to be taken.
> B. This trick is designed for UTF-8 only, and does *not* work for
> Unicode/ISO/IEC10646 in general, which means it **cannot** be
> transformed into UTF-16 (nor UCS-4), without using some
> *other* way of representing the language tags.
I guess the issue boils down to "what do you believe will be used in the
future to represent plain text." That, in turn, depends upon your feelings
about the future of ASCII.
The following comments are based upon my own observations, and are thus
1) I've noted tremendous resistance to assigning codepoints outside of the
BMP, so I'm not much concerned about UCS-4.
2) Similarly, if I'm not mistaken, the differences between UTF-16, UCS-2, and
a sequence of 16-bit BMP codepoints are only an issue when characters are
assigned outside of the BMP.
3) Even if characters are assigned outside of the BMP, it is not necessarily
the case that the Internet community will use them.
4) ASCII will be around for a very long time in the Internet, which strongly
suggests a long-term usage of UTF-8 in preference to other encodings.
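
Points 2 and 4 can be illustrated with a small sketch. It assumes Python 3 and its built-in codecs; the helper names `utf16_units` and `is_ascii_transparent` are made up for this example and are not part of MLSF or any Unicode specification.

```python
def utf16_units(text):
    """Return the 16-bit code units of text in big-endian UTF-16 (no BOM)."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def is_ascii_transparent(text):
    """True if text is pure ASCII and its UTF-8 bytes equal its ASCII bytes."""
    try:
        return text.encode("utf-8") == text.encode("ascii")
    except UnicodeEncodeError:
        # Not representable in ASCII at all.
        return False


# Point 4: ASCII text is byte-for-byte unchanged when encoded as UTF-8.
assert is_ascii_transparent("plain text")

# Point 2: for BMP-only text, the UTF-16 code units are simply the
# codepoints themselves, one 16-bit unit per character.
bmp_text = "na\u00efve \u65e5\u672c"
assert utf16_units(bmp_text) == [ord(ch) for ch in bmp_text]

# Outside the BMP, UTF-16 needs a surrogate pair, which is where it
# stops being a plain sequence of 16-bit codepoints.
assert utf16_units("\U00010000") == [0xD800, 0xDC00]
```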
Now, in my own application, it is certainly the case that I'd use UTF-16 (more
precisely, a sequence of 16-bit BMP codepoints) in certain cases. Yet, in all
of those foreseeable cases, I specifically do NOT want to deal with language
tags. I appreciate the useful property that the string size in 16-bit hextets
is the same as the number of the characters in the string (which is
essentially the reason why I would use this instead of UTF-8 in these certain
cases). Thus, a situation in which UTF-16 does not have embedded language
tags is quite acceptable to me.
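
The size property mentioned above can be checked with a short sketch, again assuming Python 3 and its built-in codecs. The function name `sizes` is hypothetical and exists only for this illustration.

```python
def sizes(text):
    """Return (characters, UTF-8 bytes, 16-bit UTF-16 units) for text."""
    chars = len(text)                               # number of codepoints
    utf8_bytes = len(text.encode("utf-8"))          # variable-width bytes
    utf16_units = len(text.encode("utf-16-be")) // 2  # 16-bit units, no BOM
    return chars, utf8_bytes, utf16_units


# BMP-only text: the 16-bit unit count equals the character count,
# while the UTF-8 byte count does not.
bmp_text = "\u65e5\u672c\u8a9e"
assert sizes(bmp_text) == (3, 9, 3)

# One character outside the BMP breaks the equality, because it takes
# two 16-bit units (a surrogate pair) and four UTF-8 bytes.
assert sizes(bmp_text + "\U00010000") == (4, 13, 5)
```

The equality holds only while every character stays inside the BMP, which is the restriction stated above.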
It would certainly be wonderful if the world switched to UTF-16 en masse one
day. But I just don't see that happening.
> C. "Higher level protocols" (e.g. MS-doc/RTF, HTML, etc., etc.)
> seems to be a more suitable place for handling language tags
> (and is where they are handled now).
This is the crux of the issue:
** Should there be such a thing as multilingual Unicode plain text? **
For years, the answer from Unicode has been "no, use rich text instead."
The time has come that this is no longer an acceptable answer. There are
those of us who *insist* upon having multilingual Unicode plain text. This
demand has been pent up for too long.
We would be pleased if Unicode were to provide an Officially Blessed Method to
do Unicode plain text. Unicode has had proposals (e.g. to assign codepoints
for language tags) within its membership to do this, and has rejected these
Actually, I think that the argument of "don't do this within the BMP" is
compelling. But that does not translate to "don't do this within a Unicode
Feel grateful that we have not asked for a 7-bit version of MLSF. Not yet,
I really feel that the most positive thing for Unicode to do at this point is
to announce "tagged UTF-8" (and perhaps "tagged UTF-7") which follow the basic
notion of MLSF (if not an exact copy).
> IMHO, MLSF should thus **not** be used.
I strongly disagree with this statement; unless an acceptable substitute can
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:34 EDT