At 12:58 PM -0700 6/6/97, "Martin J. Duerst" <firstname.lastname@example.org> wrote:
><<>> The term 'language tag' should be reserved for the short identifier
><<>> of RFC 1766 [RFC-1766] that only serves to identify the language.
><<>> While there may be other text attributes intimately associated with
><<>> the language of the document, such as desired font or text direction,
><<>> these should be specified with other identifiers rather than
><<>> overloading the language tag.
We have a definition for language tags. I think we should stick with it.
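As a reminder of how narrow that definition is, here is a minimal sketch of the RFC 1766 tag syntax (a primary tag of 1 to 8 letters, followed by any number of hyphen-separated subtags of 1 to 8 letters each). The function name is made up for illustration, and the check is purely syntactic: it does not consult any registry of languages or countries.

```python
import re

# RFC 1766 grammar: Language-Tag = Primary-tag *( "-" Subtag ),
# where Primary-tag and Subtag are each 1*8ALPHA.
_RFC1766_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*")


def is_rfc1766_tag(tag: str) -> bool:
    """Return True if `tag` matches the RFC 1766 syntax (hypothetical helper)."""
    return _RFC1766_TAG.fullmatch(tag) is not None


if __name__ == "__main__":
    for candidate in ("en", "en-US", "x-klingon", "en_US", "en-", "abcdefghi"):
        print(candidate, is_rfc1766_tag(candidate))
```

Nothing in that grammar leaves room for a font or a text direction, which is the point of keeping the tag to language identification only.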
>> As I see it there are several
>> alternative approaches we can take to adding language tags. Here's my
>> list along with what I see as the advantages and disadvantages of each:
>> (0) Define new character code points for the tags.
>> + Works with UCS-2 and UTF-16 as well as UTF-8.
>> - Ends up with codepoints that aren't characters.
>> - Requires support of codepoints outside of BMP.
Technically true but in practice not a problem. The surrogate character
mechanism, UTF-16, uses 2048 codepoints to encode 1048576 additional
characters.
>> - May potentially conflict with future UTC codepoint assignment or with
>> private use assignments (depending on the region used).
If it were accepted, the code point would not be reused.
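The surrogate arithmetic behind the 2048 and 1048576 figures can be sketched as follows. This is only an illustration of the UTF-16 pairing rule, with helper names invented here; it says nothing about where tag code points would actually be assigned.

```python
# UTF-16 surrogate ranges: 1024 high surrogates and 1024 low surrogates.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF

HIGH_COUNT = HIGH_END - HIGH_START + 1   # 1024
LOW_COUNT = LOW_END - LOW_START + 1      # 1024

RESERVED = HIGH_COUNT + LOW_COUNT        # code points set aside in the BMP
REACHABLE = HIGH_COUNT * LOW_COUNT       # code points addressable beyond the BMP


def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into a (high, low) pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into a single code point."""
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


if __name__ == "__main__":
    print(RESERVED, REACHABLE)
    print([hex(unit) for unit in to_surrogates(0x10FFFF)])
```

A 16-bit implementation that already passes surrogate pairs through unchanged needs no further work to carry code points outside the BMP.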
>> (1) Embed the tags using illegal UTF-8 sequences. (MLSF is one such scheme;
>> there are others.)
>> + Very easy to parse.
>> + Invisible at the codepoint level; keeps tags out of the character
>> + The additional level added is very lightweight.
>> - Not completely compatible with UTF-8.
>> - Cannot be used in conjunction with UCS-2 or UTF-16.
>> (2) Use some form of rich text.
>> + Conceptually simpler than any other scheme.
>> + Lots of experience with systems of this sort; we know they work.
>> - Very heavy compared to (1).
>This is a very good overview. The second point of (2) takes up what
>I was saying earlier when referring to IETF engineering principles.
>What I don't agree with is the second point in (1). Whether these
>codes are invisible, whether they will turn up as something else
>(whatever that might be) or whether they will break an application
>is totally open. Anything may happen, and will happen. That
>the tags are not conforming to the UTF-8 syntax neither makes
>them invisible (unless you strip them, which you can do with
>all other solutions) nor does it keep them out of the
>That is definitely the best idea I have seen so far!
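The point that the outcome is "totally open" can be illustrated with a small sketch. The byte 0xFF below is only a stand-in for an out-of-band tag byte (it is not the actual MLSF encoding, which is not reproduced here); the only fact relied on is that 0xFF never occurs in well-formed UTF-8. The three error policies are those of Python's own decoder and stand in for three differently written receiving applications.

```python
# "hello ", then one byte that is never valid in UTF-8, then "world".
TAGGED = b"hello " + b"\xff" + b"world"


def decode_outcomes(data: bytes) -> dict:
    """Decode `data` under three error policies and report what each one does."""
    results = {}
    for policy in ("strict", "replace", "ignore"):
        try:
            results[policy] = data.decode("utf-8", errors=policy)
        except UnicodeDecodeError as exc:
            # A strict receiver refuses the whole string.
            results[policy] = f"rejected at byte offset {exc.start}"
    return results


if __name__ == "__main__":
    for policy, outcome in decode_outcomes(TAGGED).items():
        print(f"{policy:8} {outcome!r}")
```

Under "strict" the text is rejected outright, under "replace" the stray byte turns into U+FFFD, and only under "ignore" does it silently disappear. Which of these a given application does is not something the sender controls.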
>> I currently like (1) the best, but like Mark I could be convinced to go
>> with some other approach. The one approach that isn't acceptable to me is
>> to say that there's no problem we have to solve here.
(0) will lead to calls for other non-character codes for use in protocols.
Unicode and ISO/IEC 10646 are character codes, and should remain character
codes.
(1) will break working software. No standard should recommend misuse of
(2) accords with IETF principles and existing standards.
There is no point in saying that (1) is more efficient if it frequently
fails to work. There is an old programming horror story quoted by Gerald
Weinberg which ends, "But your program doesn't work. If mine doesn't have
to work, it can run at 1,000 cards a second, which is faster than your card
--
Edward Cherlin          Help outlaw Spam           Everything should be made
Vice President          http://www.cauce.org       as simple as possible,
NewbieNet, Inc.         1000 members and counting  __but no simpler__.
http://www.newbie.net/  17 May 97                  Attributed to Albert Einstein