I was thinking of plain, otherwise untagged, text. Of course if you are
using HTML, XML etc. you should *only* be using the language tagging
mechanisms available and supported in those standards.
With plain otherwise untagged text there are many cases where
you can't know how to sensibly render the text (even if you know
what the basic script is) without also knowing what the language
is. In those cases at least, I think there is a strong argument that there
should be some way of indicating this within the plain text standard
itself. These plane 14 characters provide a means of doing this.
Nothing states that HTML and XML applications using the ISO 10646
character encoding must recognise and deal with these language tag
characters - only that they should accept them. Such applications
could simply display some kind of
placeholder glyph - or not display them at all. If I were writing an
XML or HTML editor I would probably include the option of
converting such characters to equivalent XML or HTML tags
when importing plain text, and have the option of trying to
convert XML or HTML language tags to these characters when
exporting to a plain Unicode text file without other mark-up.
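
As a rough illustration of the import direction described above, here is a minimal Python sketch. It assumes the Plane 14 layout in which U+E0001 LANGUAGE TAG introduces a tag, U+E0020..U+E007E mirror ASCII 0x20..0x7E, and U+E007F CANCEL TAG ends tagging. The function names (split_by_language, to_html, make_tag) are hypothetical, and wrapping runs in <span lang="..."> is only one possible choice of equivalent markup.

```python
import html

LANGUAGE_TAG = 0xE0001                   # U+E0001 LANGUAGE TAG
CANCEL_TAG = 0xE007F                     # U+E007F CANCEL TAG
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E   # tag characters mirroring ASCII 0x20..0x7E


def make_tag(lang):
    """Encode an ASCII language identifier as Plane 14 tag characters."""
    return chr(LANGUAGE_TAG) + "".join(chr(0xE0000 + ord(c)) for c in lang)


def split_by_language(text):
    """Return (language, run) pairs; language is None for untagged runs."""
    runs, buf, lang = [], [], None
    i = 0
    while i < len(text):
        cp = ord(text[i])
        if cp == LANGUAGE_TAG:
            # A new tag starts: close the run collected so far.
            if buf:
                runs.append((lang, "".join(buf)))
                buf = []
            i += 1
            tag = []
            # Collect the tag characters and map them back to ASCII.
            while i < len(text) and TAG_FIRST <= ord(text[i]) <= TAG_LAST:
                tag.append(chr(ord(text[i]) - 0xE0000))
                i += 1
            lang = "".join(tag) or None
            continue
        if cp == CANCEL_TAG:
            # Cancel the current language: close the run and reset.
            if buf:
                runs.append((lang, "".join(buf)))
                buf = []
            lang = None
        elif not (TAG_FIRST <= cp <= TAG_LAST):
            # Ordinary character; stray tag characters are simply dropped.
            buf.append(text[i])
        i += 1
    if buf:
        runs.append((lang, "".join(buf)))
    return runs


def to_html(text):
    """Replace Plane 14 language tags with <span lang="..."> wrappers."""
    parts = []
    for lang, run in split_by_language(text):
        escaped = html.escape(run)
        if lang:
            parts.append(f'<span lang="{html.escape(lang)}">{escaped}</span>')
        else:
            parts.append(escaped)
    return "".join(parts)


sample = make_tag("fr") + "bonjour" + chr(CANCEL_TAG) + " hello"
print(to_html(sample))  # <span lang="fr">bonjour</span> hello
```

The export direction would run the other way: walk the lang attributes of the markup and emit make_tag(lang) before each run. Nested elements would need some policy for re-emitting the outer language after an inner run ends, which this sketch does not attempt.
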
The other thing I was (wryly) observing is that once
you put some sort of language tags in text - and make use of
them - then, for about the same amount of work, you could be using
language tags (not necessarily of the same sort) to switch
between standard encodings for those scripts - something
most people using Unicode are trying to get away from.
This observation of course naively assumes a standard
encoding for each script. I wasn't trying to suggest
for a moment anyone should actually do this - simply
trying to point out that in many cases you can't get
away from specifying language as well as script system
even if you want to. Script systems are of course
more or less "tagged" simply by the code range of
BTW shouldn't we be speaking of "Plane 0E language tags" instead of
"Plane 14" since hexadecimal notation is normative in the ISO 10646
and Unicode Standards? ISO 10646 has 256 planes (00 to FF), so
there is another "plane 14" (hex), i.e. plane 20 decimal. [For similar
reasons I think the use of hexadecimal character entities rather
than decimal character entities should be encouraged in XML & HTML]
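
To make the hex/decimal ambiguity concrete, here is a small Python sketch. The helper name plane_of is hypothetical; the arithmetic only assumes that a plane number is the code point shifted right by 16 bits. It also shows the same character written as a hexadecimal and as a decimal numeric character reference.

```python
def plane_of(code_point):
    """Plane number of a code point: everything above the low 16 bits."""
    return code_point >> 16


cp = 0xE0001  # LANGUAGE TAG

# "Plane 14" in decimal is plane 0E in hexadecimal notation.
print(plane_of(cp), f"{plane_of(cp):02X}")   # 14 0E

# Plane 14 read as hexadecimal is a different plane (decimal 20),
# starting at 0x140000, beyond the 0x10FFFF limit of Unicode.
print(f"{0x14 << 16:X}", 0x14)               # 140000 20

# The same character as hexadecimal and decimal character references.
hex_ref = f"&#x{cp:X};"
dec_ref = f"&#{cp};"
print(hex_ref, dec_ref)                      # &#xE0001; &#917505;
```

The hexadecimal form lines up directly with the code charts, while the decimal form has to be converted before it can be looked up.
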
----- Original Message -----
From: Martin J. Duerst <email@example.com>
To: Unicode List <firstname.lastname@example.org>
Cc: Unicode List <email@example.com>
Sent: Wednesday, January 26, 2000 7:11 AM
Subject: Re: Plane 14 language tags
> At 09:11 00/01/19 -0800, Christopher John Fynn wrote:
> > One reason it might be a good idea to encourage the use of plane 14
> > tags rather than tagging schemes outside Unicode is that once you have
> > external tags it becomes almost as easy to support a set of separate
> > encoding standards for individual languages as it is to support Unicode.
> Chris - I think this is completely wrong. Please have a look at HTML
> and XML. Using mixed encodings in those cases would be a true nightmare.
> Regards, Martin.
> #-#-# Martin J. Du"rst, World Wide Web Consortium
> #-#-# mailto:firstname.lastname@example.org http://www.w3.org