Re: Embedded language ID pr

From: Asmus Freytag
Date: Tue Sep 12 1995 - 01:51:14 EDT

We need to separate a few issues here.

1) use and purpose of language identifiers vs. other types of tags
2) standardization of ids as such (i.e. which small integer means which
3) encoding of any language identifiers for use in a) plain text and b)
other protocols

We also need to distinguish between Unicode(R), the Consortium, and
Unicode(tm), the Standard. The former may well take up issues 1-3,
not by including them in the latter, but as separate
items, e.g. technical reports. [Only issue 3a) could even remotely
become part of the standard, but that is a statement of scope, not of

To points raised in this posting:

>> Currently, our use of language identifiers is somewhat limited.
>> are used for things like determining codeset and font (no widespread

the above are clearly contrary to anything the UTC might tackle

>> deployment of Unicode yet), when to switch segmentation algorithms,

this is the kind of example where Unicode's character semantics, which
are defined in a language-neutral way, could in the long term be
augmented so that language-specific variations could be specified.
Treatment of numbers in a Hebrew vs. Arabic context is one such

>> and flags to invoke other, language-specific tools
>> (e.g. spell-checking, sorting, morphological analysis).
>> Another concern is that the adoption of a language id approach in a
>> codeset standard might act as a bad precedent. It could open doors
>> for other features that don't really belong in a codeset standard.

>As I see it, the language identifier is actually part of a greater
>range of information that you need for a text, such as how are
>numbers represented, date formats etc. This is also known as the
>locale in C and POSIX terms. There is a general need to know
>which locale any text should be understood by. This information
>can be given out-of-band or in-stream. What I would propose is
>a standardized way to invoke a locale in-stream to solve the
>As also noted above there is a need for this capacity also outside
>UNICODE/10646 and thus I think that UNICODE/10646 encoding is not
>the right way to standardize it in.

One certainly does NOT want to assign Unicode code points directly
as language tags as the first step. This just muddles everything.
However, I believe the UTC should discuss this issue and consider
whether it might be worthwhile to come up with a position regarding
language identifiers (issue 1 above), especially in what way the
knowledge of text language would influence the _interpretation_ of a
stream of Unicode encoded text. UTC could then review existing sources
of language identifiers and recommend improvements, up to and including
setting up its own list as a last resort. None of this would mean that
these are incorporated into the Unicode Standard as such.
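
To make "influence the _interpretation_" concrete, here is a minimal
sketch in Python. It is an illustration only and not part of any UTC
proposal: the function name, the two-letter language codes, and the
choice of Turkish-style case mapping as the example are all assumptions
made for this sketch. The language is supplied out-of-band, as an
argument, so no code points are spent on tagging.

```python
# Hypothetical illustration: the language travels out-of-band (as an
# argument) rather than as characters inside the text stream.

# Assumed two-letter codes for languages that distinguish dotted and
# dotless i (Turkish, Azerbaijani).
DOTTED_I_LANGS = {"tr", "az"}


def upper_for_language(text: str, lang: str) -> str:
    """Uppercase text, adjusting the mapping of 'i' when lang requires it."""
    if lang in DOTTED_I_LANGS:
        # In these languages the capital of 'i' is U+0130 (capital I
        # with dot above), not the language-neutral 'I'.
        text = text.replace("i", "\u0130")
    # Default, language-neutral case mapping for everything else.
    return text.upper()


print(upper_for_language("istanbul", "en"))
print(upper_for_language("istanbul", "tr"))
```

The same code points come out differently depending only on the
language passed alongside them, which is the sort of interpretation
rule a position paper could describe without touching the Standard.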

Only if (and that's a big if) there is a compelling need to create
yet another form of mark-up syntax could a data-stream format be part
of the discussion. I tend to think that the existence of a lot of
mark-up formats would mean that another one is superfluous, but am
willing to see a serious discussion.

There certainly seems to be a bit of interest in this topic, judging
from the mail volume.

