Re: Languages supported by UTF8 and UTF16

From: Mark Davis (mark.davis@icu-project.org)
Date: Sat Sep 10 2005 - 14:23:32 CDT

Next message: Michael Everson: "Re: Languages supported by UTF8 and UTF16"

Previous message: Rein: "Re: [ATypI] IJ"
In reply to: Jukka K. Korpela: "Re: Languages supported by UTF8 and UTF16"
Next in thread: Michael Everson: "Re: Languages supported by UTF8 and UTF16"
Reply: Michael Everson: "Re: Languages supported by UTF8 and UTF16"
Reply: Jukka K. Korpela: "Re: Languages supported by UTF8 and UTF16"

The core value in your proposed formulation is that it does make sense
to talk about 'Unicode supporting a language X' as meaning something
like 'plain text in the customary writing system for X can be
represented as a sequence of Unicode characters'.

Beyond that, there are problems in your formulation.

> All living languages, and many dead languages, can be written in their
> normal writing system(s) using Unicode characters.

1. It is not true of "all living languages"; there are some minority
languages that need additional characters. (Part of the problem here is
that we didn't apply the generative model consistently enough; had we
done that, many of these characters could be represented right now by
sequences.)

> However, some
> of their characters cannot be represented as single Unicode characters
> but as combinations.

3. The 'however' is misleading. It is not a deficiency that some of what
users may perceive as separate characters are encoded as sequences.
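To make the point about sequences concrete, here is a minimal sketch in
Python. It assumes only the standard unicodedata module; the helper name
nfc_code_points and the two sample strings are my own illustrations, not
anything taken from the Standard. It shows one user-perceived character
that has a single precomposed code point and one that can only be
written as a base letter plus a combining mark.

```python
import unicodedata

def nfc_code_points(text):
    """Return the code points of `text` after NFC normalization."""
    composed = unicodedata.normalize("NFC", text)
    return [f"U+{ord(ch):04X}" for ch in composed]

# e + COMBINING ACUTE ACCENT: a precomposed character exists (U+00E9).
e_acute = "e\u0301"

# q + COMBINING TILDE: no precomposed character exists, so the
# sequence stays a sequence even after composition.
q_tilde = "q\u0303"

print(nfc_code_points(e_acute))   # ['U+00E9']
print(nfc_code_points(q_tilde))   # ['U+0071', 'U+0303']
```

Both strings are equally valid plain text; the second is simply
represented by two code points rather than one.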

> Some orthographic and typographic constructs, which
> could in principle be expressed in plain text, cannot be expressed
> in Unicode.

4. Also not a deficiency. If Unicode attempted to encode all typographic
constructs, it would be a horrible mess. It provides a foundation for
other mechanisms (CSS, etc.) to build upon; they can provide
typographical constructs. As for 'orthographic constructs', you'd have
to provide examples of what you mean.

> Some of the properties of characters as defined by the
> Unicode Standard do not correspond to their behavior in different
> languages.

5. Again, you'd have to provide examples to clarify what you mean.

> Moreover, Unicode is meant to describe plain text only, so it generally
> lacks any support that might be needed for display and processing of
> text by language-specific rules.

6. Again, this is by design, to avoid the above-mentioned horrible mess.
If you want language tagging so as to customize appearance for different
languages, use higher-level markup or structure, such as xml:lang or
equivalent.
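As a rough illustration of language tagging living in markup rather than
in the character data, the sketch below reads xml:lang with Python's
standard xml.etree.ElementTree. The sample document, the constant
XML_LANG and the function languages_by_paragraph are hypothetical names
made up for this example, and the inheritance logic is deliberately
simplified: it only falls back to the root element's language.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<doc xml:lang="en">
  <p>Plain English text.</p>
  <p xml:lang="sv">Svensk text.</p>
</doc>"""

def languages_by_paragraph(xml_text):
    """Pair each <p> element's text with its effective language tag.

    Simplification: a <p> without its own xml:lang inherits the tag
    from the root element only, not from intermediate ancestors.
    """
    root = ET.fromstring(xml_text)
    default_lang = root.get(XML_LANG)
    return [(p.text, p.get(XML_LANG, default_lang)) for p in root.iter("p")]

for text, lang in languages_by_paragraph(SAMPLE):
    print(lang, text)
```

The characters themselves carry no language information; a renderer or
other process that needs it takes it from the markup.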

What the Unicode Consortium *does* provide is a mechanism for providing
language-specific tailorings of specified behavior. Look at collation,
for example, where the Unicode Consortium supplies a default basis for
ordering in the UCA, but then also provides a repository of
language-based tailorings of the UCA in the CLDR.
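The idea of a default ordering plus a language-specific tailoring can be
sketched in a few lines of Python. To be clear about the assumptions:
this is a toy, not the UCA algorithm and not the CLDR data format. The
function names default_key and swedish_key are invented for the example,
and the "default" here just compares base letters with accents as a
tie-break. The Swedish rule it imitates (å, ä, ö sorted after z) is the
kind of thing a real tailoring expresses.

```python
import unicodedata

def default_key(word):
    """Toy default order: base letters first, accents only as a tie-break."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, decomposed)

# Toy Swedish tailoring: a-ring, a-diaeresis and o-diaeresis are
# separate letters that sort after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}

def swedish_key(word):
    """Rank each letter by its position in the Swedish alphabet."""
    composed = unicodedata.normalize("NFC", word.lower())
    # Letters outside the toy alphabet sort after everything else.
    return [SWEDISH_RANK.get(ch, len(SWEDISH_ALPHABET)) for ch in composed]

words = ["zebra", "\u00e4pple", "apa", "\u00f6", "orm"]

print(sorted(words, key=default_key))  # apa, äpple, ö, orm, zebra
print(sorted(words, key=swedish_key))  # apa, orm, zebra, äpple, ö
```

The same strings come out in a different order depending on which key
is used, with no change to the characters themselves. A real
implementation would use a UCA-conformant library with CLDR data rather
than a hand-written table.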

Mark

Jukka K. Korpela wrote:
> On Fri, 9 Sep 2005, Doug Ewell wrote:
>
>> I'm afraid the list is at risk of falling into a hole debating this "how
>> many languages on the head of a pin" question, when the real underlying
>> question may be completely different.
>
>
> Indeed, especially since the question was probably based on a
> misconception on one thing at least, since it asked about encoding forms
> and not Unicode.
>
> While waiting for a clarification to the question, we can still discuss
> _another_ question, namely that of language support by Unicode. There
> seems to be confusion around it, too, and the question itself is
> somewhat obscure. For example, does "Unicode" mean the Unicode
> repertoire of characters, or the Unicode Standard, or the Unicode
> Consortium?
>
> I'd say that the short answer to the question "what languages are
> supported by the Unicode Standard?" would be as follows (without trying
> to clarify the question much - can't do that in a _short_ answer):
>
> All living languages, and many dead languages, can be written in their
> normal writing system(s) using Unicode characters. However, some
> of their characters cannot be represented as single Unicode characters
> but as combinations. Some orthographic and typographic constructs, which
> could in principle be expressed in plain text, cannot be expressed
> in Unicode. Some of the properties of characters as defined by the
> Unicode Standard do not correspond to their behavior in different
> languages.
> Moreover, Unicode is meant to describe plain text only, so it generally
> lacks any support that might be needed for display and processing of
> text by language-specific rules.
>
> Well, that's not very short, really. Neither is it very understandable,
> since it lacks examples. The point, anyway, is that "support to a
> language" can mean much more than just presence of all characters used
> in a language. It's also debatable, since people may disagree on what
> really belongs to a language, even at the character level. Moreover,
> it's debatable what can be regarded as "support". For example, if the
> rules of a language require a thin nonbreakable space before or after
> some punctuation marks, can we claim that Unicode "supports" it, since
> you can use a thin space character with a zero width no-break space on
> both sides of it?
>
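For reference, the sequence described in that last example can be
written out explicitly. This is only a sketch of the construction as
worded above (U+2009 THIN SPACE with U+FEFF ZERO WIDTH NO-BREAK SPACE on
each side); the names glue_thin_space and describe are made up for the
example, and whether a given renderer or line breaker honours the
sequence is outside what the code can show.

```python
import unicodedata

THIN_SPACE = "\u2009"   # THIN SPACE
ZWNBSP = "\ufeff"       # ZERO WIDTH NO-BREAK SPACE

def glue_thin_space():
    """Build the three-character sequence: ZWNBSP, thin space, ZWNBSP."""
    return ZWNBSP + THIN_SPACE + ZWNBSP

def describe(text):
    """List the Unicode character names in `text`, in order."""
    return [unicodedata.name(ch) for ch in text]

# Hypothetical usage: a thin, non-breaking gap before the punctuation mark.
sample = "Bonjour" + glue_thin_space() + "!"

print(describe(glue_thin_space()))
print(len(sample))  # 7 letters + 3 spacing characters + 1 mark = 11
```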

