Re: Languages supported by UTF8 and UTF16

From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Sat Sep 10 2005 - 17:40:50 CDT

Next message: Anto'nio Martins-Tuva'lkin: "Re: Languages supported by UTF8 and UTF16"

Previous message: Michael Everson: "Re: Languages supported by UTF8 and UTF16"
In reply to: Mark Davis: "Re: Languages supported by UTF8 and UTF16"
Next in thread: Anto'nio Martins-Tuva'lkin: "Re: Languages supported by UTF8 and UTF16"
Reply: Anto'nio Martins-Tuva'lkin: "Re: Languages supported by UTF8 and UTF16"
Reply: Mark Davis: "Re: Languages supported by UTF8 and UTF16"

On Sat, 10 Sep 2005, Mark Davis wrote:

> Beyond that, there are problems in your formulation.

Certainly. It was just an initial attempt to list some issues, in
order to find a useful answer to the question "what languages does Unicode
support". I'm assuming that the question is relatively frequent. It's
important at least indirectly: there are many short descriptions of
Unicode that make the unqualified, unrestricted claim that Unicode
supports all languages, or something similar.

>> All living languages, and many dead languages, can be written in their
>> normal writing system(s) using Unicode characters.
>
> 1. It is not true of "all living languages"; there are some minority
> languages that need additional characters.

My tentative answer was much simplified, and it plays with the word
"normal". What worries me is how this thing could be expressed so the
common man, or maybe even a pointy-haired boss, would get roughly the
right idea. "All living languages" is too much, and "Almost all living
languages" is rather vague. "Minority language" here probably means a much
smaller language than the average person thinks.

>> However, some
>> of their characters cannot be represented as single Unicode characters
>> but as combinations.
>
> 3. The 'however' is misleading. It is not a deficiency that some of what
> users may perceive of as separate characters are encoded by sequences.

I know it can be misleading, but if we just say that Unicode provides a
unique number for every character, it's misleading, too. And even
incorrect. ( http://www.unicode.org/standard/WhatIsUnicode.html )

After one gets acquainted with Unicode, it probably comes as a surprise to
most people that some characters (using a person's intuitive
understanding of "character") cannot, after all, be represented
using a single Unicode code point.

As a thought experiment, let us suppose that the letter "w" had not been
included in ASCII or other character codes but written as "vv", and that
Unicode had not changed this. If people then asked for "w", the
answer would be that it is just a typographic variant of "vv", a ligature
(as it historically is, in fact). Maybe after quite some debate we would
then be told to use the combination of three Unicode characters,
"v", word joiner, and "v". Could we then _really_ say that Unicode
supports the English alphabet, for example, and a separate code point is
not needed for "w"?

Unicode contains _most_ accented letters used in human languages
as precomposed characters, but not all. There's a clear distinction here.
I'm not questioning the policy decision that effectively freezes the set
of such precomposed characters. What I'm saying is that we should admit
that it implies differences in how languages are supported.
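
The distinction can be made concrete with a minimal Python sketch that
uses the standard unicodedata module. The helper name nfc_length is
hypothetical, and "x" with an acute accent is chosen only as an
illustration of a base-plus-accent combination that has no precomposed
code point; it is not meant as a letter of any particular language.

```python
import unicodedata

def nfc_length(text: str) -> int:
    """Number of code points left after canonical composition (NFC)."""
    return len(unicodedata.normalize("NFC", text))

e_acute = "e\u0301"  # e + COMBINING ACUTE ACCENT
x_acute = "x\u0301"  # x + COMBINING ACUTE ACCENT (illustrative choice)

# A precomposed form exists for e-acute (U+00E9), so NFC yields one code point.
print(nfc_length(e_acute))
# No precomposed form exists for x-acute, so it remains a two-code-point sequence.
print(nfc_length(x_acute))
```

Both strings are what a reader would call "one character", yet only the
first can be stored as a single code point.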

>> Some orthographic and typographic constructs, which
>> could in principle be expressed in plain text, cannot be expressed
>> in Unicode.
>
> 4. Also not a deficiency. If Unicode attempted to encode all typographic
> constructs, it would be a horrible mess. It provides a foundation for other
> mechanisms (CSS, etc) to build upon; they can provide typographical
> constructs. And by 'orthographic constructs', you'd have to provide examples
> of what you mean.

Here, too, my text was supposed to address people's intuitive expectations
rather than take a position on what should or should not be encoded in
Unicode. For example, since the acute accent used in French, the accent
used in Polish, and the tonos in Greek normally look different from each
other, it is natural to expect that they are treated as different marks.
Unicode may have made the right choice in unifying all of them to acute
accent, but it still means that a difference that could have been made in
plain text - and, in some people's opinion, should have been - cannot
be made in Unicode. If someone says that Unicode does not support Polish
because Unicode does not recognize the Polish accent mark as distinct from
the French mark, he might well be wrong by some criteria; but it's still
an opinion that people have and that is ultimately not a completely
objective matter.
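
The unification itself is easy to observe. The following is a small
sketch in Python, again using the standard unicodedata module; the helper
name combining_marks is hypothetical. It decomposes a French, a Polish,
and a Greek letter and lists the combining marks that result.

```python
import unicodedata

def combining_marks(ch: str) -> list:
    """Names of the combining marks in the canonical decomposition of ch."""
    decomposed = unicodedata.normalize("NFD", ch)
    return [unicodedata.name(c) for c in decomposed if unicodedata.combining(c)]

# U+00E9 (e with acute), U+0144 (n with acute), U+03AC (alpha with tonos)
samples = ("\u00e9", "\u0144", "\u03ac")
for ch in samples:
    print(unicodedata.name(ch), "->", combining_marks(ch))
```

All three letters decompose to a base letter plus the same mark,
COMBINING ACUTE ACCENT, so any difference in shape has to come from the
font or from information outside the plain text.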

>> Some of the properties of characters as defined by the
>> Unicode Standard do not correspond to their behavior in different
>> languages.
>
> 5. Again, you'd have to provide examples to clarify what you mean.

For example, line breaking behavior. Unicode line breaking rules allow,
for example, a line break after ":" in a string like "YK:ssa". Yet, that
string is the way to write an inflected form of an abbreviation in
Finnish, and such an expression should not be divided, and if it really
needs to be divided, the break must not appear after the colon but at a
syllable boundary ("YK:s-sa"). I take this example because I have had to
fight with such problems when fixing the typesetting of books: the
typesetting program implemented this part of the Unicode line breaking rules
but offered no way to override them except with awkward tricks.
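
One plain-text workaround of this kind can be sketched in Python. It
assumes that the rendering or typesetting program honors WORD JOINER
(U+2060) as forbidding a line break and SOFT HYPHEN (U+00AD) as an
optional hyphenation point, which not every program does. The function
name protect_inflected_abbreviation is hypothetical, and the hyphenation
position is supplied by hand rather than computed.

```python
WORD_JOINER = "\u2060"  # forbids a line break at its position
SOFT_HYPHEN = "\u00ad"  # marks an optional hyphenation point

def protect_inflected_abbreviation(word: str, hyphen_at: int) -> str:
    """Forbid a break after each colon; allow hyphenation before index hyphen_at."""
    head, tail = word[:hyphen_at], word[hyphen_at:]
    marked = head + SOFT_HYPHEN + tail
    return marked.replace(":", ":" + WORD_JOINER)

# "YK:ssa" with the only permitted division "YK:s-sa"
result = protect_inflected_abbreviation("YK:ssa", 4)
print([f"U+{ord(c):04X}" for c in result])
```

The result carries two invisible characters that are not part of Finnish
orthography, which is why such a fix counts as a trick rather than a
solution.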

In this example, the point is that the default line breaking rules break
the processing of texts, by introducing allowed break points that might be
OK in many contexts, but not in others. I know that such rules are not
normative and they are meant to be defaults that can be overridden as
needed, but that's largely just theory. The point is that by introducing
such properties, Unicode has created, for work with text at the plain text
level, problems that didn't exist in earlier character codes. (Of course,
the default line breaking rules _solve_ many problems, too.)

>> Moreover, Unicode is meant to describe plain text only, so it generally
>> lacks any support that might be needed for display and processing of
>> text by language-specific rules.
>
> 6. Again, by design, to avoid above-mentioned horrible mess. If you want
> language tagging so as to customize appearance for different languages, use
> higher level markup or structure, such as xml:lang or equivalent.

My statement was meant to explain, and perhaps apologize a bit, rather
than to present the issue as a drawback of Unicode. Here, too, we have a
problem of expectations. Unicode has, after all, quite a few features
that actually operate on a level higher than plain text, such as
typographic variants encoded as characters, even variation selectors, and
language tags that are not recommended but still exist in Unicode.
The question then arises why this or that feature is absent.
And in reality, this means that the level of support for languages varies
by language. If someone expresses this by saying that Unicode does not
support this or that language, because Unicode does not recognize some
difference as a character difference but just as a glyph difference that
does not affect the coding of characters, I would say that the opinion
is not quite right - but it is understandable, and it demonstrates a bit
how relative the "support for a language" concept is.
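
The higher-level features mentioned above can be inspected with a short
Python sketch based on the standard unicodedata module. The three code
points are picked only as examples: one ligature encoded as a character,
one variation selector, and the language tag character.

```python
import unicodedata

# A typographic variant encoded as a character: compatibility
# normalization (NFKC) folds it back to two ordinary letters.
ligature = "\ufb01"
folded = unicodedata.normalize("NFKC", ligature)
print(unicodedata.name(ligature), "->", folded)

# A variation selector and the language tag character, by name.
names = {cp: unicodedata.name(chr(cp)) for cp in (0xFE00, 0xE0001)}
for cp, name in names.items():
    print(f"U+{cp:04X}", name)
```

None of these code points adds a letter to any language; they carry
presentation or markup-like information inside the character stream.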

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/

This archive was generated by hypermail 2.1.5 : Sat Sep 10 2005 - 17:42:16 CDT