Re: Languages supported by UTF8 and UTF16

From: Mark Davis (mark.davis@icu-project.org)
Date: Sun Sep 11 2005 - 22:10:58 CDT

Next message: Doug Ewell: "Re: Languages supported by UTF8 and UTF16"

Previous message: Richard Wordingham: "Re: Languages supported by UTF8 and UTF16"
In reply to: Jukka K. Korpela: "Re: Languages supported by UTF8 and UTF16"
Next in thread: Antoine Leca: "Re: Languages supported by UTF8 and UTF16"

Jukka K. Korpela wrote:
> On Sat, 10 Sep 2005, Mark Davis wrote:
>
>> Beyond that, there are problems in your formulation.
>
>
> Certainly. It was just an initial attempt to list down some issues, in
> order to find a useful answer to the question "what languages does
> Unicode support". I'm assuming that the question is relatively frequent.
> It's important at least indirectly: there are many short descriptions of
> Unicode that make the unqualified, unrestricted claim that Unicode
> supports all languages, or something similar.
>
>>> All living languages, and many dead languages, can be written in their
>>> normal writing system(s) using Unicode characters.
>>
>>
>> 1. It is not true of "all living languages"; there are some minority
>> languages that need additional characters.
>
>
> My tentative answer was much simplified, and it plays with the word
> "normal". What worries me is how this thing could be expressed so the
> common man, or maybe even a pointy-haired boss, would get roughly the
> right idea. "All living languages" is too much, and "Almost all living
> languages" is rather vague. "Minority language" means here probably much
> smaller than the average man thinks.

This is a difficult item. You're right that minority here means
something *exceedingly* rare, in terms of the volume of either modern
printed material or text stored on computers.

One could try to formulate it as:

Unicode supports* essentially all modern languages of the world.
or
Unicode supports all modern languages of the world, except for some very
rare languages.

I'm not quite happy about either of these.

(* Supporting a language X means that plain text in the customary
writing systems for X can be represented as a sequence of Unicode
characters.)

>
>>> However, some
>>> of their characters cannot be represented as single Unicode characters
>>> but as combinations.
>>
>>
>> 3. The 'however' is misleading. It is not a deficiency that some of
>> what users may perceive of as separate characters are encoded by
>> sequences.
>
>
> I know it can be misleading, but if we just say that Unicode provides a
> unique number for every character, it's misleading, too. And even
> incorrect. ( http://www.unicode.org/standard/WhatIsUnicode.html )
>
> After one gets acquainted with Unicode, it probably comes as a surprise
> to most people that some characters (using a person's intuitive
> understanding of "character") cannot, after all, be represented
> using a single Unicode code point.
>
> As a thought experiment, let us suppose that the letter "w" had not been
> included into ASCII or other character codes but written as "vv", and
> that Unicode had not changed this. When people would then ask for "w",
> the answer would be that it is just a typographic variant of "vv", a
> ligature
> (as it historically is, in fact). Maybe after quite some debate we would
> then be told to use the combination of three Unicode characters,
> "v", word joiner, and "v". Could we then _really_ say that Unicode
> supports the English alphabet, for example, and a separate code point is
> not needed for "w"?
>
> Unicode contains _most_ accented letters used in human languages
> as precomposed characters, but not all. There's a clear distinction here.
> I'm not questioning the policy decision that effectively freezes the set
> of such precomposed characters. What I'm saying is that we should admit
> that it implies differences on how languages are supported.

The Unicode notion of 'character' may differ from users' perceptions, as
noted earlier. The average person where I come from considers 'a'
and 'A' to be the same character; they are just lowercase and uppercase
variants of the same thing. One could have hewn more closely to *that*
model, by encoding 'a' as the character, with a following U+XXXX FORMAT
UPPERCASE format character to specify that it should use the uppercase
form in display.

While there are certainly disadvantages to that model (not least,
compatibility with ASCII), it is a tenable one. And some operations
would actually be easier; caseless compare would be simply ignoring the
FORMAT UPPERCASE character.
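
A minimal sketch of that hypothetical model, in Python. Everything here is
illustrative: no FORMAT UPPERCASE character exists in Unicode, so the
private-use code point U+E000 stands in for the imaginary U+XXXX, and the
names `encode`, `caseless_key`, and `caseless_equal` are made up for this
example.

```python
# Hypothetical model: letters are stored in lowercase only, and a following
# format character means "display the preceding letter in uppercase".
# U+E000 (private use) is a stand-in for the imaginary FORMAT UPPERCASE.
FORMAT_UPPERCASE = "\uE000"


def encode(text: str) -> str:
    """Re-encode ordinary text into the hypothetical model."""
    out = []
    for ch in text:
        # Only handle letters whose lowercase form is a single code point.
        if ch.isupper() and len(ch.lower()) == 1:
            out.append(ch.lower() + FORMAT_UPPERCASE)
        else:
            out.append(ch)
    return "".join(out)


def caseless_key(encoded: str) -> str:
    """Caseless comparison key: simply ignore the format character."""
    return encoded.replace(FORMAT_UPPERCASE, "")


def caseless_equal(a: str, b: str) -> bool:
    """Compare two ordinary strings caselessly via the hypothetical model."""
    return caseless_key(encode(a)) == caseless_key(encode(b))
```

Under these assumptions, caseless comparison reduces to deleting one
character from each string, which is the simplification described above.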

But the fact that we chose a model that deviates more in some respects
from user perceptions doesn't mean that we made a worse choice. And
getting back to the combining sequences, with modern technology, it is
really not significantly harder to deal with them than it is with a
single precomposed character.
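
As a rough illustration, assuming Python's standard `unicodedata` module,
normalization lets a program treat a precomposed character and the
equivalent combining sequence alike. The helper name `canonically_equal`
is invented for this sketch.

```python
import unicodedata

precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
combining = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

# A letter that has no precomposed form: 'q' with an acute accent.
no_precomposed = "q\u0301"


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The raw code point sequences differ, but the normalized forms match.
same_raw = precomposed == combining
same_normalized = canonically_equal(precomposed, combining)

# NFC leaves the sequence alone when no precomposed character exists;
# it is still handled by exactly the same comparison.
still_two_code_points = len(unicodedata.normalize("NFC", no_precomposed))
```

The same `canonically_equal` call works whether or not a precomposed
character happens to exist for the sequence.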

>
>>> Some orthographic and typographic constructs, which
>>> could in principle be expressed in plain text, cannot be expressed
>>> in Unicode.
>>
>>
>> 4. Also not a deficiency. If Unicode attempted to encode all
>> typographic constructs, it would be a horrible mess. It provides a
>> foundation for other mechanisms (CSS, etc) to build upon; they can
>> provide typographical constructs. And by 'orthographic constructs',
>> you'd have to provide examples of what you mean.
>
>
> Here, too, my text was supposed to address people's intuitive
> expectations rather than take a position on what should or should not be
> encoded in Unicode. For example, since the acute accent used in French,
> the accent used in Polish, and the tonos in Greek normally look
> different from each other, it is natural to expect that they are treated
> as different marks.
> Unicode may have made the right choice in unifying all of them to acute
> accent, but it still means that a difference that could have been - and
> should have been made in some people's opinion - made in plain text
> cannot be made in Unicode. If someone says that Unicode does not support
> Polish because Unicode does not recognize the Polish accent mark as
> distinct from the French mark, he might well be wrong by some criteria;
> but it's still an opinion that people have and that is ultimately not a
> completely objective matter.

By that account, we would have had separate code points for 'a' in
Palatino, and Garamond, and Arial, and Times Roman, and ... This is
ultimately not a productive course of action. Is an average Frenchman
going to look at an 'è' typeset according to Polish conventions and say
to himself: "Hmmm. This looked like an e-grave, but the angle is
slightly different, so it must not actually be an e-grave; it must be
some Polish character." I don't think so. At most, he might think:
"Hmmm. An e-grave, but with a curiously listless air, missing a certain
je ne sais quoi. But artistically I do find it somewhat distressing,
thus I must console myself with a good red Burgundy."

Any modern font technology can use language information to tune the
appearance for the fastidious. That language information could be
carried with the text in a rich text format; it does not have to be
encoded in characters.

>
>>> Some of the properties of characters as defined by the
>>> Unicode Standard do not correspond to their behavior in different
>>> languages.
>>
>>
>> 5. Again, you'd have to provide examples to clarify what you mean.
>
>
> For example, line breaking behavior. Unicode line breaking rules allow,
> for example, a line break after ":" in a string like "YK:ssa". Yet, that
> string is the way to write an inflected form of an abbreviation in
> Finnish, and such an expression should not be divided, and if it really
> needs to be divided, the break must not appear after the colon but at
> syllable boundary ("YK:s-sa"). I take this example, since I have had to
> fight with such problems when fixing the typesetting of books, when the
> typesetting program supported this part of Unicode line breaking rules
> but no way to override them except with awkward tricks.
>
> In this example, the point is that the default line breaking rules break
> the processing of texts, by introducing allowed break points that might
> be OK in many contexts, but not in others. I know that such rules are
> not normative and they are meant to be defaults that can be overridden
> as needed, but that's largely just theory. The point is that by
> introducing such properties, Unicode has created, for work with text at
> the plain text level, problems that didn't exist in earlier character
> codes. (Of course, the default line breaking rules _solve_ many
> problems, too.)

I'm a bit taken aback. You think that the situation with line break
before Unicode was somehow better?! (opportunity for interrobang ;-)
Differences in line break between those languages that could be written
in 8859-1 were not magically handled by implementations of 8859-1. If
anything, you got much less predictable behavior across different
platforms between languages before Unicode.

Unicode provides a mechanism which *allows* the differences to be
specified by reference to a default; but it doesn't require it. Just to
reiterate, the Unicode properties do not at all constrain
language-sensitive line break. Such variation was left up to vendors.

By coincidence, the CLDR committee recently approved the addition of
text segmentation (including line break). You can see, if you wish, the
draft documentation of it in

http://unicode.org/cldr/data/docs/web/tr35.html#segmentations_element

Now, I do think we can make further improvements. For example, it is
*much* easier to customize boundaries by customizing property values
than by customizing rules. So it may possibly make sense for us to add
rules to the default to make it easier to customize. For example, in

http://www.unicode.org/reports/tr29/#Word_Boundaries

we could add the rule 5a listed in the notes below, but with the
property sets 'apostrophe' and 'vowels' given the empty set in the
default case.
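
A toy sketch of that idea in Python. It assumes rule 5a has the form
"break between apostrophe and vowels" (apostrophe ÷ vowels), as in the
French and Italian tailoring note; the function name and the particular
French sets below are hypothetical and chosen only for illustration. This
is not a full word-boundary implementation, just the one extra rule.

```python
def extra_word_breaks(text, apostrophe=frozenset(), vowels=frozenset()):
    """Return offsets i where the tailorable rule adds a boundary
    between text[i - 1] and text[i].

    With the default (empty) property sets the rule never fires, so the
    default behavior is unchanged.
    """
    return [
        i
        for i in range(1, len(text))
        if text[i - 1] in apostrophe and text[i] in vowels
    ]


# Hypothetical tailoring data for French: a locale only supplies
# property values, it does not have to add or rewrite any rules.
FRENCH_APOSTROPHE = frozenset("'\u2019")
FRENCH_VOWELS = frozenset("aeiouh\u00e0\u00e2\u00e9\u00e8\u00ea\u00eb\u00ee\u00ef\u00f4\u00f9\u00fb\u00fc")

default_breaks = extra_word_breaks("l'objectif")
french_breaks = extra_word_breaks(
    "l'objectif", apostrophe=FRENCH_APOSTROPHE, vowels=FRENCH_VOWELS
)
```

The rule is always present, and a tailoring switches it on merely by
filling in the two sets.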

>
>>> Moreover, Unicode is meant to describe plain text only, so it generally
>>> lacks any support that might be needed for display and processing of
>>> text by language-specific rules.
>>
>>
>> 6. Again, by design, to avoid above-mentioned horrible mess. If you
>> want language tagging so as to customize appearance for different
>> languages, use higher level markup or structure, such as xml:lang or
>> equivalent.
>
>
> My statement was meant to explain, and perhaps apologize a bit, rather
> than to present the issue as a drawback of Unicode. Here, too, we have a
> problem of expectations. Unicode has, after all, quite a few features
> that actually operate on a level higher than plain text, such as
> typographic variants encoded as characters, even variant selectors, and
> language tags that are not recommended but still exist in Unicode.
> The question then arises why this or that feature is absent.
> And in reality, this means that the level of support to languages varies
> by language. If someone expresses this by saying that Unicode does not
> support this or that language, because Unicode does not recognize some
> difference as a character difference but just as a glyph difference that
> does not affect the coding of characters, I would say that the opinion
> is not quite right - but it is understandable, and it demonstrates a bit
> how relative the "support to a language" concept is.

I agree that this is a fuzzy area; and part of the fuzziness, when we say
that Unicode supports the language X in plain text, is just what we mean
by 'supports' and 'plain text' (and for that matter, 'language': my
favorite definition being "A shprakh iz a diyalekt mit an armey un a
flot" -- Max Weinreich (some say Joshua Fishman)).

But fundamentally, I don't think that Unicode plays favorites among
languages. What mechanisms there are in it are those that the committee
felt were the minimal ones for expressing plain-text, and which dealt
with the many issues involved with backwards compatibility and
consistency of implementation. In hindsight, as with any human
enterprise, there are clearly areas where we could have done things more
cleanly (I have my own list ;-), but on the whole we have ended up with
a very serviceable approach to a very difficult problem.

Mark


This archive was generated by hypermail 2.1.5 : Sun Sep 11 2005 - 22:12:45 CDT