Re: Languages supported by UTF8 and UTF16

From: Mark Davis
Date: Sat Sep 10 2005 - 18:18:06 CDT


    Comments below.

    Michael Everson wrote:
    > At 13:56 -0700 2005-09-10, Mark Davis wrote:
    >>>> (Part of the problem here is that we didn't apply the generative
    >>>> model consistently enough; had we done that, many of these
    >>>> characters could be represented right now by sequences.)
    >>> Well you'd have to give examples of what you mean by THAT, Mark.
    >> No problem. One example: the SIL proposed 04FA CYRILLIC CAPITAL LETTER
    >> GHE WITH STROKE AND HOOK could be represented by <U+0413, U+0335,
    >> U+0321>.
    > Yes, but that generative model sucks, which is why we don't use it. At a
    > minimum the overlays can cause winding errors with white space over the
    > overlapping bits.

    Winding errors have nothing to do with the issue. As below, there is no
    implication that a sequence of characters has to be represented by the
    corresponding sequence of glyphs.

    There is nothing standing in the way of having <U+0413, U+0335, U+0321>
    be represented by a precomposed glyph in a font.
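
    The point that the generative sequence is a matter of encoding, not
    rendering, can be checked directly. A minimal sketch with Python's
    standard `unicodedata` module, showing that the sequence is three code
    points and that NFC leaves it untouched (no precomposed character
    exists to compose it into):

```python
import unicodedata

# The proposed letter as a generative sequence: GHE followed by
# COMBINING SHORT STROKE OVERLAY and COMBINING PALATALIZED HOOK BELOW.
s = "\u0413\u0335\u0321"

# NFC leaves the sequence as-is: there is no precomposed character,
# so no canonical composition applies.
assert unicodedata.normalize("NFC", s) == s
assert len(s) == 3

for ch in s:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

    Whether those three code points are drawn with one precomposed glyph
    or three overlaid ones is entirely up to the font.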

    > Personally I am a fan of precomposed glyphs (as people have been since
    > the dawn of printing). They are problematic for our users, so if we can
    > limit the problem at least by not going for the overlays, that's something.
    >> There are many other examples in Arabic.
    > Which is a completely different thing. I disagree.

    Not a different thing, not when you keep in mind that char sequence !=
    glyph sequence.

    >> Had we chosen the same mechanism for Arabic that we did for Latin (eg
    >> define common characters as precompositions, and resolution to those
    >> in NFC, but also supply generative mechanisms for others), then
    >> minority writing systems using Arabic wouldn't have to wait for years
    >> to have characters encoded for them.
    > I disagree. What I do wish is that normalization hadn't been locked down
    > before Africa's needs were dealt with.

    As to the purported premature lock-down: it's a moot point, but had we
    not locked down NFC, it would not have been tenable for anyone to use
    it. (Think of it as like code point numbers. If you don't fix the code
    point of a new character, but just make it 'tentative', nobody will
    implement it; it might as well be in the PUA.) That would have had bad
    consequences for security and any other processing of character
    equivalence through a wide variety of dependent technologies.
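
    A small illustration of what a fixed NFC buys: because the
    composition tables are stable, every conforming implementation resolves
    equivalent spellings to the same answer, which is what equivalence
    checks (and the security comparisons built on them) depend on. A
    sketch using Python's `unicodedata`:

```python
import unicodedata

# Two spellings of the same letter: precomposed A-grave vs.
# base letter A plus COMBINING GRAVE ACCENT.
precomposed = "\u00C0"
decomposed = "A\u0300"

# With NFC locked down, both resolve deterministically to the
# precomposed form, everywhere, forever.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFC", precomposed) == precomposed
```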

    And simply because something is a precomposed character doesn't make it
    automagically supported by vendors. It is, in fact, *faster* for vendors
    to support sequences by means of precomposed glyphs in fonts, rather
    than wait for a precomposed character to be encoded.

    The sequence <U+0413, U+0335, U+0321> could have been supported *years
    ago*, instead of waiting for the long process of encoding.

    > Now, thank goodness, we have
    > "named sequences" which will guide font developers, and there will, I
    > promise you, be a good many African named sequences standardized to give
    > font developers the guidance African users need them to have.

    I think we are in agreement on named sequences; they should give
    guidance to font developers as to which char sequences may need a
    precomposed glyph.

    >> Moreover, we would have avoided security issues with these kinds of
    >> characters at the same time. See the examples in
    > Um, well, the security issues are your bugaboo, and they are restricted
    > to a narrow range of activity vis à vis the UCS.

    People's cavalier attitudes towards security fade the first time they
    (or a relative or friend) are swindled due to security problems. The
    goal is to get structure in place to prevent the problems before they
    happen. Levees are boring too, until they fail.

    >>> No, but it's a problem, because font guys usually precompose, and only
    >>> precomposed glyphs are **guaranteed** 'safe' for good, consistent
    >>> typography.
    >> As you well know, what is a precomposed glyph in a font is orthogonal
    >> to what is a precomposed character in Unicode. For example, a font can
    >> have a precomposed glyph for A-with-macron-and-grave
    >> while it is represented in Unicode by <U+0100 U+0300>. (This is one of
    >> many listed in
    > The problem (if you haven't been paying attention) is that a lot of
    > people have precomposed requirements that aren't met by precomposed
    > glyphs because font guys don't know what to draw. Europe is lucky; all
    > the important letters are precomposed. Africa is unlucky; the 19 million
    > Yoruba speakers do NOT have ANY support for their letters from ANY of
    > the three main computer platforms (Windows, Mac, Linux).

    That again is disconnected from the character encoding. Just because
    something is in Unicode as a precomposed character is no guarantee that
    any particular vendor will add a corresponding glyph to their font.
    Adding a character to Unicode does not magically make fonts for it.
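
    The A-with-macron-and-grave example above makes this concrete: there is
    no precomposed Unicode character for that combination, yet fonts can
    and do draw it with a single glyph. A sketch with `unicodedata`:

```python
import unicodedata

# A WITH MACRON (U+0100) followed by COMBINING GRAVE ACCENT (U+0300):
# no precomposed character exists for this combination, so NFC keeps
# the two-code-point sequence exactly as-is.
s = "\u0100\u0300"
assert unicodedata.normalize("NFC", s) == s
assert len(s) == 2

# A font is still free to map this sequence to one precomposed glyph;
# nothing in the encoding either prevents or guarantees that.
```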

    If you really want to see this addressed, the best way is to contribute
    to the NamedSequences listings the sequences needed for minority
    languages.

    >>> Mark, we are a lo-o-o-ng way from user-tailorable collation on ANY
    >>> platform.
    >> I didn't say 'user-tailorable', I said 'language-specific tailorings'.
    >> These are two very different things. *All* significant modern
    >> platforms offer language-specific tailorings.
    > For a very very very very very small number of languages. What do we do
    > about that?

    First, it's not particularly useful to look at raw number of possible
    languages, past and present; it doesn't really matter to many people how
    Old Italic sorts. If you measure language coverage by the proportion of
    text on, say, the Internet, then the CLDR coverage is very large,
    extending down to languages that account for only 0.02% of the world's
    online population.

    Second, your claim that it is a small number (I won't repeat the very's)
    depends on an assertion that the UCA doesn't handle those languages out
    of the box. It would be interesting to see your count of which languages
    those are, and how you arrived at that figure.

    The consortium has a mechanism for language-specific collations in CLDR.
    You and anyone else are free to contribute collation sequences for
    different languages. It does take some work to get the right
    specification, but if you care about some particular languages, you can
    make a difference.

    >> As to the orthogonal issue of user-tailorable collation: certainly the
    >> technology is available to customize locales on the user level. For
    >> example:
    >> 1. Go to
    >> 2. In the custom rules box, type (or copy & paste):
    >> & c < b <<< B
    >> & everyone < Everson
    >> 3. In the source box, add a few strings, like:
    >> Everson
    >> everyone
    >> Everyone
    >> 4. Click on the Sort button. You'll see your desired ordering in the
    >> Collated box.
    > For a start the default collation orders everson before Everson and god
    > before God, which is not preferable. The English alphabet is always
    > presented Aa Bb Cc not aA bB cC (watch the Simpsons to see) and so this
    > is A Bad Thing. When I click in English, I get the same thing, and this
    > is NOT what Oxford practice specifies. Then when I click on Ireland or
    > the UK it is still wrong.

    1. English (like many other languages) has no absolute requirement on
    the order of case variants; different sources and different
    dictionaries disagree (sadly, the Simpsons might not count as
    authoritative ;-).

    2. The mechanisms are there to handle this in CLDR. If you want to see a
    demo, do the same thing as above, but in the Options list under the
    second item, choose "Force Uppercase First". (This can also be
    incorporated into the rules for specific locales.)
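
    To show the idea behind "Force Uppercase First" (this is a toy sketch,
    not ICU; the function name `uppercase_first_key` is mine, and it
    assumes simple one-cased ASCII words): sort primarily by the
    case-folded word, then break ties by putting uppercase-initial
    variants first.

```python
# Toy sketch of uppercase-first ordering: primary key is the
# case-folded word, secondary key puts uppercase-initial forms first.
def uppercase_first_key(word):
    return (word.lower(), 0 if word[:1].isupper() else 1)

words = ["everson", "God", "god", "Everson"]
print(sorted(words, key=uppercase_first_key))
# → ['Everson', 'everson', 'God', 'god']
```

    Real collation tailorings do this at the tertiary level, so case
    order is adjustable without disturbing the alphabetic ordering.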

    > I am not very happy with CLDR in this regard.

    File a bug. Ask to be on the agenda for a meeting. If you can argue
    persuasively that upper before lower is more customary in the UK and IE,
    then I'm sure the committee would make the (one-line) change.

    And even if it didn't, the committee has been discussing having locale
    ID variants for different collation settings. Those would allow for
    easily specifying desired variants.

    >> However, collations are very tricky to specify correctly, because of
    >> all the issues described in
    >>, so it is no
    >> surprise to me that platforms don't choose to offer this as a
    >> user-level option.
    > I agree with you about that.

    This archive was generated by hypermail 2.1.5 : Sat Sep 10 2005 - 18:19:04 CDT