Re: Languages supported by UTF8 and UTF16

From: Mark Davis
Date: Sat Sep 10 2005 - 18:18:06 CDT


    Comments below.

    Michael Everson wrote:
    > At 13:56 -0700 2005-09-10, Mark Davis wrote:
    >>>> (Part of the problem here is that we didn't apply the generative
    >>>> model consistently enough; had we done that, many of these
    >>>> characters could be represented right now by sequences.)
    >>> Well you'd have to give examples of what you mean by THAT, Mark.
    >> No problem. One example: the SIL proposed 04FA CYRILLIC CAPITAL LETTER
    >> GHE WITH STROKE AND HOOK could be represented by <U+0413, U+0335,
    >> U+0321>.
    > Yes, but that generative model sucks, which is why we don't use it. At a
    > minimum the overlays can cause winding errors with white space over the
    > overlapping bits.

    Winding errors have nothing to do with the issue. As below, there is no
    implication that a sequence of characters has to be represented by the
    corresponding sequence of glyphs.

    There is nothing standing in the way of having <U+0413, U+0335, U+0321>
    be represented by a precomposed glyph in a font.
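
    The point that the generative sequence is a matter of encoding, not
    rendering, can be checked directly. A minimal sketch with Python's
    standard `unicodedata` module, showing that the sequence is three code
    points and that NFC leaves it untouched (no precomposed character
    exists to compose it into):

```python
import unicodedata

# The proposed letter as a generative sequence: GHE followed by
# COMBINING SHORT STROKE OVERLAY and COMBINING PALATALIZED HOOK BELOW.
s = "\u0413\u0335\u0321"

# NFC leaves the sequence as-is: there is no precomposed character,
# so no canonical composition applies.
assert unicodedata.normalize("NFC", s) == s
assert len(s) == 3

for ch in s:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

    Whether those three code points are drawn with one precomposed glyph
    or three overlaid ones is entirely up to the font.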

    > Personally I am a fan of precomposed glyphs (as people have been since
    > the dawn of printing). They are problematic for our users, so if we can
    > limit the problem at least by not going for the overlays, that's something.
    >> There are many other examples in Arabic.
    > Which is a completely different thing. I disagree.

    Not a different thing, not when you keep in mind that char sequence !=
    glyph sequence.

    >> Had we chosen the same mechanism for Arabic that we did for Latin (eg
    >> define common characters as precompositions, and resolution to those
    >> in NFC, but also supply generative mechanisms for others), then
    >> minority writing systems using Arabic wouldn't have to wait for years
    >> to have characters encoded for them.
    > I disagree. What I do wish is that normalization hadn't been locked down
    > before Africa's needs were dealt with.

    As to the purported premature lock-down: it's a moot point, but had we
    not locked down NFC, it would not have been tenable for anyone to use
    it. (Think of it as like code point numbers. If you don't fix the code
    point of a new character, but just make it 'tentative', nobody will
    implement it; it might as well be in the PUA.) That would have had bad
    consequences for security and any other processing of character
    equivalence through a wide variety of dependent technologies.
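
    A small illustration of what a fixed NFC buys: because the
    composition tables are stable, every conforming implementation resolves
    equivalent spellings to the same answer, which is what equivalence
    checks (and the security comparisons built on them) depend on. A
    sketch using Python's `unicodedata`:

```python
import unicodedata

# Two spellings of the same letter: precomposed A-grave vs.
# base letter A plus COMBINING GRAVE ACCENT.
precomposed = "\u00C0"
decomposed = "A\u0300"

# With NFC locked down, both resolve deterministically to the
# precomposed form, everywhere, forever.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFC", precomposed) == precomposed
```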

    And simply because something is a precomposed character doesn't make it
    automagically supported by vendors. It is, in fact, *faster* for vendors
    to support sequences by means of precomposed glyphs in fonts, rather
    than wait for a precomposed character to be encoded.

    The sequence <U+0413, U+0335, U+0321> could have been supported *years
    ago*, instead of waiting for the long process of encoding.

    > Now, thank goodness, we have
    > "named sequences" which will guide font developers, and there will, I
    > promise you, be a good many African named sequences standardized to give
    > font developers the guidance African users need them to have.

    I think we are in agreement on named sequences; they should give
    guidance to font developers as to which char sequences may need a
    precomposed glyph.

    >> Moreover, we would have avoided security issues with these kinds of
    >> characters at the same time. See the examples in
    > Um, well, the security issues are your bugaboo, and they are restricted
    > to a narrow range of activity vis à vis the UCS.

    People's cavalier attitudes towards security fade the first time they
    (or a relative or friend) are swindled due to security problems. The
    goal is to get structure in place to prevent the problems before they
    happen. Levees are boring too, until they fail.

    >>> No, but it's a problem, because font guys usually precompose, and only
    >>> precomposed glyphs are **guaranteed** 'safe' for good, consistent
    >>> typography.
    >> As you well know, what is a precomposed glyph in a font is orthogonal
    >> to what is a precomposed character in Unicode. For example, a font can
    >> have a precomposed glyph for A-with-macron-and-grave
    >> while it is represented in Unicode by <U+0100 U+0300>. (This is one of
    >> many listed in
    > The problem (if you haven't been paying attention) is that a lot of
    > people have precomposed requirements that aren't met by precomposed
    > glyphs because font guys don't know what to draw. Europe is lucky; all
    > the important letters are precomposed. Africa is unlucky; the 19 million
    > Yoruba speakers do NOT have ANY support for their letters from ANY of
    > the three main computer platforms (Windows, Mac, Linux).

    That again is disconnected from the character encoding. Just because
    something is in Unicode as a precomposed character is no guarantee that
    any particular vendor will add a corresponding glyph to their font.
    Adding a character to Unicode does not magically make fonts for it.
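
    The A-with-macron-and-grave example above makes this concrete: there is
    no precomposed Unicode character for that combination, yet fonts can
    and do draw it with a single glyph. A sketch with `unicodedata`:

```python
import unicodedata

# A WITH MACRON (U+0100) followed by COMBINING GRAVE ACCENT (U+0300):
# no precomposed character exists for this combination, so NFC keeps
# the two-code-point sequence exactly as-is.
s = "\u0100\u0300"
assert unicodedata.normalize("NFC", s) == s
assert len(s) == 2

# A font is still free to map this sequence to one precomposed glyph;
# nothing in the encoding either prevents or guarantees that.
```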

    If you really want to see this addressed, the best way is to contribute
    to the NamedSequences listings the sequences needed for minority
    languages.

    >>> Mark, we are a lo-o-o-ng way from user-tailorable collation on ANY
    >>> platform.
    >> I didn't say 'user-tailorable', I said 'language-specific tailorings'.
    >> These are two very different things. *All* significant modern
    >> platforms offer language-specific tailorings.
    > For a very very very very very small number of languages. What do we do
    > about that?

    First, it's not particularly useful to look at raw number of possible
    languages, past and present; it doesn't really matter to many people how
    Old Italic sorts. If you measure language coverage by the proportion of
    text on, say, the Internet, then the CLDR coverage is very large,
    extending down to languages that account for only 0.02% of the world's
    online population.

    Second, your claim that it is a small number (I won't repeat the very's)
    depends on an assertion that the UCA doesn't handle those languages out
    of the box. It would be interesting to see your count of which languages
    those are, and how you arrived at that figure.

    The consortium has a mechanism for language-specific collations in CLDR.
    You and anyone else are free to contribute collation sequences for
    different languages. It does take some work to get the right
    specification, but if you care about some particular languages, you can
    make a difference.

    >> As to the orthogonal issue of user-tailorable collation: certainly the
    >> technology is available to customize locales on the user level. For
    >> example:
    >> 1. Go to
    >> 2. In the custom rules box, type (or copy & paste):
    >> & c < b <<< B
    >> & everyone < Everson
    >> 3. In the source box, add a few strings, like:
    >> Everson
    >> everyone
    >> Everyone
    >> 4. Click on the Sort button. You'll see your desired ordering in the
    >> Collated box.
    > For a start the default collation orders everson before Everson and god
    > before God, which is not preferable. The English alphabet is always
    > presented Aa Bb Cc not aA bB cC (watch the Simpsons to see) and so this
    > is A Bad Thing. When I click in English, I get the same thing, and this
    > is NOT what Oxford practice specifies. Then when I click on Ireland or
    > the UK it is still wrong.

    1. English (like many other languages) has no absolute requirement on
    the order of case variants; different sources and different
    dictionaries disagree (sadly, the Simpsons might not count as
    authoritative ;-).

    2. The mechanisms are there to handle this in CLDR. If you want to see a
    demo, do the same thing as above, but in the Options list under the
    second item, choose "Force Uppercase First". (This can also be
    incorporated into the rules for specific locales.)
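
    To show the idea behind "Force Uppercase First" (this is a toy sketch,
    not ICU; the function name `uppercase_first_key` is mine, and it
    assumes simple one-cased ASCII words): sort primarily by the
    case-folded word, then break ties by putting uppercase-initial
    variants first.

```python
# Toy sketch of uppercase-first ordering: primary key is the
# case-folded word, secondary key puts uppercase-initial forms first.
def uppercase_first_key(word):
    return (word.lower(), 0 if word[:1].isupper() else 1)

words = ["everson", "God", "god", "Everson"]
print(sorted(words, key=uppercase_first_key))
# → ['Everson', 'everson', 'God', 'god']
```

    Real collation tailorings do this at the tertiary level, so case
    order is adjustable without disturbing the alphabetic ordering.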

    > I am not very happy with CLDR in this regard.

    File a bug. Ask to be on the agenda for a meeting. If you can argue
    persuasively that upper before lower is more customary in the UK and IE,
    then I'm sure the committee would make the (one-line) change.

    And even if it didn't, the committee has been discussing having locale
    ID variants for different collation settings. Those would allow for
    easily specifying desired variants.

    >> However, collations are very tricky to specify correctly, because of
    >> all the issues described in
    >>, so it is no
    >> surprise to me that platforms don't choose to offer this as a
    >> user-level option.
    > I agree with you about that.

    This archive was generated by hypermail 2.1.5 : Sat Sep 10 2005 - 18:19:04 CDT