Re: Languages supported by UTF8 and UTF16

From: Mark Davis (mark.davis@icu-project.org)
Date: Sat Sep 10 2005 - 15:56:10 CDT

    Michael Everson wrote:
    > At 12:23 -0700 2005-09-10, Mark Davis wrote:
    >
    >> 1. It is not true of "all living languages"; there are some minority
    >> languages that need additional characters. (Part of the problem here
    >> is that we didn't apply the generative model consistently enough; had
    >> we done that, many of these characters could be represented right now
    >> by sequences.)
    >
    >
    > Well you'd have to give examples of what you mean by THAT, Mark.

    No problem. One example: the SIL proposed
    04FA CYRILLIC CAPITAL LETTER GHE WITH STROKE AND HOOK
    could be represented by <U+0413, U+0335, U+0321>.
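A minimal Python sketch of the sequence quoted above, using only the standard `unicodedata` module. The helper name `describe` is made up for illustration. It lists each code point with its character name and canonical combining class, then checks that NFC leaves the sequence untouched, since no encoded character decomposes to it.

```python
import unicodedata

# The sequence quoted above: GHE + short stroke overlay + palatalized hook below.
GHE_STROKE_HOOK = "\u0413\u0335\u0321"


def describe(text):
    """Return (code point, name, canonical combining class) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in text
    ]


for row in describe(GHE_STROKE_HOOK):
    print(row)

# No character has this sequence as its canonical decomposition,
# so NFC should leave it as three code points.
is_stable = unicodedata.normalize("NFC", GHE_STROKE_HOOK) == GHE_STROKE_HOOK
print(is_stable)
```

The combining classes matter because normalization reorders marks by class; here the overlay mark already precedes the below-attached mark, so the sequence is in canonical order as written.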

    There are many other examples in Arabic. Had we chosen the same
    mechanism for Arabic that we did for Latin (e.g., define common characters
    as precompositions, with resolution to those in NFC, but also supply
    generative mechanisms for others), then minority writing systems using
    Arabic wouldn't have to wait years to have characters encoded for them.
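The Latin mechanism described above can be seen with Python's standard `unicodedata` module. This is only an illustrative sketch: the two sample strings and the helper `nfc_length` are chosen for the example and are not from the thread. A common combination resolves to its precomposed character under NFC, while a rarer one simply stays a base letter plus combining mark.

```python
import unicodedata


def nfc_length(text):
    """Number of code points left after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))


common = "e\u0301"  # e + COMBINING ACUTE ACCENT; precomposed U+00E9 exists
rare = "q\u0301"    # q + COMBINING ACUTE ACCENT; no precomposed character

print(nfc_length(common))  # composes to a single code point
print(nfc_length(rare))    # stays as a two-code-point sequence
```

Both strings are valid, interchangeable text today; the second never had to wait for a new character to be encoded.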

    Moreover, we would have avoided security issues with these kinds of
    characters at the same time. See the examples in
    http://www.unicode.org/reports/tr36/#Single_Script_Spoofing
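To make the security concern concrete, here is a hedged sketch comparing an atomic character with a look-alike combining sequence. It assumes a Python whose Unicode database includes U+04FA, which was encoded after this message was written; the helper `equivalent` is a made-up name.

```python
import unicodedata

PRECOMPOSED = "\u04FA"            # CYRILLIC CAPITAL LETTER GHE WITH STROKE AND HOOK
SEQUENCE = "\u0413\u0335\u0321"   # GHE + stroke overlay + hook below


def equivalent(a, b, form="NFC"):
    """True if the two strings are identical after the given normalization."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


results = {
    form: equivalent(PRECOMPOSED, SEQUENCE, form)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
print(results)

# Contrast: a Latin precomposition and its sequence do normalize to the same string.
print(equivalent("\u00e9", "e\u0301"))
```

Because the atomic character has no decomposition, every form should report the two strings as different, even though they may render alike. That is the kind of pair the spoofing discussion in TR36 is concerned with.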

    >
    >> 3. The 'however' is misleading. It is not a deficiency that some of
    >> what users may perceive of as separate characters are encoded by
    >> sequences.
    >
    >
    > No, but it's a problem, because font guys usually precompose, and only
    > precomposed glyphs are **guaranteed** 'safe' for good, consistent
    > typography.

    As you well know, what is a precomposed glyph in a font is orthogonal to
    what is a precomposed character in Unicode. For example, a font can have
    a precomposed glyph for

    LATIN CAPITAL LETTER A WITH MACRON AND GRAVE

    while it is represented in Unicode by <U+0100, U+0300>. (This is one of
    many listed in http://unicode.org/Public/UNIDATA/NamedSequences.txt)
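A short sketch of the character side of that distinction, using Python's standard `unicodedata` module (the variable names are illustrative). Starting from the fully decomposed form, NFC composes A with the macron but finds no single code point that also absorbs the grave, so two code points remain regardless of what glyph a font supplies.

```python
import unicodedata

# A + COMBINING MACRON + COMBINING GRAVE ACCENT, fully decomposed.
decomposed = "A\u0304\u0300"

nfc = unicodedata.normalize("NFC", decomposed)
code_points = [f"U+{ord(ch):04X}" for ch in nfc]
print(code_points)
```

A font is free to map that two-code-point sequence to one precomposed glyph; nothing in the encoding prevents it.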

    >
    >> 4. Also not a deficiency. If Unicode attempted to encode all
    >> typographic constructs, it would be a horrible mess. It provides a
    >> foundation for other mechanisms (CSS, etc) to build upon; they can
    >> provide typographical constructs. And by 'orthographic constructs',
    >> you'd have to provide examples of what you mean.
    >
    >
    > What's a typographical construct, Mark?

    I didn't introduce the term to the discussion: Jukka did. My reading of
    it was italic, bold, superscript, underline, etc. If he means something
    different than that, he could explain and provide an example.

    >
    >> > Some of the properties of characters as defined by the
    >> > Unicode Standard do not correspond to their behavior in different
    >> > languages.
    >>
    >> 5. Again, you'd have to provide examples to clarify what you mean.
    >
    >
    > He probably means something like Russian-vs-Serbian italic small TE.

    That's insufficient. The original statement for which examples are
    needed is "properties of characters as defined by the Unicode Standard
    do not correspond to their behavior". An example of that needs to
    describe what the purportedly incorrect properties of this character are.

    >
    >> What the Unicode Consortium *does* provide is a mechanism for
    >> providing language-specific tailorings of specified behavior. Look at
    >> collation, for example, where the Unicode Consortium supplies a
    >> default basis for ordering in the UCA, but then also provides a
    >> repository of language-based tailorings of the UCA in the CLDR.
    >
    >
    > Mark, we are a lo-o-o-ng way from user-tailorable collation on ANY
    > platform.

    I didn't say 'user-tailorable', I said 'language-specific tailorings'.
    These are two very different things. *All* significant modern platforms
    offer language-specific tailorings.

    As to the orthogonal issue of user-tailorable collation: certainly the
    technology is available to customize locales at the user level. For example:

    1. Go to
    http://www-950.ibm.com/software/globalization/icu/demo/locales/en/?_=root&x=col

    2. In the custom rules box, type (or copy & paste):
    & c < b <<< B
    & everyone < Everson

    3. In the source box, add a few strings, like:
    Everson
    everyone
    Everyone

    4. Click on the Sort button. You'll see your desired ordering in the
    Collated box.
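For readers without access to the demo, the effect of those two rules can be imitated with a toy sort key in plain Python. This is not ICU and not the UCA: `TAILORED_ALPHABET`, `PRIMARY`, and `sort_key` are invented for the sketch, it handles only ASCII letters, and it treats "Everson" as a whole-string special case rather than a real contraction. It only illustrates the idea of a primary level (letters) with a weaker case level beneath it.

```python
# Toy primary order: the plain alphabet with 'b' moved to just after 'c',
# mimicking the rule "& c < b".
TAILORED_ALPHABET = "acbdefghijklmnopqrstuvwxyz"
# Even weights leave room to slot one item in directly after another.
PRIMARY = {ch: 2 * i for i, ch in enumerate(TAILORED_ALPHABET)}


def sort_key(s):
    """Two-level key: letters first (primary), then case (stand-in for tertiary)."""
    if s == "Everson":
        # Mimic "& everyone < Everson": same primary weights as "everyone",
        # with the last weight nudged up so the string sorts just after it.
        primary = [PRIMARY[ch] for ch in "everyone"]
        primary[-1] += 1
    else:
        primary = [PRIMARY[ch] for ch in s.lower()]
    # Lowercase before uppercase when the letters tie, as in "b <<< B".
    case = [1 if ch.isupper() else 0 for ch in s]
    return (primary, case)


words = ["Everson", "everyone", "Everyone"]
print(sorted(words, key=sort_key))
print(sorted(["b", "c", "B", "a", "d"], key=sort_key))
```

Under this toy model "Everson" lands after both spellings of "everyone", and "b"/"B" land after "c", which is the ordering the custom rules are meant to produce.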

    However, collations are very tricky to specify correctly, because of all
    the issues described in
    http://www.unicode.org/reports/tr10/#Introduction, so it is no surprise
    to me that platforms don't choose to offer this as a user-level option.


