Re: Languages supported by UTF8 and UTF16

From: Mark Davis (mark.davis@icu-project.org)
Date: Sat Sep 10 2005 - 15:56:10 CDT

Next message: Michael Everson: "Re: Languages supported by UTF8 and UTF16"

Previous message: Michael Everson: "Re: Languages supported by UTF8 and UTF16"
In reply to: Michael Everson: "Re: Languages supported by UTF8 and UTF16"
Next in thread: Michael Everson: "Re: Languages supported by UTF8 and UTF16"
Reply: Michael Everson: "Re: Languages supported by UTF8 and UTF16"

Michael Everson wrote:
> At 12:23 -0700 2005-09-10, Mark Davis wrote:
>
>> 1. It is not true of "all living languages"; there are some minority
>> languages that need additional characters. (Part of the problem here
>> is that we didn't apply the generative model consistently enough; had
>> we done that, many of these characters could be represented right now
>> by sequences.)
>
>
> Well you'd have to give examples of what you mean by THAT, Mark.

No problem. One example: the SIL proposed
04FA CYRILLIC CAPITAL LETTER GHE WITH STROKE AND HOOK
could be represented by <U+0413, U+0335, U+0321>.
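
As an illustration only (not part of the original message), the sequence above can be inspected with Python's standard unicodedata module. The variable names are arbitrary. The sketch assumes nothing beyond the three code points quoted above, and shows that NFC leaves the sequence unchanged because no precomposed form is defined for it.

```python
import unicodedata

# The generative sequence from the message: base letter plus two combining marks.
sequence = "\u0413\u0335\u0321"

# Character names and canonical combining classes of each code point.
names = [unicodedata.name(ch) for ch in sequence]
classes = [unicodedata.combining(ch) for ch in sequence]

# NFC has nothing to compose here, so the sequence comes back unchanged.
nfc = unicodedata.normalize("NFC", sequence)

for ch, name, ccc in zip(sequence, names, classes):
    print(f"U+{ord(ch):04X} ccc={ccc:3d} {name}")
print("stable under NFC:", nfc == sequence)
```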

There are many other examples in Arabic. Had we chosen the same
mechanism for Arabic that we did for Latin (e.g., define common characters
as precompositions, with resolution to those in NFC, but also supply
generative mechanisms for others), then minority writing systems using
Arabic wouldn't have to wait for years to have characters encoded for them.

Moreover, we would have avoided security issues with these kinds of
characters at the same time. See the examples in
http://www.unicode.org/reports/tr36/#Single_Script_Spoofing

>
>> 3. The 'however' is misleading. It is not a deficiency that some of
>> what users may perceive of as separate characters are encoded by
>> sequences.
>
>
> No, but it's a problem, because font guys usually precompose, and only
> precomposed glyphs are **guaranteed** 'safe' for good, consistent
> typography.

As you well know, what is a precomposed glyph in a font is orthogonal to
what is a precomposed character in Unicode. For example, a font can have
a precomposed glyph for

LATIN CAPITAL LETTER A WITH MACRON AND GRAVE

while it is represented in Unicode by <U+0100, U+0300>. (This is one of
many listed in http://unicode.org/Public/UNIDATA/NamedSequences.txt)
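
A small sketch of the same point on the character side, again using only Python's standard unicodedata module (the variable names are arbitrary): normalizing A + macron + grave to NFC yields the two-code-point sequence quoted above, not a single character. Whether a font draws it with one precomposed glyph is a separate matter that this code cannot show.

```python
import unicodedata

# Fully decomposed input: A, COMBINING MACRON, COMBINING GRAVE ACCENT.
decomposed = "A\u0304\u0300"

# NFC composes A + macron into U+0100, but no single code point exists
# for the macron-and-grave combination, so the grave stays separate.
composed = unicodedata.normalize("NFC", decomposed)

print([f"U+{ord(ch):04X}" for ch in composed])
print([unicodedata.name(ch) for ch in composed])
```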

>
>> 4. Also not a deficiency. If Unicode attempted to encode all
>> typographic constructs, it would be a horrible mess. It provides a
>> foundation for other mechanisms (CSS, etc) to build upon; they can
>> provide typographical constructs. And by 'orthographic constructs',
>> you'd have to provide examples of what you mean.
>
>
> What's a typographical construct, Mark?

I didn't introduce the term to the discussion: Jukka did. My reading of
it was italic, bold, superscript, underline, etc. If he means something
different than that, he could explain and provide an example.

>
>> > Some of the properties of characters as defined by the
>> > Unicode Standard do not correspond to their behavior in different
>> > languages.
>>
>> 5. Again, you'd have to provide examples to clarify what you mean.
>
>
> He probably means something like Russian-vs-Serbian italic small TE.

That's insufficient. The original statement for which examples are
needed is "properties of characters as defined by the Unicode Standard
do not correspond to their behavior". An example of that needs to
describe what the purportedly incorrect properties of this character are.

>
>> What the Unicode Consortium *does* provide is a mechanism for
>> providing language-specific tailorings of specified behavior. Look at
>> collation, for example, where the Unicode Consortium supplies a
>> default basis for ordering in the UCA, but then also provides a
>> repository of language-based tailorings of the UCA in the CLDR.
>
>
> Mark, we are a lo-o-o-ng way from user-tailorable collation on ANY
> platform.

I didn't say 'user-tailorable', I said 'language-specific tailorings'.
These are two very different things. *All* significant modern platforms
offer language-specific tailorings.

As to the orthogonal issue of user-tailorable collation: certainly the
technology is available to customize locales on the user level. For example:

1. Go to
http://www-950.ibm.com/software/globalization/icu/demo/locales/en/?_=root&x=col

2. In the custom rules box, type (or copy & paste):
& c < b <<< B
& everyone < Everson

3. In the source box, add a few strings, like:
Everson
everyone
Everyone

4. Click on the Sort button. You'll see your desired ordering in the
Collated box.
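
For readers without access to the demo, the effect of the first rule (& c < b <<< B) can be roughly imitated in plain Python. This is a toy sketch, not ICU and not the UCA: the names ALPHABET, PRIMARY and toy_sort_key are made up for illustration, it handles only ASCII letters, and it does not model the second rule (& everyone < Everson), which needs contraction support.

```python
import string

# Primary order: the usual a..z, except 'b' is moved to sit right after 'c'.
ALPHABET = [ch for ch in string.ascii_lowercase if ch != "b"]
ALPHABET.insert(ALPHABET.index("c") + 1, "b")
PRIMARY = {ch: i for i, ch in enumerate(ALPHABET)}

def toy_sort_key(text):
    """Two-level key: letter identity first, then case (lowercase first).

    Raises KeyError for anything other than ASCII letters.
    """
    primary = tuple(PRIMARY[ch.lower()] for ch in text)
    case = tuple(ch.isupper() for ch in text)
    return (primary, case)

words = ["B", "b", "C", "c", "a", "cab", "bac"]
ordered = sorted(words, key=toy_sort_key)
print(ordered)  # ['a', 'c', 'C', 'cab', 'b', 'B', 'bac']
```

Case is compared only after all letters have been compared, which is the general idea behind multi-level collation keys. A real implementation has many more levels and special cases.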

However, collations are very tricky to specify correctly, because of all
the issues described in
http://www.unicode.org/reports/tr10/#Introduction, so it is no surprise
to me that platforms don't choose to offer this as a user-level option.


This archive was generated by hypermail 2.1.5 : Sat Sep 10 2005 - 15:58:41 CDT