Re: Mixed up priorities - slovak _is_ supported by unicode and software

From: schererm@us.ibm.com
Date: Fri Oct 22 1999 - 15:07:12 EDT


hi,

i just wanted to present some easy evidence that slovak is supported by unicode
and associated software.

consider, among many other libraries and languages, Java and the IBM Classes for
Unicode.
i mention these because there is an easy way to play with their language
support:

go to http://www10.software.ibm.com/developerworks/opensource/icu/localeexplorer
and select "Slovak" on that page. then scroll down to the Collation rules. there
is a "demo" button on the right part of that section which leads you to a page
where you can type in strings and have them sorted according to the selected
locale.

the data that drives this is shared (synchronized) between Sun's JDK and our
IBM Classes for Unicode (ICU); the cgi program uses ICU.
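
for those who would rather see the idea than click through the demo, here is a
rough sketch in python of what treating "ch" as one sorting unit means. it is
not the JDK/ICU data or algorithm: the alphabet is a simplified assumption
(a-z plus "ch" after "h", no accented letters), and the names ALPHABET,
collation_elements and slovak_key are made up for this illustration.

```python
import string

# simplified, hypothetical alphabet: a-z with the digraph "ch" placed after "h".
# real slovak collation data also covers accented letters; that is omitted here.
ALPHABET = list(string.ascii_lowercase)
ALPHABET.insert(ALPHABET.index("h") + 1, "ch")
RANK = {element: i for i, element in enumerate(ALPHABET)}


def collation_elements(word):
    """split a word into sorting units, treating "ch" as a single unit."""
    word = word.lower()
    elements = []
    i = 0
    while i < len(word):
        if word.startswith("ch", i):
            elements.append("ch")
            i += 2
        else:
            elements.append(word[i])
            i += 1
    return elements


def slovak_key(word):
    """sort key: the rank of each unit; unknown characters sort last."""
    return [RANK.get(e, len(ALPHABET)) for e in collation_elements(word)]


words = ["chata", "cesta", "hora", "ihla", "cukor"]
print(sorted(words))                  # plain code point order: "chata" lands among the c's
print(sorted(words, key=slovak_key))  # "chata" lands after "hora" and before "ihla"
```

the text itself still stores "ch" as the two code points U+0063 U+0068; only
the comparison changes.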

please tell us if there is anything wrong with this
(mailto:icu@www10.software.ibm.com, or file a bug report on the data with
http://java.sun.com/).

to feed the fire some more, i guess that this discussion really hinges on
what unicode encodes and what a character is in this context. from the book and
from the discussions on this list, i believe that unicode attempts to encode a
more or less minimal set of "plain text elements", giving each one a "code
point", just far enough so that such plain text can be legibly rendered as well
as automatically processed.
it is a fact, and a design feature, that multiple code points or text elements
are often needed to form a "user character", as it has been called on this
list before. for many languages, a base character and one or more combining
characters form a single "user character".
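
to make the base-plus-combining idea concrete, here is a small python sketch
using the standard unicodedata module. the helper rough_user_characters is a
made-up name and only a crude approximation: it glues combining marks onto the
preceding base character and knows nothing about language-specific units like
the slovak "ch".

```python
import unicodedata


def describe(text):
    """list each code point in text with its number and unicode name."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in text]


def rough_user_characters(text):
    """group each base code point with the combining marks that follow it.

    crude approximation only: a code point with a non-zero combining class
    is attached to the previous group.
    """
    groups = []
    for c in text:
        if groups and unicodedata.combining(c) != 0:
            groups[-1] += c
        else:
            groups.append(c)
    return groups


a_grave = "a\u0300"  # base letter a followed by COMBINING GRAVE ACCENT
print(describe(a_grave))
print(len(a_grave))                         # 2 code points
print(len(rough_user_characters(a_grave)))  # 1 "user character"
```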

we all know that this minimalistic design was compromised for compatibility with
existing codepages like the iso 8859 series, but we also know that this causes
some grief and is dealt with especially in the normalization forms. without this
compromise, unicode might well still be the dream of some software engineers over
lunch. still, the design is valid and saves a lot of code point assignments and
computing power - by allowing a more compact encoding and by limiting the
normalization processing.
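
as a sketch of how the normalization forms deal with the precomposed
characters inherited from the older codepages, here is a python snippet using
unicodedata.normalize. the helper same_text is a hypothetical name, and
comparing in NFC is just one reasonable choice for the illustration.

```python
import unicodedata


def same_text(a, b):
    """compare two strings after bringing both to the composed form NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u00E0"  # single code point, as in iso 8859-1
decomposed = "a\u0300"  # base letter plus combining grave accent

print(precomposed == decomposed)           # False: the code points differ
print(same_text(precomposed, decomposed))  # True: equivalent after normalization

# the same holds for a slovak letter: c with caron
print(unicodedata.normalize("NFD", "\u010D") == "c\u030C")  # True
```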

sincerely,

markus

Markus Scherer IBM Cupertino, CA schererm@us.ibm.com

Ashley Yakeley <ashley@semantic.org> on 99-10-21 18:31:40

To: "Unicode List" <unicode@unicode.org>
cc:
Subject: Re: Mixed up priorities


At 1999-10-21 17:54, G. Adam Stanislav wrote:

>Yes, we can type "ch" using the GLYPHS "c" and "h", but Unicode prides
>itself in being a character encoding, not a glyph encoding.

Yes, but it does not in general use one codepoint per character.

>To us, "ch" is a character. Period.

That's correct. The single Slovak character 'ch' is represented by a
sequence of two codepoints U+0063 U+0068, just as the single French
character 'à' (a grave) is represented by a sequence of two codepoints,
U+0061 U+0300.

If you need to specify that it's a particularly Slovak 'ch', you can use
Plane 14 language tags.

>In our dictionaries the "ch" follows the "h" and
>precedes the "i". We would never dream of looking for "ch" after "cg" and
>before "ci".

Up to you to write a sensible Slovak sorting algorithm. That's not
Unicode's job.

--
Ashley Yakeley, Seattle WA


