Re: Unicode certification - was RE: Dublin Conference:

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jul 25 2002 - 20:48:21 EDT


Dave said:

> Let me go a bit deeper in what I
> mean by compliance levels.
>

Part of the problem that one gets into when talking about "Unicode
certification" is the following:

  A. "Unicode" is not a product which can be tested.

  B. "Unicode" is not a feature which can be tested for in a product.

The Unicode Standard is a complicated character encoding standard
that people implement to, in hopes of creating interoperable
character processing for their IT products.

Because it is complicated, nobody can implement to all of it at
once, and likely nobody can implement all of it in any case.

Because the legacy of character sets is complicated, deep, and
persistent, nobody can make their software turn on a
dime, and all successful software, systems, and protocols will
continue to contain transitional features for a long time to
come.

Because the Unicode Standard is a *fundamental* standard, having
to do with the encoding of characters rather than with a high-level
protocol or a discrete user feature, its impact tends to be
distributed, often in invisible ways, throughout the infrastructure
of systems, rather than appearing -- at least immediately -- as end user
features.

In an almost literal sense, implementations of the Unicode Standard
have to *infiltrate* systems, rather than simply being added on.
They have to permeate the interstices before higher-level pieces can
start depending on their functionality being available. How does one
go about certifying an infiltration?

And the devil is in the details. Looking a bit at your suggestions,
for example:

> 1. Unicode support is implemented and allows for same functionality as
> with any other legacy encoding system. Detail: up to which Unicode
> release this support is implemented.

What does "Unicode support is implemented" mean, anyway? This
kind of begs the question. I think what you might be getting at
here runs along the lines of "UTF-8 is supported as just another
co-equal multibyte character set, just like the other legacy character
sets." At one level, this is quite easy, and actually pretty
trivial for many "applications", if they are just shoving strings
around. But once you start to look at the semantics of individual
characters, then the conformance requirements open up a whole
can of worms. You have to start talking about what subset of the
Unicode repertoire you support, for what types of processing, and
which sets of properties you pay attention to. If the "same
functionality as with any other legacy encoding system" includes
being able to tolower and toupper a string, for example, does that mean you
can do the same for a UTF-8 string? -- for all Unicode characters?
-- just using the one-to-one mappings of UnicodeData.txt, or also
taking SpecialCasing.txt into account? And so on. Once you
start talking about "support" of any single Unicode character,
you inevitably slide over to having to consider "features" and
enumerating them.
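
To make that last distinction concrete, here is a minimal sketch
(Java, chosen purely for illustration -- any environment with Unicode
case mappings would do). Character.toUpperCase() applies only the
one-to-one mappings of UnicodeData.txt, while String.toUpperCase()
also applies the one-to-many and locale-sensitive mappings of
SpecialCasing.txt:

    import java.util.Locale;

    public class CaseDemo {
        public static void main(String[] args) {
            // U+00DF LATIN SMALL LETTER SHARP S has no one-to-one
            // uppercase mapping in UnicodeData.txt, so the
            // char-based API returns it unchanged.
            System.out.println(Character.toUpperCase('\u00DF')); // ß

            // The full mapping from SpecialCasing.txt expands it.
            System.out.println(
                "stra\u00DFe".toUpperCase(Locale.GERMAN)); // STRASSE

            // SpecialCasing.txt is also locale-sensitive: in
            // Turkish, "i" uppercases to U+0130 (dotted capital I).
            System.out.println(
                "i".toUpperCase(new Locale("tr"))); // İ
        }
    }

A product could truthfully claim to "uppercase UTF-8 strings" while
exhibiting either behavior -- which is exactly why "support" has to
be pinned down feature by feature.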

> 2. Additional Unicode support is implemented and offers the
> following list of features beyond legacy encodings: [list of features],
> for example ICU is fully implemented.

And how are the "features" to be enumerated? The Unicode Standard
itself doesn't give a list of features of support, so somebody
is going to have to construct such a list. And the problem is that
such lists of features tend to vary widely depending on what
kind of product one is talking about. Take a look at
http://www.unicode.org/unicode/onlinedat/products.html
The "features" relevant to "Unicode support" in a database tend
to be rather different from what you would look for in a font,
or a text editing tool, or in a programming library, for example.
SlickEdit 7.0.1 claims support for "Unicode Level 1 regular
expression support" (cf. UTR #18), but what relevance would that
be to evaluating a "Unicode" font?
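
For reference, "Level 1" in UTR #18 terms means, roughly, that the
regular expression engine matches Unicode code points and properties
rather than raw bytes. A minimal sketch of the kind of feature being
claimed (Java regex here, purely as an illustration; it says nothing
about SlickEdit's own engine):

    import java.util.regex.Pattern;

    public class RegexDemo {
        public static void main(String[] args) {
            // The property class \p{L} matches letters from any
            // script, where a byte-oriented engine would match
            // only [a-zA-Z].
            Pattern letters = Pattern.compile("\\p{L}+");
            // Greek alpha, Cyrillic be, Latin c: all letters.
            System.out.println(
                letters.matcher("\u03B1\u0431c").matches()); // true
        }
    }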

Not that having matrices of feature support wouldn't be useful
for people evaluating products! Of course they would. But we aren't
talking about something which could just be approached *from* the
Unicode Standard point of view as a checklist applied to
arbitrary products -- the field is too vast. We are really talking
about evaluating the *entire* range of information technology,
from databases and repositories to editors to operating systems
to fonts to browsers to online games to other standards and protocols.
You really have to approach things from the *other* end. In a subarea
of information technology, ask yourself how support for
the Unicode Standard could reasonably manifest in this type of
product (or standard, or whatever). Then do a typology of features
relevant to that product area involving manifestation of the Unicode
Standard, and then apply that typology in a matrix of judgements
regarding feature applicability and level of support for the products
in that area. *Then* and only then, in my opinion, will you start
to get meaningful results and meaningful comparisons that really
help end users regarding product purchase decisions.

The particular example you cite, "ICU is fully implemented", isn't
a particularly good one. ICU itself is a very extensive library
which *exports* Unicode-related functionality via an extensive
set of APIs. One could typologize its features and test them
for how well they implement the Unicode Standard. And then you
could set that up against any other similar library of
internationalization functionality related to Unicode. But it
doesn't mean much to say that *another* product has fully implemented
ICU (presumably by using it). A product may make use of ICU without
surfacing any feature to an end-user that is obviously connected
to Unicode in any way. For example, I could use ICU merely to format
currency strings for a spreadsheet that only supports ISO 8859-1 for
its characters. And one doesn't test either the internationalization
functionality of a product or its support of the Unicode Standard by
checking to see that once it has linked the ICU library it calls 5%,
45%, or 100% of the ICU APIs.
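
To make that scenario concrete, a minimal sketch using ICU4J (the
class names are ICU4J's; the ISO 8859-1-only spreadsheet is
hypothetical):

    import com.ibm.icu.text.NumberFormat;
    import com.ibm.icu.util.ULocale;

    public class CurrencyDemo {
        public static void main(String[] args) {
            // ICU used purely to format a currency string. For
            // en_US the result is plain ASCII, so an application
            // limited to ISO 8859-1 can use ICU internally without
            // ever surfacing a Unicode-visible feature.
            NumberFormat fmt =
                NumberFormat.getCurrencyInstance(ULocale.US);
            System.out.println(fmt.format(1234.56)); // $1,234.56
        }
    }

Every line of that program goes through ICU, yet nothing
Unicode-specific ever reaches the user.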

> 3. Full Unicode support is implemented - all characters can be
> processed, all glyphs are available, and rendering complies to all
> rules for each writing system. (I hope I used the correct terms here.)

This is the dream system. It is never really going to exist, since
even the operating system platforms, which have the greatest
incentive to get to this point, are going to draw the line at
trying to deal with all the typographic details needed for
full rendering of many historic scripts. Those will inevitably
involve add-on speciality products that focus on *particular*
areas in more detail. There are breadth specializations and
depth specializations -- and when you look at the entire scope
of the Unicode Standard, which after all is attempting to encode
all the characters of all the writing systems of all the history
of the world, you just cannot expect that any one software system will
ever do it all.

> Tbh, I am not sure where to draw a
> line between 2 & 3, I think it is a gray zone, rarely found today.

Actually, the greyer zone is between 1 & 2, in my opinion.

--Ken

P.S. I'm not picking on Dave's suggestions in particular -- I'm
just using them as a springboard for vamping on about what I think is
the tip of the iceberg regarding what "Unicode certification"
would actually involve.


