Re: MES instead of ISO 8859-nn

From: Markus Kuhn (kuhn@cs.purdue.edu)
Date: Thu Jul 03 1997 - 23:02:19 EDT


Antoine Leca wrote:
> Unicode is a set of characters, not glyphs, isn't it?
> So the lack of glyphs can't be used to say that NT isn't
> full-Unicode.

A full Unicode implementation must be able to adequately visually
represent any of the ~40000 characters and any combination of
those ~40000 characters (especially with regard to combining characters
and presentation forms). For a full Unicode implementation, you
will probably need many more glyphs than there are characters in the
set.

I do not see any implementation that comes even close to fully
implementing Unicode. Therefore, I feel that well-established subsets
are highly important, or Unicode will cause a lot of user frustration.

An example of a practical character-subset frustration is ISO 8859-1
versus Microsoft CP1252 (the Windows character set):

Today, I already have the problem that Web page authors using
MS-Windows use the CP1252 characters in the 0x80-0x9f range that
are not part of ISO 8859-1 and therefore are IN THEORY not allowed
in HTML. I can't see these characters on my fully HTML-conforming
system. Web authors who are unaware of what the proper subset
of their available character set is frequently and inappropriately
use the characters LEFT SINGLE QUOTATION MARK and RIGHT SINGLE
QUOTATION MARK (0x91 and 0x92 in CP1252, 0x2018 and 0x2019
in Unicode) where they really should use just QUOTATION
MARK (0x22) if the character set announced by HTTP for this
Web page is ISO 8859-1.
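
To make the mismatch concrete, here is a minimal sketch in Python
(Python and its standard codecs are chosen only for illustration;
nothing in this thread depends on them). It decodes the two bytes in
question under both character sets:

```python
# Bytes 0x91 and 0x92 as a Windows export filter might emit them.
raw = bytes([0x91, 0x92])

# Interpreted as CP1252, they are the single curly quotation marks.
as_cp1252 = raw.decode("cp1252")

# Interpreted as ISO 8859-1, the same bytes land in the C1 control
# range (0x80-0x9f), which contains no graphic characters.
as_latin1 = raw.decode("iso-8859-1")

print([f"U+{ord(c):04X}" for c in as_cp1252])  # ['U+2018', 'U+2019']
print([f"U+{ord(c):04X}" for c in as_latin1])  # ['U+0091', 'U+0092']
```

A receiver that honours the announced ISO 8859-1 label therefore has
nothing printable to show for these bytes.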

I expect problems like this to be many orders of magnitude worse
once Unicode starts to get widely used on the Web. The above
problem is at least well-defined: the people using the
0x80-0x9f characters in HTML are clearly wrong, and the HTML
specification leaves no doubt about this. The problem is just that
the authors of the HTML export filters of one very popular word
processor have ignored the issue (I won't mention names here).

However, once you say that all of Unicode is allowed and
every implementation just arbitrarily selects its own subset,
it will no longer be possible, even in theory, to determine
whether the sender or the receiver is responsible for a
messed-up document.

I prefer to see Unicode, because of its enormous size, not as a standard
that can simply be quoted in a specification to ensure
interoperability, but more as a framework, from which you have
to select a carefully defined subset and then use this as the
character set that you allow in your application.

> But if you agree upon a content (which is very likely), then
> you can presume both of you have means to view each possible
> glyphs of the content.

Sure?

Ok, say we agree on texts in the German language as the content.
You reason that ISO 8859-1 claims to support German completely,
so you implement only the characters < 0x0100 and claim that
your system supports Unicode for the German language. I, on the
other hand, know that German typography uses low opening quotation
marks, so I will obviously use DOUBLE LOW-9 QUOTATION MARK (0x201e)
in all German texts that I produce. I can already see the long
discussions that we will have in the de.* newsgroups about whether
the character 0x201e should be allowed in German postings or not,
since manufacturers will have different opinions about which
characters are necessary for the German market.

That is the sort of trouble that I expect with not strictly defined
subset implementations of Unicode and that is why I would love to
see well-established Unicode subset standards that we can easily
refer to.
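
The German disagreement can be sketched as a repertoire check. This is
only an illustration in Python: the two sets and the helper
outside_subset() are hypothetical names, and using U+201C as the
closing mark in the sample is an assumption added here, not something
stated above.

```python
# Hypothetical subset A: only the characters below 0x0100 (the ISO 8859-1 range).
LATIN1_ONLY = frozenset(chr(cp) for cp in range(0x0100))

# Hypothetical subset B: the same, plus the German low opening quotation
# mark (U+201E) and a closing mark (U+201C, assumed for this example).
GERMAN_TYPOGRAPHIC = LATIN1_ONLY | {"\u201e", "\u201c"}

def outside_subset(text, subset):
    """Return the sorted code points in text that the subset does not cover."""
    return sorted({f"U+{ord(c):04X}" for c in text if c not in subset})

sample = "\u201eGuten Tag\u201c"
print(outside_subset(sample, LATIN1_ONLY))         # ['U+201C', 'U+201E']
print(outside_subset(sample, GERMAN_TYPOGRAPHIC))  # []
```

Both implementers could claim "Unicode for German", yet only a named,
fixed repertoire lets either of them say which result is conforming.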

I am sure that we *will* see some convention with regard to
acceptable Unicode subsets soon, but I fear that these conventions
will have the form "whatever the latest MS-Windows implementation
happens to support in its most popular fonts". I'd prefer to have
a handful of useful subsets defined by ISO instead, to create
some vendor independence and allow long-term planning.

> I don't think giving publicity to MES is a good idea.
> MES is a concept for dealing with backward compatibility.
> Since it's bigger that 256, it is of no use on 8-bits
> transmission channels (UTF-8 is much better).

Technically, you will handle MES just like Unicode, as it is just a
strict Unicode subset. Of course I will use UTF-8 in order to
store my MES files on my Linux system, etc.
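
As a small illustration (in Python, chosen only for the example), a
character beyond the 8-bit range is stored in UTF-8 exactly as any
other Unicode text would be. The sample string is made up:

```python
# DOUBLE LOW-9 QUOTATION MARK (U+201E) followed by three ASCII letters.
text = "\u201eMES"

encoded = text.encode("utf-8")

# The non-ASCII character becomes three bytes; ASCII stays one byte each.
print(" ".join(f"{b:02x}" for b in encoded))  # e2 80 9e 4d 45 53

# Decoding restores the original text unchanged.
assert encoded.decode("utf-8") == text
```

Nothing in the encoding step needs to know which subset the text was
drawn from.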

> Another such example is the Windows Glyph List, WGL4, the
> subset Microsoft promotes and implements in its large fonts
> (it prints Latin 1,2,5, Greek and Cyrillic as most PC code
> pages). It is 652 characters large, and with respect to MES,
> it lacks the Ancient Greek (katharemiou) precomposed
> characters (those with spirits and various accents).

I am not familiar with WGL4, but it looks to me as if WGL4 and
MES are so close that they could probably be merged into the
ISO 15646 that I envision. The existence of WGL4 demonstrates that
I am not alone in my idea of a well-established <1000 character
Unicode subset.
 
> > Unfortunately noone seems to know about MES. It is defined in
> > a CEN standard ENV 1973:1995, and described by Michael Everson
> > on
> >
> > http://www.indigo.ie/egt/standards/mes.html
>
> OTOH, the same standard defines another set, EES, which is
> defined as selected subranges of ISO 10646 (if I remember
> well, Latin, Greek, Cyrillic, Armenian, Georgian and
> the "graphic space", U+2xxx).
>
> See <URL:http://www.indigo.ie/egt/standards/ees.html>
>
> It shares the same properties as MES that Markus described.
> I see it as a better candidate than MES for ISO15646¹!

3000 is already significantly larger than 1000, but I agree that
this might be another reasonable intermediate step, especially
if MES is targeted more towards the lowest-cost application
area (set-top boxes, palm-tops, dumb terminals), while EES is
aimed more at European needs in a GUI environment.

Ok, we could have

  ISO 15646-1 something like MES, WGL4 (minimum European)
  ISO 15646-2 something like EES (GUI European)

What else? I guess, for instance, that Japanese users might also
be interested in a well-defined Unicode subset (ISO 15646-3),
which might for instance be something like MES plus Japanese
ideographs, but no bidi characters, no Indic scripts, etc.

  ISO 15646-3 (Japanese)
  ISO 15646-4 (India)

In the end, ISO 15646 could become a very small family of subsets
for specific language families, very much like ISO 8859 is
already, with around 5 or 8 different subsets that can be easily
referenced.

Then I implement ISO 15646-1 for the European simple application
market and ISO 15646-2 for the European GUI market and
ISO 15646-3 for the Japanese market. And the beautiful thing compared
to ISO 8859 is that if I want to support several of them, I just merge the sets
and I do not need switching mechanisms any more, as all of these
are Unicode subsets anyway. But we will not be talking about Unicode,
we will talk about full 100% ISO 15646-1 implementations, and
there will be no doubt about what characters are supported or not.
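
The merging step amounts to a set union. The following Python sketch
uses tiny invented stand-ins (part_1, part_3) for the envisioned parts;
their contents and the helper is_supported() are purely hypothetical
and far smaller than any real repertoire would be.

```python
# Invented miniature repertoires, for illustration only.
part_1 = frozenset("ABCabc\u00e4\u00f6\u00fc")   # a few Latin letters
part_3 = frozenset("ABCabc\u3042\u3044\u3046")   # Latin letters plus hiragana

# Supporting both is a plain union: every character keeps its single
# Unicode code point, so no switching mechanism is involved.
combined = part_1 | part_3

def is_supported(text, repertoire):
    """True if every character of text belongs to the repertoire."""
    return all(c in repertoire for c in text)

print(is_supported("a\u3042", part_1))    # False
print(is_supported("a\u3042", combined))  # True
```

With ISO 8859 parts, the same combination would need an escape or
announcement mechanism, because the parts reuse the same byte values
for different characters.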

My e-mail MIME header will announce that this posting is in the
ISO 15646-1 character set, and we do not have to talk any more
about "the Unicode subset that Windows-NT 4.0 currently supports in
more than a third of its fonts".

> > I see already with horror today, that programming language standards
> > allow *all* ISO 10646 characters in identifier names.

> I *hope* standards writers don't make such an error as
> using ISO10646 level 3, but rather level 1, which should
> leave the composition problem away and remove a lot of
> burden from the compilers' writer.

Well, I just checked the Java spec, and they do allow combining
characters in identifiers, but they warn about the problems. They
also warn about problems with homoglyphs like Latin letter A and
Cyrillic letter A, but I guess these are unavoidable, even in small
subsets like MES. I hope regular expression mechanisms will
deal with homoglyphs like the micro sign and the Greek small mu, or
the capital letter K and the Kelvin sign (I never understood the
difference between those anyway).
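
One way such look-alikes can be handled is compatibility
normalization. The sketch below uses NFKC from Python's standard
unicodedata module; this is an assumption about how a matcher might be
built, not a mechanism described in this posting or in the Java spec.
The helper name nfkc() is made up.

```python
import unicodedata

def nfkc(s):
    """Apply Unicode compatibility normalization (form NFKC)."""
    return unicodedata.normalize("NFKC", s)

# MICRO SIGN (U+00B5) folds to GREEK SMALL LETTER MU (U+03BC).
print(nfkc("\u00b5") == "\u03bc")   # True

# KELVIN SIGN (U+212A) folds to LATIN CAPITAL LETTER K.
print(nfkc("\u212a") == "K")        # True

# LATIN CAPITAL LETTER A and CYRILLIC CAPITAL LETTER A (U+0410) remain
# distinct: they are different letters that merely look alike.
print(nfkc("A") == nfkc("\u0410"))  # False
```

So the micro/mu and K/Kelvin pairs can be unified mechanically, while
the Latin/Cyrillic case stays a genuine homoglyph problem.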

Markus

-- 
Markus Kuhn, Computer Science grad student, Purdue
University, Indiana, US, email: kuhn@cs.purdue.edu


