Re: MES as an ISO standard?

From: Markus G. Kuhn (kuhn@cs.purdue.edu)
Date: Wed Jul 02 1997 - 13:12:06 EDT


jenkins wrote on 1997-07-02 07:00 UTC:
> I guess my whole question on this "ISO 15646" thing is what's the
> difference between it and a Level 1 implementation of UCS-2 other than
> the size of the repertoire?

The properties of the repertoire!

A full ISO 10646 Level 1 implementation is NOT possible on any fixed-cell-width
system such as VT100 emulators, xterm, kermit, or the system I am using
to write this email. Level 1 contains right-to-left characters as well as
characters that cannot be displayed adequately in a typical 9x14
pixel cell.

If every developer of an application standard has to identify these characters
herself, then she won't do it, because it is too much work, and because
the subset she gets will most likely be slightly different from what someone
else, independently coming up with such a subset, would get. Many different
subsets created independently for the same purpose are not what
standardization is about.

In addition, the sheer size of Level 1 (~40 000 characters versus ~1 000,
a factor of 40) is simply mind-boggling for any non-Asian developer who is
not an i18n expert and who is not specifically developing applications for
the Asian market.

The statement of Adrian Havill that Unicode is already widely accepted
in the minds of developers is, in my very practical experience, SIMPLY
WRONG.

Even today I see many new specifications being written in which Unicode
is not used, because the authors of these standards consider Unicode
too complicated and not of a manageable size for their project.

Examples:

  - I was in the PNG (Portable Network Graphics) file format working
    group, and my proposal to simply use Unicode (either UTF-8 or UCS-2)
    for the comment text fields was very broadly rejected by around 90%
    of the other group members, because they claimed that this was
    overkill for a comment field ("we do not have a full-blown i18n
    desktop publishing system here"), that Unicode is not widely known
    in practice, and that it would cause interoperability problems, since
    the sender of the file does not know which subset of full Unicode the
    receiver can display.

  - The European Digital Video Broadcasting (DVB) system contains an electronic
    program guide protocol. The standard currently uses a primitive
    switching mechanism between several of the ISO 8859 character sets
    and some character sets that the committee invented itself.
    Unicode was rejected because the full standard, with its 40 000
    characters, is not implementable in a set-top box with 1 megabyte of
    ROM, and no easily manageable subsets are defined. I am sure that if
    they had known about MES, or if an ISO 15646 had existed, they would
    have used it, because 1000 characters fit easily into a set-top
    box ROM.

  - The GSM text messaging standard (receive email on the LCD display
    of your mobile phone) uses the old CCITT T.61 teletex character set.
    A well-defined 1000-character subset of Unicode would also have been
    acceptable, but full Unicode is out of the question for implementation
    in a cellular phone.

  - The ISO JPEG standard was recently extended by a wrapper file format
    that supplies JPEG and JBIG data streams with auxiliary information.
    The text fields in this file format use the first byte to switch
    between around a dozen different character sets; ISO 10646 (without
    further qualification) is just one of them. Why not drop the
    switching mechanism and use only ISO 10646? Answer: not well understood,
    too complex ("we do not want to implement 40 000 characters"), and
    interoperability problems due to vendor-dependent subsetting.
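The first-byte switching mechanisms in the last two examples can be sketched as follows. This is a hypothetical minimal decoder; the tag values and the list of character sets are illustrative only, not taken from the DVB or JPEG specifications:

```python
# Hypothetical sketch of a first-byte character-set switch of the kind
# used by the DVB and JPEG wrapper text fields described above.
# The tag values and the chosen sets are invented for illustration.

CHARSET_TAGS = {
    0x00: "iso-8859-1",   # Latin-1
    0x01: "iso-8859-2",   # Latin-2
    0x05: "iso-8859-5",   # Cyrillic
    0x10: "utf-16-be",    # ISO 10646 / UCS-2, big-endian
}

def decode_tagged_text(data: bytes) -> str:
    """Decode a text field whose first byte selects the character set."""
    if not data:
        return ""
    encoding = CHARSET_TAGS.get(data[0])
    if encoding is None:
        raise ValueError(f"unknown character set tag 0x{data[0]:02x}")
    return data[1:].decode(encoding)
```

Every receiver has to implement every listed character set, which is exactly the complexity that a single small fixed subset would remove.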

I could go on with many more examples of standards and products whose
designers have deliberately chosen to ignore Unicode and stick with
8-bit character sets, although these were completely new designs without
any backwards-compatibility constraints whatsoever, and using a 16-bit
character set would not in any way have been a problem.

I have suggested defining Unicode subsets in the examples quoted
above. The answer each time was: "We do not have the linguistic background
to come up with a reasonable subset, and this is far too much work for
our project. If there were a nice simple subset of Unicode available,
we would have a look at it and quote it, but at the moment ISO 8859
looks much simpler for us."

Those who say that Unicode is widely accepted in the industry have simply
forgotten what the industry is. They are obviously unable to think beyond
their Microsoft/Netscape horizon with regard to applications for character
sets. If you want to see the unified 16-bit character set idea limited to
the PC GUI operating system world, then you are in the wrong discussion
here.

I do not care about the name of the standard, call it

  - ISO 10646 Level 0
  - MES
  - ISO 15646
  - ISO 18859
  - EUSCII (EUropean Standard Code for Information Interchange)
  - Unicode Lite
  - Unicode--
  - Minicode
  - EuroCode
  - ...

Just make it an easily readable and referenceable standard without any
options and mechanisms. Just as simple as ISO 8859-1, but 16 bits wide
and with around 1000 characters (which happen to form a Unicode subset),
so that it can be referenced by developers of systems who are not
character set experts.
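For implementors, referencing such a fixed subset would be trivial. A sketch of a conformance check, assuming the subset is published as a plain list of code points (the sample repertoire below is invented, standing in for the ~1000 code points such a standard would actually list):

```python
# Sketch of validating text against a small fixed 16-bit subset standard.
# The repertoire here is a tiny invented sample, NOT the MES repertoire.

SAMPLE_REPERTOIRE = set(range(0x0020, 0x007F))        # ASCII printables
SAMPLE_REPERTOIRE |= set(range(0x00A0, 0x0100))       # Latin-1 supplement
SAMPLE_REPERTOIRE |= {0x20AC}                         # EURO SIGN

def in_subset(text: str) -> bool:
    """True if every character of text lies within the fixed repertoire."""
    return all(ord(ch) in SAMPLE_REPERTOIRE for ch in text)
```

A set lookup over a thousand code points costs a few kilobytes of table, which is why such a subset fits comfortably even into the set-top boxes and phones mentioned above.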

Markus

-- 
Markus G. Kuhn, Computer Science grad student, Purdue
University, Indiana, USA -- email: kuhn@cs.purdue.edu
