L2/00-059
UTC/2000-010

"Martin J. Duerst" <duerst@w3.org> on 2000/01/27 06:31:04 PM

Subject:  Re: [li18nux:308] Re: JTC1 conversion tables


At 23:59 00/01/27 +0100, Keld Jørn Simonsen wrote:
> On Thu, Jan 27, 2000 at 07:26:19PM +0100, Bruno Haible wrote:
> > Markus Kuhn writes:
> >
> > > The Unicode Consortium (especially Kenneth Whistler) on the other hand
> > > has almost always been able to answer problems competently within less
> > > than 48 hours. That is why I haven't been impressed with the JTC1
> > > cultural registry so far as a primary source of mapping tables. I prefer
> > > to align all my own mapping data (via a few simple Perl scripts) with
> > > ftp://ftp.unicode.org/ and that is what I also recommend to anyone else.
> >
> > Furthermore the unicode.org tables represent an agreement with the
> > participation of companies including Microsoft, therefore if we use these
> > tables, we can hope for being interoperable with CR/LF based operating systems.
>
> That is one of the problems with the unicode tables, they are controlled
> by a closed consortium.

That consortium is less closed than it seems.

> They are not produced according to
> an open process, and not standardized (they do not have a format
> defined by an open standard).

The standard is difficult to get, and expensive. And if error corrections
take three months, that's not what we need. We need Internet speed.

What I think we need is:

- A format that is well defined, and easy to process with
  widely used tools. Who defined it is rather irrelevant,
  as long as we are happy with its functionality. A data format
  for conversions is not something you can make money with
  or use to take over the world, so whether it's defined by an
  'open standard' or by someone else is not that relevant here.

- A location (ideally a single one) that is stable and has a
  certain authority, but is flexible enough to accept variants
  if they are needed, and react to problems quickly. Neither
  Unicode nor JTC1/dkuug are there yet, but Unicode is closer
  in my view, and we could probably get it there. If we don't
  get there, we use our own.

- One problem I want to mention for the Unicode site is that
  some of the Chinese tables contain 'ghost codepoints'. This,
  according to my information, resulted from the construction
  of the unified ideographic repertoire. Both China and Taiwan
  apparently added some characters to their standards because
  they urgently wanted them in the unified repertoire, but
  neither the base standards nor the fonts/implementations
  have followed. So the tables are not directly usable, but
  they haven't been changed because they are 'official'.
  At least that's as far as I understand the thing.

- The Unicode TR #22 format at the moment has various problems
  that should be fixed:
  - There is an error in the URI escaping (reported to the author).
  - It is not exactly clear what can be defined with it and
    what not. The description should be improved. For example,
    it is not clear whether it can define iso-2022-jp or not.
    It would be nice if it could, and it should say so clearly
    if it can't.
  - It uses attributes instead of elements for some fields
    where free text can be used. This should be changed.
  - Naming should be revamped. Having a field containing
    an IANA 'charset' name or something else doesn't work,
    because there may be overlaps.
  - There should be only one conversion in one file.
    Including both usual conversions and conversions in
    the case of glyphs stored at control character positions
    should not be done.

  These are the main points on TR #22 I have.
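
To illustrate the first point above (a format that is easy to process
with widely used tools), here is a minimal Python sketch. It assumes the
plain-text layout commonly used for the tables on ftp.unicode.org: one
mapping per line, a legacy code and a Unicode code point in hex,
whitespace-separated, with '#' starting a comment. The function names
and the sample data are hypothetical and only serve as an illustration;
this is not a reference implementation of any of the formats discussed.

```python
def parse_mapping_table(text):
    """Parse lines like '0xA1<TAB>0x0104<TAB># comment' into a dict."""
    table = {}
    for line in text.splitlines():
        # Drop the trailing comment and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            # Legacy code listed without a Unicode mapping: skip it.
            continue
        # int(..., 16) accepts the '0x' prefix.
        legacy, ucs = int(fields[0], 16), int(fields[1], 16)
        table[legacy] = ucs
    return table


def non_roundtrip(table):
    """Return Unicode code points that more than one legacy code maps to."""
    seen = {}
    for legacy, ucs in table.items():
        seen.setdefault(ucs, []).append(legacy)
    return {ucs: codes for ucs, codes in seen.items() if len(codes) > 1}


# Hypothetical excerpt, for illustration only (assumption: not real data).
SAMPLE = """\
# legacy\tUnicode\tname
0x41\t0x0041\t# LATIN CAPITAL LETTER A
0xA1\t0x0104\t# LATIN CAPITAL LETTER A WITH OGONEK
0xA2\t0x0104\t# deliberately duplicated target
0x80\t\t# undefined
"""

if __name__ == "__main__":
    table = parse_mapping_table(SAMPLE)
    for ucs, codes in non_roundtrip(table).items():
        print("U+%04X is the target of: %s"
              % (ucs, ", ".join("0x%02X" % c for c in codes)))
```

A check like non_roundtrip() is the kind of quick consistency test that
a simple line-based format makes cheap to write, which is part of the
case for keeping the format tool-friendly.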


Regards,   Martin.


#-#-#  Martin J. Du"rst, World Wide Web Consortium
#-#-#  mailto:duerst@w3.org   http://www.w3.org

