Re: Normalizing Transcoders

From: Mark Davis (mark@macchiato.com)
Date: Mon Jun 24 2002 - 10:15:44 EDT


> Working in Java, which of the commonly supported character encoding might
> have non-normalizing transcoders. And, with the transcoders shipped with
> Java which are non-normalizing?

Any code page with combining marks will generally not have a
normalizing transcoder. That includes the Unicode encoding forms
themselves, ISO Latin Arabic (ISO-8859-6), Windows Thai (cp874), etc.

To check converters in detail, the easiest thing to do is to
- take each converter;
- convert all Unicode characters to it and back (sifting out the
  rejects);
- check the result with QuickCheck as described in TR#15*;
- if there are any NO or MAYBE results, then the converter is
  non-normalizing.
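
The steps above can be sketched roughly as follows. This is only an
illustration under several assumptions, not the ICU4J procedure: it
uses java.text.Normalizer.isNormalized from the JDK (Java 6 and later)
in place of ICU4J's QuickCheck, so it gives a plain yes/no answer with
no MAYBE result, and it is less strict (a lone combining mark counts as
NFC here). It also round-trips one code point at a time, which is one
possible reading of "convert all Unicode characters". The class name
RoundTripCheck and the method countNonNfc are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.text.Normalizer;

public class RoundTripCheck {

    /**
     * Counts the code points that survive a round trip through the given
     * charset but come back as text that is not in NFC.
     * (Hypothetical helper; a nonzero count suggests a non-normalizing converter.)
     */
    static int countNonNfc(Charset cs) {
        // REPORT makes unmappable characters throw, so rejects can be sifted out.
        CharsetEncoder enc = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharsetDecoder dec = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        int count = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                continue; // surrogate code points are not characters
            }
            String in = new String(Character.toChars(cp));
            try {
                // Unicode -> charset -> Unicode
                ByteBuffer bytes = enc.encode(CharBuffer.wrap(in));
                String out = dec.decode(bytes).toString();
                if (!Normalizer.isNormalized(out, Normalizer.Form.NFC)) {
                    count++;
                }
            } catch (CharacterCodingException e) {
                // Reject: this code point has no mapping in the charset.
            }
        }
        return count;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"US-ASCII", "ISO-8859-1", "UTF-8"}) {
            int n = countNonNfc(Charset.forName(name));
            System.out.println(name + ": " + n + " round-tripped code points not in NFC");
        }
    }
}
```

A count of zero only says that no single round-tripped character is
outside NFC; it does not cover sequences (a base letter followed by a
combining mark, for instance), so the QuickCheck test described in
TR#15 is still the stricter one.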

* There is an implementation of QuickCheck in ICU4J
(http://oss.software.ibm.com/icu/), in the current 2.2 snapshot (2.2
final will be released this summer, but I think the snapshot should do
the trick).

Mark
__________
http://www.macchiato.com
◄ “Eppur si muove” ►

----- Original Message -----
From: "Jeremy Carroll" <jjc@hplb.hpl.hp.com>
To: <w3c-i18n-ig@w3.org>
Sent: Monday, June 24, 2002 01:52
Subject: Normalizing Transcoders

>
>
>
> The W3C Character Model Working Draft [1] defines the concept of a
> normalizing transcoder. This is a transcoder from a non-UCS based
> encoding to Unicode whose output is in NFC.
>
> Working in Java, which of the commonly supported character encoding might
> have non-normalizing transcoders. And, with the transcoders shipped with
> Java which are non-normalizing?
>
> (e.g. I suspect for ASCII it is impossible to write a non-normalizing
> transcoder, I don't know iso-8859-1 backwards, but also get the impression
> that all actual transcoders will be normalizing).
>
> To bound the issue my ambitions do not stretch beyond those encodings
> supported by Xerces-J. These are listed as:
>
> UTF-8
> UTF-16 Big Endian, UTF-16 Little Endian
> IBM-1208
> ISO Latin-1 (ISO-8859-1)
> ISO Latin-2 (ISO-8859-2) [Bosnian, Croatian, Czech, Hungarian, Polish,
> Romanian, Serbian (in Latin transcription), Serbocroatian, Slovak,
> Slovenian, Upper and Lower Sorbian]
> ISO Latin-3 (ISO-8859-3) [Maltese, Esperanto]
> ISO Latin-4 (ISO-8859-4)
> ISO Latin Cyrillic (ISO-8859-5)
> ISO Latin Arabic (ISO-8859-6)
> ISO Latin Greek (ISO-8859-7)
> ISO Latin Hebrew (ISO-8859-8)
> ISO Latin-5 (ISO-8859-9) [Turkish]
> Extended Unix Code, packed for Japanese (euc-jp, eucjis)
> Japanese Shift JIS (shift-jis)
> Chinese (big5)
> Chinese for PRC (mixed 1/2 byte) (gb2312)
> Japanese ISO-2022-JP (iso-2022-jp)
> Cyrillic (koi8-r)
> Extended Unix Code, packed for Korean (euc-kr)
> Russian Unix, Cyrillic (koi8-r)
> Windows Thai (cp874)
> Latin 1 Windows (cp1252) (and all other cp125? encodings recognized by IANA)
> cp858
> EBCDIC encodings:
> EBCDIC US (ebcdic-cp-us)
> EBCDIC Canada (ebcdic-cp-ca)
> EBCDIC Netherland (ebcdic-cp-nl)
> EBCDIC Denmark (ebcdic-cp-dk)
> EBCDIC Norway (ebcdic-cp-no)
> EBCDIC Finland (ebcdic-cp-fi)
> EBCDIC Sweden (ebcdic-cp-se)
> EBCDIC Italy (ebcdic-cp-it)
> EBCDIC Spain, Latin America (ebcdic-cp-es)
> EBCDIC Great Britain (ebcdic-cp-gb)
> EBCDIC France (ebcdic-cp-fr)
> EBCDIC Hebrew (ebcdic-cp-he)
> EBCDIC Switzerland (ebcdic-cp-ch)
> EBCDIC Roece (ebcdic-cp-roece)
> EBCDIC Yugoslavia (ebcdic-cp-yu)
> EBCDIC Iceland (ebcdic-cp-is)
> EBCDIC Urdu (ebcdic-cp-ar2)
> Latin 0 EBCDIC
> EBCDIC Arabic (ebcdic-cp-ar1)
>
> Jeremy
>
> [1] Charmod
> http://www.w3.org/TR/charmod
>
>
>


