Re: Encoding alternate character sets in tEXt/zTXt strings

From: Adrian Havill (havill@threeweb.ad.jp)
Date: Wed Mar 18 1998 - 20:38:46 EST


> >iTXT 4 bytes
> >language code, as specified in RFC 1766 (for examples of use see LANG
> >attribute in HTML 4.0[2], and the xml:lang attribute in XML[3])
> >null byte
> >keyword, UTF-8
> >null byte
> >compressed text, UTF-8
> >null byte
> >checksum
>
> At the risk of seeming anglocentric: the chunk should begin with
> a keyword in Latin-1, and the keyword should be chosen from the
> same set as those used for tEXt and zTXt, so that libpng functions
> can be used to process the chunk by keyword even when UTF isn't
> supported. Also the Latin-1 keyword should appear first:

Ummm... isn't this a hack? Is this implying that the libpng functions are not
8-bit clean? If the keyword is not in English and libpng searches for an ASCII
keyword, it is guaranteed not to produce a false hit (because UTF-8 is
unambiguous: every byte of a non-ASCII character has the 8th bit set).

I think that UTF-8 never has the 0xFF code anywhere in it... nor 0x00, except as
the encoding of the NUL character itself.
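
A quick way to check both claims is sketched below in Python. The sample string
and variable names are only illustrations, not anything from libpng or the PNG
spec.

```python
# Encode a mixed ASCII / Japanese sample and inspect the raw bytes.
sample = "Fuji 富士山"
encoded = sample.encode("utf-8")

# Every byte that belongs to a non-ASCII character has the 8th bit set.
non_ascii_bytes = "富士山".encode("utf-8")
all_high = all(b & 0x80 for b in non_ascii_bytes)

# 0xFF never occurs in UTF-8 output, and 0x00 occurs only when the text
# itself contains U+0000, which this sample does not.
has_ff = 0xFF in encoded
has_nul = 0x00 in encoded

# Broader check over a range of code points (skipping U+0000 itself).
clean = all(
    0xFF not in chr(cp).encode("utf-8") and 0x00 not in chr(cp).encode("utf-8")
    for cp in range(1, 0x3000)
)
```

So a null byte can still serve as a field terminator after a UTF-8 keyword.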

> iTXt 4 bytes
> keyword (1 or more bytes, Latin 1 text, see PNG rules for tEXt keywords)
> null byte
> character set (1 byte)
> 0: UTF-7
> 1: UTF-8

This is redundant. If libpng can handle an 8-bit null-terminated string, it
should stick with UTF-8. If it can't (b/c it uses the eighth bit for control
purposes), it should use UTF-7.
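
To make the difference concrete, here is a small sketch using Python's built-in
"utf-8" and "utf-7" codecs. The sample word is an assumption chosen for
illustration.

```python
text = "寿司"  # illustrative sample, not from the thread

utf8 = text.encode("utf-8")
utf7 = text.encode("utf-7")

# UTF-8 needs an 8-bit clean channel: its non-ASCII bytes have the high bit set.
utf8_is_7bit = all(b < 0x80 for b in utf8)

# UTF-7 stays inside 7-bit ASCII, at the cost of a less direct encoding.
utf7_is_7bit = all(b < 0x80 for b in utf7)

# Both round-trip to the same text, and neither contains a null byte here.
same_text = utf8.decode("utf-8") == utf7.decode("utf-7") == text
no_nulls = 0 not in utf8 and 0 not in utf7
```

Either encoding can therefore sit inside a null-terminated field; the only
question is whether the code handling it is 8-bit clean.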

> language code (see RFC 1766)

You'll run into trouble with this... The text could be in multiple languages.
The language code and such is really only appropriate for cultural rendering and
sorting, and its correct implementation is very complicated. Best to leave it as
just UTF-8.

> null byte
> keyword (translated into the specified language and charset, not compressed)

Again... you'll run into problems with "translated" keywords because translation
into Latin-1 is often ambiguous for certain languages. Example: The word "Sushi"
in Japanese can be romanized as "Sushi" or "Susi", and Mount "Fuji" can be
romanized as "Huzi" or "Fuji", depending on the method of romanization used.

What happens if the person doesn't know how to translate the word? Or if there
is no direct translation for a word? (i.e., a word in a language has to be
translated into a phrase in order to represent the meaning)

Best to leave it at one keyword (to avoid causing situations like ambiguous
search problems), and fix libpng to do an 8-bit clean search/processing of
keywords.
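
A minimal sketch of what an 8-bit clean keyword search could look like, written
in Python rather than C for brevity. The function names split_keyword and
keyword_matches are hypothetical and are not libpng functions; the only
assumption about the data is "keyword, then a null byte, then the rest".

```python
def split_keyword(chunk_data):
    """Split chunk data at the first null byte into (keyword, rest)."""
    keyword, sep, rest = chunk_data.partition(b"\x00")
    if not sep:
        raise ValueError("no null terminator after keyword")
    return keyword, rest


def keyword_matches(chunk_data, wanted):
    """Compare the stored keyword with `wanted` byte for byte.

    The comparison works on raw bytes, so no byte value other than the
    terminating 0x00 is treated specially.
    """
    keyword, _ = split_keyword(chunk_data)
    return keyword == wanted.encode("utf-8")


# Hypothetical chunk data: a UTF-8 keyword, a null byte, then the text.
data = "題名".encode("utf-8") + b"\x00" + "富士山".encode("utf-8")
hit = keyword_matches(data, "題名")
miss = keyword_matches(data, "Title")
```

Because the match is a plain byte comparison, an ASCII keyword can never be
mistaken for a non-ASCII one, and a single keyword field is enough.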

> null byte
> compression method (1 byte)
> 0: zlib DEFLATE
> 1: uncompressed
> compressed or uncompressed UTF text (includes checksum if compressed)
>
> This solution is OK by me. I mentioned the uuencoding in response
> to the assertion that it's not possible to store UTF in PNG files now.
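
For reference, the field order quoted above can be sketched as a pair of Python
helpers. This follows only the layout as quoted in this message, not any
published PNG specification; the names build_proposed_itxt and
parse_proposed_itxt are hypothetical, and treating the language code as ASCII
is an assumption.

```python
import zlib

# Charset byte values exactly as listed in the quoted proposal.
CHARSETS = {0: "utf-7", 1: "utf-8"}


def build_proposed_itxt(keyword, charset, language, translated, text, compress):
    """Serialize chunk data in the field order of the quoted proposal."""
    codec = CHARSETS[charset]
    payload = text.encode(codec)
    if compress:
        # A zlib stream carries its own Adler-32 checksum.
        payload = zlib.compress(payload)
    return b"".join([
        keyword.encode("latin-1"), b"\x00",
        bytes([charset]),
        language.encode("ascii"), b"\x00",
        translated.encode(codec), b"\x00",
        bytes([0 if compress else 1]),  # 0: zlib DEFLATE, 1: uncompressed
        payload,
    ])


def parse_proposed_itxt(data):
    """Parse chunk data laid out as in the quoted proposal."""
    keyword, sep, rest = data.partition(b"\x00")
    if not sep or not rest:
        raise ValueError("truncated chunk data")
    codec = CHARSETS[rest[0]]
    language, _, rest = rest[1:].partition(b"\x00")
    translated, _, rest = rest.partition(b"\x00")
    if not rest:
        raise ValueError("missing compression method")
    method, payload = rest[0], rest[1:]
    if method == 0:
        payload = zlib.decompress(payload)
    elif method != 1:
        raise ValueError("unknown compression method")
    return {
        "keyword": keyword.decode("latin-1"),
        "language": language.decode("ascii"),
        "translated_keyword": translated.decode(codec),
        "text": payload.decode(codec),
    }


chunk = build_proposed_itxt("Title", 1, "ja", "題名", "富士山", compress=True)
fields = parse_proposed_itxt(chunk)
```

Note that the parser has to carry a charset table, a language field, and two
keywords, which is the extra machinery argued against above.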


