Re: texteditors that can process and save in different encodings from Asmus Freytag on 2012-10-21 (Unicode Mail List Archive)

From: Asmus Freytag <asmusf_at_ix.netcom.com>
Date: Sun, 21 Oct 2012 19:48:15 -0700

On 10/21/2012 4:09 AM, Philippe Verdy wrote:
>> Unless there's a way to rebuild the metadata unambiguously or to enforce
>> that it is complete and correct, it's very hard to rely on it for any
>> particular purpose.
> Enforcing that the metadata is correct is perfectly possible, at least
> to ensure that it matches the requirements. (For example, an incorrect
> encoding, given in metadata, should be signaled each time it violates
> one of its rules: this is possible for many standardized text
> encodings, including the UTFs).

It may be possible to do some verification of well-formedness for
well-designed encoding schemes like the UTFs but, pray, how do you tell
apart 8859-1 from 8859-15?
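
To make the contrast concrete, here is a minimal sketch in Python,
assuming the standard library codecs named "utf-8", "iso8859-1" and
"iso8859-15". The helper name accepts() and the sample bytes are made
up for illustration. Strict UTF-8 decoding can reject ill-formed input;
the two 8859 codecs accept every byte sequence, so decoding alone
cannot tell you which label is right.

```python
def accepts(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes under `encoding` without error."""
    try:
        data.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical sample: 0xE9 followed by a space is not well-formed UTF-8.
sample = b"caf\xe9 \xa4 5"

# Which labels survive a strict well-formedness check?
results = {enc: accepts(sample, enc)
           for enc in ("utf-8", "iso8859-1", "iso8859-15")}

# Both 8859 labels pass, yet they yield different text for byte 0xA4.
as_latin1 = sample.decode("iso8859-1")
as_latin9 = sample.decode("iso8859-15")
```

Both 8859 labels pass the check, and nothing in the bytes says which of
the two resulting strings was intended.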

These are not rare character sets, and enforcement for them, as for any
other part of the 8859 series, would only be possible if you were to do
the very same character-set sniffing that you so dislike.

If you run a variation of a language detector, it's possible to detect,
for example, that the text is in Icelandic, and therefore requires
8859-1 instead of 8859-15. That is because the eight code points that
are mapped to different characters in these two sets would
(statistically) appear in the wrong context.
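
As a rough sketch of where such a detector would have to look, the
following Python lists the byte values on which the two codecs disagree
and reports both readings for each such byte in a sample. It assumes
Python's "iso8859-1" and "iso8859-15" codecs; the function names are
hypothetical, and this is not a language detector, only the raw
material one would feed into it.

```python
ALL_BYTES = bytes(range(256))

def differing_bytes() -> list:
    """Byte values that decode differently in 8859-1 and 8859-15."""
    latin1 = ALL_BYTES.decode("iso8859-1")
    latin9 = ALL_BYTES.decode("iso8859-15")
    return [b for b in range(256) if latin1[b] != latin9[b]]

def ambiguous_positions(data: bytes) -> list:
    """(offset, 8859-1 reading, 8859-15 reading) for each ambiguous byte."""
    diff = set(differing_bytes())
    return [
        (i, bytes([b]).decode("iso8859-1"), bytes([b]).decode("iso8859-15"))
        for i, b in enumerate(data)
        if b in diff
    ]

# Hypothetical sample: a quantity written with byte 0xBD.
hits = ambiguous_positions(b"1\xbd kg")
```

For the sample above, the one ambiguous byte reads as "½" under 8859-1
and as "œ" under 8859-15. Deciding which reading fits the surrounding
text is exactly the statistical, context-dependent step.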

This is something a clever text editing (or HTML editing) tool could do,
but not something that you can build into an OS.

Anyway, to cut the discussion short, I'd love to see a working example
of any system where metadata are 100% reliable.

A./
Received on Sun Oct 21 2012 - 21:54:54 CDT