Re: texteditors that can process and save in different encodings

From: Doug Ewell <doug_at_ewellic.org>
Date: Sat, 20 Oct 2012 15:39:19 -0600

When a Major Software Company, which sells the Well-Known Operating
System that I and a few other people use and develop for, decides to add
character-encoding metadata to the file system of that OS, and when
versions of that file system that support encoding metadata are
widespread enough that I no longer need to target my apps to previous
versions, then I too will consider encoding detection to be a thing of
the past.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell
-----Original Message----- 
From: Philippe Verdy
Sent: Friday, October 19, 2012 16:45
To: Doug Ewell
Cc: Stephan Stiller ; unicode_at_unicode.org
Subject: Re: texteditors that can process and save in different 
encodings
2012/10/20 Doug Ewell <doug_at_ewellic.org>:
> Suppose I have a file called 'karenina.txt' on my flash drive. Let's
> assume we can trust from the .txt extension that it really is a text
> file of some sort (that is metadata). Now, what encoding is this file
> in?
Maybe you can't know that, or maybe the filesystem still stores that
information (it can do that independently of the given and visible
filename).
> See Stephan's comment again about the editor doing charset
> detection.
I don't like charset detection at all. I'm a strong supporter of
separately stored metadata. It is always possible in all filesystems,
even if this requires a convention for organizing the content of that
filesystem.
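One such convention could be sketched as follows, in Python. This is
only an illustration: the ".charset" sidecar suffix and both function
names are hypothetical, chosen for this example, and are not an
existing standard or anything proposed in this thread. The reader
refuses to guess when no metadata has been recorded.

```python
import os

# Hypothetical naming convention for this sketch, not a standard.
SIDECAR_SUFFIX = ".charset"


def write_text_with_charset(path, text, encoding):
    """Write text in the given encoding and record that encoding
    in a small sidecar file stored next to the text file."""
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(text)
    with open(path + SIDECAR_SUFFIX, "w", encoding="ascii") as f:
        f.write(encoding)


def read_text_with_charset(path):
    """Read text using the recorded encoding.
    Raise LookupError instead of guessing when no metadata exists."""
    sidecar = path + SIDECAR_SUFFIX
    if not os.path.exists(sidecar):
        raise LookupError("no encoding metadata recorded for " + path)
    with open(sidecar, encoding="ascii") as f:
        encoding = f.read().strip()
    with open(path, encoding=encoding, newline="") as f:
        return f.read()
```

With this in place, write_text_with_charset("karenina.txt", text,
"koi8-r") followed by read_text_with_charset("karenina.txt") returns
the original text without any detection step. The obvious weakness is
that the sidecar can be separated from the file when it is copied.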
> Right, but you talked about "saving them as ASCII (i.e. saving this
> charset information in the metadata)". This is explicit metadata, not
> the implicit type that you're talking about now.
Why? He saves in ASCII because this is what the editor will do.
There is not necessarily a choice: the file will still be stored as
ASCII even if the editor does not *itself* store that metadata along
with the file content at the same time. The user may store the
metadata himself, so that later processing in other tools or by other
users avoids wrong "guesses", including the hazardous guesses made by
automatic charset detectors, which I absolutely don't like because
sooner or later they will always fail silently with a wrong guess.
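A small Python sketch of why such guesses can fail silently: the same
two bytes decode without any error under several encodings, each
giving a different text. The candidate list and the helper name
decodes_cleanly are choices made for this illustration only; real
detectors use statistical heuristics rather than this simple check.

```python
# The two bytes that UTF-8 uses for "é" (U+00E9).
raw = "é".encode("utf-8")


def decodes_cleanly(data, encoding):
    """Return True if data decodes under encoding without an error."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Candidate encodings picked for this example.
candidates = ["utf-8", "latin-1", "cp1252", "koi8-r"]

# Every encoding that accepts the bytes is a "plausible" guess.
plausible = [enc for enc in candidates if decodes_cleanly(raw, enc)]

# Map each plausible encoding to the text it would produce.
readings = {enc: raw.decode(enc) for enc in plausible}

for enc, text in readings.items():
    print(enc, repr(text))
```

All four candidates accept the bytes, so nothing in the byte stream
itself reports a wrong choice; only recorded metadata says which
reading was intended.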
As I'm a strong supporter of metadata, these metadata should never be
ignored by editors where they are accessible (and notably when they
are part of the storage properties and capabilities of the
filesystem).
Each time a user needs to supply the missing metadata again himself,
using his own guesses or some automatic detector implemented in his
software, this will inevitably break.
Just as you want to work on a file only once, and encode it only
once, you should never depend on future guesses, even if (and
notably when) the file is later transparently converted to a more
convenient encoding for some other editors or tools. The metadata is
as important to preserve and transmit as the content.
A "text file" without the specification of its metadata about how it
is encoded is absolutely not "plain text" for me. It's just a binary
stream, even if it has a "file name" or a basic extension (like
".txt") that does not specify correctly how to read and process it. 
Received on Sat Oct 20 2012 - 16:44:18 CDT
