Re: texteditors that can process and save in different encodings

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Thu, 18 Oct 2012 00:09:25 +0200

This depends on how these converters are integrated in the editors.
While there are good reasons to embed a few common converters in
editors, there is also a place for optional pluggable conversions,
when needed. The way they are plugged in does not matter: it could be
a support library that is part of the OS, as in Windows; a common
library shared by various applications, like iconv on Linux and other
*nix systems; or a selectable conversion table that the editor can
interpret and that the user adds in the editor's settings.
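As a rough illustration of the "selectable conversion table" option,
here is a minimal Python sketch. The ConverterRegistry class and its
method names are hypothetical, not taken from any real editor; names
that are not registered simply fall through to Python's own codec
library, which stands in here for a shared library such as iconv.

```python
class ConverterRegistry:
    """Hypothetical editor-side registry of pluggable 8-bit tables."""

    def __init__(self):
        # name -> (byte-to-character dict, character-to-byte dict)
        self._tables = {}

    def register_table(self, name, byte_to_char):
        """Plug in a user-supplied table mapping byte values to characters."""
        decoding = dict(byte_to_char)
        encoding = {ch: b for b, ch in decoding.items()}
        self._tables[name.lower()] = (decoding, encoding)

    def decode(self, data, name):
        entry = self._tables.get(name.lower())
        if entry is None:
            # Not a plugged-in table: defer to the shared codec library.
            return data.decode(name)
        decoding, _ = entry
        # Bytes missing from the table become U+FFFD REPLACEMENT CHARACTER.
        return "".join(decoding.get(b, "\ufffd") for b in data)

    def encode(self, text, name):
        entry = self._tables.get(name.lower())
        if entry is None:
            return text.encode(name)
        _, encoding = entry
        # Raises KeyError for characters the table cannot represent.
        return bytes(encoding[ch] for ch in text)
```

An editor built this way would only need to ship the registry; the
tables themselves could be loaded from user settings.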

But the most basic converters between encodings (not syntax
transformers such as those converting characters into escape sequences
for specific computer languages) should be integrated. This includes
the standard UTFs, notably UTF-8 and probably UTF-16, ASCII, and most
probably ISO-8859-1 and its Windows-1252 extension, which replaces the
deprecated C1 controls of ISO 8859 with printable characters, as now
agreed in HTML5 and in most common practice.

This should also include support for the local encodings that are
already natively integrated in the OS as its legacy 8-bit encoding.
These should be supported through the local OS APIs whenever that
interface is present, that is, when it has not been deprecated and
removed from the basic OS installation because most of its services no
longer needed it. The same applies to encodings that the OS must
provide to support national standards, like GB18030 in versions of
these OSes for the P.R. of China: there is no need to integrate them
in the editor itself, just detect and use the local OS services.
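To make the ISO-8859-1 versus Windows-1252 point concrete, here is a
small Python sketch. The helper names convert and os_legacy_encoding
are mine, for illustration only; the second one just asks the OS
(through Python's locale module) which encoding it prefers, instead
of hard-coding one in the editor.

```python
import locale

# Bytes in the 0x80-0x9F range are where the two encodings differ.
raw = bytes([0x80, 0x93, 0x94])

as_latin1 = raw.decode("iso-8859-1")  # C1 control characters
as_cp1252 = raw.decode("cp1252")      # euro sign and curly double quotes


def convert(data: bytes, source: str, target: str) -> bytes:
    """Re-encode data from one character encoding to another."""
    return data.decode(source).encode(target)


def os_legacy_encoding() -> str:
    """Name of the encoding the local OS reports as its preferred one."""
    return locale.getpreferredencoding(False)
```

For example, convert(b"\x93hi\x94", "cp1252", "utf-8") yields the
UTF-8 bytes for the same text with curly quotes. On many current
systems os_legacy_encoding() will simply report UTF-8.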

However, there are other needs to address: interoperability between
the common OSes used in the same country. This is not just a question
of encoding converters, but also of OS conventions such as the
representation of newlines. As of today, two newline conventions
should be supported and convertible: LF and CR+LF. Legacy support for
CR alone (only needed for very old versions of Mac OS that are no
longer supported directly), or for C1 NEL (U+0085, on the remaining
EBCDIC-based OSes that still do not use a modified EBCDIC code page
with C0 LF instead), could remain optional and not installed by
default. This would encourage the conversion of older texts with
external tools, the cleanup of legacy data, and simpler code
development and support later on.
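A minimal Python sketch of that policy, assuming the text has already
been decoded to Unicode: LF and CR+LF are always handled, while lone
CR and NEL (U+0085) are recognized only when the caller opts in. The
function name convert_newlines and its legacy flag are illustrative
choices, not an existing API.

```python
import re

# CR+LF must come first so it is matched as one unit, not as CR then LF.
_DEFAULT_NEWLINES = re.compile(r"\r\n|\n")
_LEGACY_NEWLINES = re.compile(r"\r\n|\n|\r|\x85")


def convert_newlines(text, target="\n", legacy=False):
    """Rewrite every recognized newline in text as target.

    target must be LF or CR+LF. With legacy=True, lone CR and
    NEL (U+0085) are also treated as newlines.
    """
    if target not in ("\n", "\r\n"):
        raise ValueError("target must be LF or CR+LF")
    pattern = _LEGACY_NEWLINES if legacy else _DEFAULT_NEWLINES
    # A function replacement avoids any escape processing of target.
    return pattern.sub(lambda match: target, text)
```

By default a lone CR is left untouched, so old Mac OS files pass
through unchanged unless the optional legacy handling is requested.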

For legacy systems that still have no native support for the standard
UTFs (or even ASCII), the local installation of a separate code
converter (not necessarily in the editors themselves) should be
encouraged, to allow the progressive migration of existing data.

This is not always easy when the encoded text lives not in text files
but in database files with a binary structure. For those, the database
programs should be upgraded to support code conversion through their
API, so that applications are no longer required to perform their
queries in the legacy encodings, even if the database storage files
are not converted and will not be for a long time. If these database
files are kept only for archival reasons, they do not need to be
converted, but they can be queried in read-only mode through an SQL
API that returns their content transparently converted to a UCS-based
encoding. This allows applications to be maintained and migrated to
newer, more versatile encodings with better interoperability and
lower maintenance costs.
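The idea of querying unconverted archives through a converting API
can be sketched with Python's sqlite3 module. This is only an
illustration under my own assumptions: the legacy text is stored as
raw bytes (BLOB values), the encoding is known in advance, and the
names query_as_unicode and build_demo_archive, like the table layout,
are hypothetical.

```python
import sqlite3


def query_as_unicode(conn, sql, params=(), legacy_encoding="cp1252"):
    """Run a query and yield rows with legacy byte strings decoded.

    The stored data is never modified; conversion happens only on
    the values returned to the caller.
    """
    for row in conn.execute(sql, params):
        yield tuple(
            value.decode(legacy_encoding) if isinstance(value, bytes) else value
            for value in row
        )


def build_demo_archive():
    """Create an in-memory stand-in for an old archive (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body BLOB)")
    # The body is stored as Windows-1252 bytes, as a legacy program might.
    conn.execute("INSERT INTO notes VALUES (1, ?)", ("caf\u00e9".encode("cp1252"),))
    conn.commit()
    return conn
```

For a real archive file, sqlite3.connect("file:archive.db?mode=ro",
uri=True) opens it read-only (archive.db being a placeholder name),
so the application sees Unicode while the file stays untouched.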

2012/10/16 Stephan Stiller <stephan.stiller_at_gmail.com>:
>
>> And I think it’s really better to use dedicated code converters rather
>> than build a large number of character code and encoding conversions into
>> various application programs.
>
>
> Do you (or others with experience) have opinions on specific programs
> (iconv, uconv, ICU facilities, ...) and their limitations or powers?
>
> Stephan
>
>
Received on Wed Oct 17 2012 - 17:16:33 CDT
