Re: latin1 decoder implementation from Philippe Verdy on 2012-11-18 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Mon, 19 Nov 2012 01:28:42 +0100

I think that Python will instead provide a factory that returns the
appropriate concrete codec when given an encoding code and the standards
body to which it must conform: ISO, IETF (for MIME and the IANA
database, as specified in RFCs), W3C (for HTML5), ITU (for some GSM
encodings and encodings used in teletext), and possibly other private
standards (e.g. Microsoft, IBM, Apple and Adobe for their own code pages).
Given the standard type (or registry), the encoding "name" can be
mapped correctly without needing reimplementations or new conformance
tests and validations.
Note that the default Encoding class in Java has no such indication
of the registry; it assumes its own registry, which does not recognize the
same set of encoding names and aliases.
Now if you look at the list of encodings supported by each OS, each one has
its own flavor, so the OS type would also have to be indicated as one of
the possible registries. Some of them distinguish between capitalization
forms, or differ in their use of separators. There are also registries
implemented in various RDBMS engines (some of them store the mapping in a
system table where it is extensible; it is sometimes implemented as a
simple table, sometimes as a Java class or a procedure/function written
in the query language and stored in the database).
In other words, before the layer implementing the actual codecs, there's a
layer to map the various possible registries.
A factory could also be implemented by first looking up a few entries among
its own definitions, and then searching for aliases within another default
registry. Registries can be chained, but the IANA database should be at the
end of all chains starting from a given registry.
Registries may also be "pluggable" besides what is provided at the library
or system level, using an EncodingProvider that implements a registry.
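
The chaining described above can be sketched roughly as follows. This is
only an illustration: the Registry class, the fold() helper and the two
small alias tables are hypothetical names made up for the example, not an
existing Python API. Only codecs.lookup() is a real library call. The
HTML5 table reflects the HTML5 rule that treats the latin1 labels as
windows-1252, which is the kind of difference a registry layer has to
capture.

```python
import codecs


def fold(label):
    """Fold case and drop separators, as a lenient registry might."""
    return "".join(ch for ch in label.lower() if ch.isalnum())


class Registry:
    """Hypothetical registry: own entries first, then the parent chain."""

    def __init__(self, name, entries, parent=None, normalize=fold):
        self.name = name
        self.parent = parent
        self.normalize = normalize  # each registry may compare names differently
        self.entries = {normalize(k): v for k, v in entries.items()}

    def resolve(self, label):
        """Return the Python codec name for a label, walking the chain."""
        registry = self
        while registry is not None:
            target = registry.entries.get(registry.normalize(label))
            if target is not None:
                return target
            registry = registry.parent
        raise LookupError(f"unknown encoding {label!r} in registry {self.name}")

    def codec(self, label):
        """Factory step: hand back the concrete codec (a codecs.CodecInfo)."""
        return codecs.lookup(self.resolve(label))


# Illustrative subsets only; the real registries are far larger.
IANA = Registry("IANA", {
    "ISO_8859-1:1987": "iso8859-1",
    "ISO-8859-1": "iso8859-1",
    "latin1": "iso8859-1",
    "windows-1252": "cp1252",
})
HTML5 = Registry("W3C/HTML5", {
    "iso-8859-1": "cp1252",
    "latin1": "cp1252",
}, parent=IANA)  # IANA sits at the end of the chain

iana_text, _ = IANA.codec("Latin1").decode(b"\x80")
html5_text, _ = HTML5.codec("Latin1").decode(b"\x80")
```

The same label then yields a different concrete codec depending on the
registry it is looked up in, while a name that HTML5 does not define
itself (here "windows-1252") falls through to the IANA entries.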

A good codec implementation should also support these three modes of
operation for unknown/invalid codes:
* signaling them with an exception that is thrown without returning the
converted sequence
* mapping them to a default valid replacement character (which should be
configurable)
* ignoring them (possibly returning a status saying that the
conversion was lossy).
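
Python's built-in error handlers "strict", "replace" and "ignore" already
cover these three modes, but they do not report whether the result was
lossy and the replacement is fixed per codec. A minimal sketch of a wrapper
adding both could look like this. The function name encode_with_mode and
its parameters are assumptions made for the example, and it encodes one
character at a time, so it is only suitable for stateless encodings such
as the single-byte code pages discussed here.

```python
def encode_with_mode(text, encoding, mode="strict", replacement="?"):
    """Encode text; return (bytes, lossy) according to the chosen mode.

    Hypothetical helper, assuming a stateless target encoding.
    """
    if mode not in ("strict", "replace", "ignore"):
        raise ValueError(f"unknown mode {mode!r}")
    if mode == "strict":
        # Raises UnicodeEncodeError and returns nothing on the first bad code.
        return text.encode(encoding, "strict"), False

    out = bytearray()
    lossy = False
    for ch in text:
        try:
            out += ch.encode(encoding, "strict")
        except UnicodeEncodeError:
            lossy = True
            if mode == "replace":
                # The replacement itself must be valid in the target encoding.
                out += replacement.encode(encoding, "strict")
            # mode == "ignore": drop the character silently
    return bytes(out), lossy


replaced = encode_with_mode("caf\u00e9 \u2603", "latin-1", "replace")
starred = encode_with_mode("caf\u00e9 \u2603", "latin-1", "replace", "*")
ignored = encode_with_mode("caf\u00e9 \u2603", "latin-1", "ignore")
```

Returning the lossy flag next to the data lets the caller decide whether
a degraded conversion is acceptable, instead of having to compare the
input and output afterwards.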

In addition, a codec could also work in a "tolerant" mode: when several
source codes are mapped to the same target code, one of them being
considered "canonical" and the others just "aliases", the conversion is
not exactly reversible if the source text contains one of these aliased
codes. When working in strict mode, these source codes could either be
signaled by an exception or converted while still indicating a lossy
result status. For some "standards", however, the encoding itself is
ambiguous (e.g. in legacy GSM encodings, which are still widely used, you
cannot distinguish between a Latin letter A, a Cyrillic letter A and
a Greek letter Alpha without first looking at the language code, which may
itself still be ambiguous).
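
To make the aliasing problem concrete, here is a small sketch with a
made-up three-entry table in which the source code 0xC1 is an alias of the
canonical code 0x41. The table, the function names and the exception class
are all hypothetical and do not correspond to any real code page. Of the
two strict-mode options mentioned above, this sketch picks the exception
and leaves the lossy status to the tolerant mode.

```python
# Hypothetical decoding table: 0xC1 is an alias of the canonical 0x41.
DECODE_TABLE = {0x41: "A", 0x42: "B", 0xC1: "A"}
# Canonical source code for each target character (used when encoding back).
CANONICAL = {"A": 0x41, "B": 0x42}


class AliasedCodeError(ValueError):
    """Raised in strict mode when a non-canonical source code is seen."""


def decode(data, strict=False):
    """Decode bytes; return (text, lossy). lossy means not round-trippable."""
    chars = []
    lossy = False
    for position, byte in enumerate(data):
        if byte not in DECODE_TABLE:
            raise ValueError(f"unmapped code 0x{byte:02X} at {position}")
        char = DECODE_TABLE[byte]
        if CANONICAL[char] != byte:
            if strict:
                raise AliasedCodeError(
                    f"aliased code 0x{byte:02X} at {position}")
            lossy = True  # tolerant mode: accept, but remember the loss
        chars.append(char)
    return "".join(chars), lossy


def encode(text):
    """Encode back using canonical codes only."""
    return bytes(CANONICAL[ch] for ch in text)


source = b"\x41\xc1\x42"
text, lossy = decode(source)      # tolerant mode
round_trip = encode(text)         # differs from source because of the alias
```

The lossy flag is exactly the information that the round trip will not
reproduce the original bytes, which the caller cannot see from the decoded
text alone.
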
Received on Sun Nov 18 2012 - 18:31:36 CST
