Re: Long-term archiving of electronic text documents

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Tue, 29 Jan 2013 09:07:12 +0100

2013/1/29 Jim Breen <jimbreen_at_gmail.com>:
> William_J_G Overington <wjgo_10009_at_btinternet.com> wrote:
>
>> The idea is that there would be an additional UTF format, perhaps UTF-64,
>> so that each character would be expressed in UTF-64 notation using 64 bits,
>> thus providing error checking and correction facilities at a character level.
>
> Error detection and correction at the character level is considered
> very old-fashioned now. Modern techniques such as Reed-Solomon
> codes[1] are much more effective and involve much less overhead
> than the 100% in the proposal above. Such techniques are already
> used in modern disc storage[2], and when combined with RAID
> techniques[3] provide better data protection than character-level
> redundancy ever would.
>
> In any case, I think issues of error detection and correction are
> quite outside the scope of Unicode.

Fully agree! Character encodings should not depend at all on any
error correction mechanisms: those are out of scope and will be
implemented in whatever upper- or lower-level mechanisms fit the
underlying transport/storage/protocol/application infrastructure.
There will be NO one-size-fits-all solution, as the sources of errors
and how they are distributed depend completely on these external
mechanisms.

And anyway, character-level error correction is really inefficient and
not even the best way to prevent or correct errors. A much better
system will handle them at the document level (for file systems or web
requests), or at the block level (e.g. RAID, or P2P distribution such
as torrents), or at the datagram level (over networks). How you
recover from these errors also depends fully on the existence or
non-existence of a return path and protocol: if such a protocol exists
and is available, it will be MORE effective to use it to correct the
errors (by requesting the data again). Autocorrection of errors is
there for when there's no return path and no alternate path (like
mirrors). But these mechanisms also depend on the strength of security
required (against malicious alterations), and security is also a
domain where data integrity is in constant evolution (in the
algorithms used).
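
To illustrate what "handling errors at the document level" can look
like, here is a minimal sketch, assuming the storage layer keeps a
SHA-256 digest next to the document. The names (digest, stored,
recorded, damaged) are mine, and this only shows detection, not
correction: once a mismatch is found, recovery would go through a
return path or a mirror as described above.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of a stored or transmitted byte stream."""
    return hashlib.sha256(data).hexdigest()


text = "Long-term archiving: \u00e9\u4e2d\u0416"
stored = text.encode("utf-8")   # the encoded text itself carries no redundancy
recorded = digest(stored)       # kept alongside the document by the storage layer

# Simulate a single flipped bit somewhere in storage.
corrupted = bytearray(stored)
corrupted[0] ^= 0x01
damaged = bytes(corrupted)

intact_ok = digest(stored) == recorded    # the untouched copy verifies
damaged_ok = digest(damaged) == recorded  # the damaged copy does not
```

Note that the character encoding plays no part in the check: the digest
is computed over bytes, whatever they encode.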

This means that the long-term conservation of documents MUST NOT
depend on any of these mechanisms: Unicode MUST remain a standalone
"black box" not working at any other level than plain text, and
offering absolutely NO security mechanisms and NO error recovery
mechanisms. For this reason, it has to define a standard interface
(the code points, a few standardized UTF encodings, plus the
character encoding model), and nothing else. Any attempt to mix in
other standards will in fact make the standard MUCH LESS reliable for
long-term conservation of documents and will create new
interoperability problems.

By staying neutral about all other technologies, the Unicode standard
will remain adaptable to all situations and we'll have the maximum
interoperability with ALL security and error correction mechanisms,
which will be specifically tuned to perform the BEST in their OWN
context of use (storage, transport, integrity, security). And this
will NEVER prevent those mechanisms from implementing their own local
reencoding, as long as the sequences of code points are preserved
(some of them will require the plain text to be normalized, and it is
generally acceptable, according to the standard, that these
*conformant* processes preserve *at least* the canonical
equivalences).
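
As a small sketch of what preserving canonical equivalence means in
practice, assuming Python's standard unicodedata module: the two
strings below differ as code point sequences, but a process that
normalizes them still preserves their canonical equivalence. The
variable names are only for illustration.

```python
import unicodedata

composed = "\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# As sequences of code points, the two strings are different.
same_code_points = composed == decomposed

# After normalization to the same form, they compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", composed)
             == unicodedata.normalize("NFD", decomposed))
```

A process that stores the NFC form of either string therefore changes
the code points of one of them, but not its canonical identity.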

Some of them will perform compatibility mappings, but for Unicode,
these processes are *lossy*: they should not be used for preserving
data integrity or security, and cannot be used for safe error
correction mechanisms (this includes corrections by spelling
checkers, which make guesses and remap some distinct characters to
other ones, e.g. changing dashes into ASCII hyphens).
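
A short sketch of why compatibility mappings are lossy, again assuming
Python's unicodedata module and using NFKC as the example mapping: two
distinct texts collapse to the same result, so the original can no
longer be recovered from the output.

```python
import unicodedata

original = "x\u00b2 \ufb01n"   # U+00B2 SUPERSCRIPT TWO, U+FB01 LATIN SMALL LIGATURE FI
plain = "x2 fin"               # a different text, using only ASCII

folded_original = unicodedata.normalize("NFKC", original)
folded_plain = unicodedata.normalize("NFKC", plain)

# Both inputs end up identical: the superscript and the ligature are gone.
collapsed = folded_original == folded_plain
```

Nothing in the folded text says whether the source had a superscript
two or a plain digit, which is exactly what makes such a mapping
unsuitable for integrity checks.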

Some transport protocols will not support the preservation of Unicode
texts: this is the case of SMS over basic GSM networks. These
networks do not provide any data integrity mechanism except for a very
reduced subset of the standard (for example, they do not preserve Greek
or Cyrillic texts because they encode the Latin letter A, the Cyrillic
letter A and the Greek letter Alpha the same way, assuming that the
effective interpretation will be made according to a language
indicated somewhere else in the transport, or assumed by the transport
network or by client device settings). Once again, as long as these
processes or protocols remain within the context of use in which
they are operating, they are safe and can be secured, but they can't
leave it and stay interoperable with other networks or storage (notably
if the language indicator is lost, knowing that Unicode itself does
not encode languages directly, as they are out of scope of the
standard).
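
The effect can be sketched with a purely hypothetical fold table. This
is NOT the real GSM alphabet, only an illustration of a transport that
shares one code between look-alike letters; LOOKALIKE_FOLD and
send_over_lossy_transport are invented names.

```python
# Hypothetical fold table, for illustration only: NOT the real GSM alphabet.
LOOKALIKE_FOLD = {
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
}


def send_over_lossy_transport(text: str) -> str:
    """Replace look-alike letters by the single code the transport shares."""
    return "".join(LOOKALIKE_FOLD.get(ch, ch) for ch in text)


latin, greek, cyrillic = "A", "\u0391", "\u0410"

# Three distinct code points go in, only one distinct value comes out.
received = {send_over_lossy_transport(s) for s in (latin, greek, cyrillic)}
```

Without an external language indicator, the receiver cannot tell which
of the three letters was sent.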

So let's focus just on the Standard itself, not extending its scope of
use. It's not up to TUS to regulate what other protocols will do.
Unicode just describes some conformance levels:
- level 1: preservation of code points
- level 2: preservation of canonical equivalences
- level 3: preservation of compatibility equivalences
- level 4: all other conformant processes (including for example
  "best-fit" reencoders, or transliterators, or text-to-speech
  renderers).
But this never means that other protocols HAVE to support one of
these standardized profiles. If we need long-term preservation of
documents, though, only protocols and encodings preserving level 1
should be used. So we'll still need external protocols that perform
well at this level. The Unicode UTFs operate at level 1, so they are
safe to use when building conforming interfaces between various
external processes or protocols, for data transmission or storage,
with maximum interoperability between all these level-1 conformant
processes, independently of what they effectively transmit or store.
But if these external processes fail at preserving level 1, they MUST
NOT say that they conform to this level, and must not tolerate ANY
deviation from it coming from another, higher level.
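
As a minimal sketch of what operating at "level 1" means for the UTFs,
assuming Python's built-in codecs: a round trip through UTF-8, UTF-16
or UTF-32 returns exactly the same sequence of code points, including
an unnormalized combining sequence and a supplementary-plane character.
The helper code_points and the sample string are only illustrative.

```python
from typing import List


def code_points(text: str) -> List[int]:
    """Return the sequence of code points of a string."""
    return [ord(ch) for ch in text]


# Latin A, Greek Alpha, Cyrillic A, a decomposed e-acute, and U+1F600.
sample = "A\u0391\u0410 e\u0301 \U0001F600"

# Encode and decode with each UTF; nothing is normalized or remapped.
round_trips = {
    name: sample.encode(name).decode(name)
    for name in ("utf-8", "utf-16", "utf-32")
}

all_preserved = all(
    code_points(result) == code_points(sample)
    for result in round_trips.values()
)
```

The byte sequences differ between the three encodings, but the code
point sequence that comes back is the same in each case.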
Received on Tue Jan 29 2013 - 02:13:10 CST
