Re: Unicode and end users - UTF-8B

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Tue Feb 19 2002 - 13:56:20 EST

Previous message: Eric Muller: "Re: FW: list of abbreviated character names"
In reply to: Lars Kristan: "RE: Unicode and end users"
Next in thread: Lars Kristan: "RE: Unicode and end users"

Lars Kristan wrote:

> ...
> The same thing should work the other way around, store Windows filenames
> directly into a UTF-16 database and use UTF-8 => UTF-16 conversion for UNIX
> filenames. Hoping that some day most of the data will be UTF-8 makes this
> even more appealing. As for any data that is not - well, the original byte
> sequence can be reconstructed and a re-conversion can be done based on
> user's settings (or selection) at display time. All you need is UTF-8B
> conversion instead of UTF-8.

I have seen this technique before! :-)

EBCDIC databases have long (20 years?) had the notion of "roundtrip conversions" for interoperability with ASCII codepages.
They did not formally create new codepages (as UTF-8B would be a new encoding) but just "abused" the normal EBCDIC codepage by using a special mapping table.

Such a special "roundtrip" mapping table is a full permutation of an originator codepage (ASCII-based) onto the database codepage (EBCDIC family).
This works best with 8-bit single-byte codepages on both ends; otherwise the originator codepage must have no more valid codes than the database one. (That used to be the case, because few ASCII codepages came close to the 36000-some codes that EBCDIC-stateful codepages could express.)

As a full permutation, characters are mapped faithfully if they exist in both codepages, but other characters' codes are mapped arbitrarily and _reversibly_. So a (TM) symbol in an ASCII-family codepage may (for example) be mapped to a Delete control in the EBCDIC-family database codepage; it's preserved because when the client retrieves data, the Delete control gets mapped back to (TM).
This is apparently like UTF-8B, where the roundtrip of arbitrary bytes through UTF-8B and back preserves the original bytes.
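
A minimal Python sketch of such a roundtrip table, for illustration only. It assumes cp1252 as the originator codepage and cp037 as the database codepage (both are my choices, not from any particular product), and the helper name build_roundtrip_tables is hypothetical. Codes whose characters exist in both codepages are mapped faithfully; the leftover originator codes are paired with the leftover database codes in ascending order, which is one arbitrary but reversible choice among many.

```python
def build_roundtrip_tables(originator="cp1252", database="cp037"):
    """Return (to_db, from_db) 256-byte tables forming a full permutation."""
    to_db = [None] * 256
    used = set()
    # Pass 1: faithful mappings for characters present in both codepages.
    for b in range(256):
        try:
            ch = bytes([b]).decode(originator)
            target = ch.encode(database)
        except UnicodeError:
            continue  # undefined in originator, or absent from database
        if len(target) == 1 and target[0] not in used:
            to_db[b] = target[0]
            used.add(target[0])
    # Pass 2: pair the remaining originator codes with the spare database
    # codes. The pairing is arbitrary; it only has to be reversible.
    spare = (d for d in range(256) if d not in used)
    for b in range(256):
        if to_db[b] is None:
            to_db[b] = next(spare)
    # Invert the permutation for the database -> originator direction.
    from_db = [0] * 256
    for b, d in enumerate(to_db):
        from_db[d] = b
    return bytes(to_db), bytes(from_db)


TO_DB, FROM_DB = build_roundtrip_tables()

original = "Widget\u2122".encode("cp1252")   # ends with the (TM) code
stored = original.translate(TO_DB)           # what the database holds
restored = stored.translate(FROM_DB)         # what the client gets back
```

Here restored equals original, even though the last byte of stored is not a (TM) in cp037 at all; under these assumptions it lands on one of the cp037 control codes.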

As users learn these days, problems come in when the data in the database codepage is used outside the closed system with the two co-dependent codepages.
Printing from the database, conversion from the database codepage to codepages other than the originator/client one, and conversion/migration to Unicode can be a nightmare and may require first converting back to the originator codepage.
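
The same effect can be sketched with arbitrary bytes and UTF-8. The example below uses Python's "surrogateescape" error handler as a stand-in for a UTF-8B-style conversion; that it matches the UTF-8B discussed in this thread in every detail is an assumption, and the function names are hypothetical. Inside the closed system the bytes survive the roundtrip; a strict UTF-8 encoder outside it rejects the result.

```python
def to_text(raw: bytes) -> str:
    # Invalid bytes are smuggled through as lone surrogates (U+DC80..U+DCFF).
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    # The matching reverse step turns those surrogates back into raw bytes.
    return text.encode("utf-8", errors="surrogateescape")


def fails_outside(text: str) -> bool:
    # A consumer that knows nothing about the convention uses strict UTF-8.
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


raw = b"caf\xe9.txt"      # a Latin-1 filename, not valid UTF-8
text = to_text(raw)

roundtrip_ok = to_bytes(text) == raw
breaks_elsewhere = fails_outside(text)
```

Both flags come out true: the roundtrip is faithful, and the intermediate string is not something a general-purpose UTF-8 consumer will accept.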

It is certainly legitimate to solve certain problems by constructing such roundtrip-faithful mappings.
I don't think that results of such mappings should be advertised as general-purpose encodings. They are just abuses of regular encodings, useful in closed systems and for particular circumstances.

markus

This archive was generated by hypermail 2.1.2 : Tue Feb 19 2002 - 13:34:23 EST