Re: DOS or Windows ASCII->Unicode converter

From: Doug Ewell (dewell@compuserve.com)
Date: Tue Jun 30 1998 - 02:26:34 EDT


Asmus Freytag <asmusf@ix.netcom.com> wrote:

> - Input data is 7-bit ASCII (or) 8-bit Latin-1 (ISO 8859-1).
> Either works.

Very true. I had ignored the possibility of 8859-1.

> This is recommended for the beginning of plain text files only, not
> generally for any and all strings. Also, if you translate a file in
> sections, you would not want to insert additional FEFFs.

Asmus is, of course, correct; you do not want to insert FEFF at the
beginning of every line of a text file. I was originally imagining
that the routine might be called once, to convert the entire file
in one gulp, but of course the more common approach would be to
call it for each line. This was a serious misuse of FEFF on my
part; it would have been better to ignore it completely.
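
A minimal sketch of the line-by-line approach, with FEFF written once
for the whole output rather than once per call. The function names
latin1_to_utf16 and convert_lines are hypothetical, not the routines
posted earlier in this thread, and the sketch assumes the caller has
sized the output buffer generously.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: widen Latin-1 bytes to 16-bit code units.
   Latin-1 byte values equal their Unicode code points, so the
   conversion is a plain zero-extension. No FEFF is written here. */
size_t latin1_to_utf16(const unsigned char *src, size_t len, uint16_t *dst)
{
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
    return len;
}

/* Hypothetical driver: convert a file's lines one call at a time.
   The byte order mark goes out exactly once, before the first line.
   Assumes dst has room for 1 + the total length of all lines. */
size_t convert_lines(const char *const *lines, size_t count, uint16_t *dst)
{
    size_t out = 0;
    dst[out++] = 0xFEFF;
    for (size_t i = 0; i < count; i++)
        out += latin1_to_utf16((const unsigned char *)lines[i],
                               strlen(lines[i]), dst + out);
    return out;
}
```

Keeping the FEFF in the driver rather than in the per-line routine is
the design point: the converter stays usable for arbitrary strings.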

> This is seemingly incorrect, since only the first FEFF in a file
> would be a true byte order mark. (Later ones are treated as ZERO
> WIDTH NO-BREAK SPACE). However, since we explicitly assume that
> the content of the stream is ASCII, any stray FEFFs would be
> errors, possibly caused by someone having naively appended two
> files. Since we can't represent them in ASCII, throwing all of
> them away is fine.

Exactly what I meant by quick and dirty, but not Unicode-compliant.
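
The quick-and-dirty behavior can be sketched roughly as follows. The
name utf16_to_latin1_strip_feff is hypothetical, and the '?' used for
code units above 0x00FF is only a placeholder assumption; the point
of the sketch is that every FEFF is dropped, wherever it appears.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: narrow 16-bit code units to Latin-1 bytes.
   Every U+FEFF is discarded, not just a leading one, which is
   convenient but not Unicode-compliant. Other code units above
   0x00FF become '?' here purely as a placeholder.
   Assumes dst has room for len bytes. Returns bytes written. */
size_t utf16_to_latin1_strip_feff(const uint16_t *src, size_t len,
                                  unsigned char *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        if (src[i] == 0xFEFF)
            continue;                      /* drop all FEFFs */
        dst[out++] = (src[i] <= 0x00FF) ? (unsigned char)src[i] : '?';
    }
    return out;
}
```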

> it might be worthwhile, even for such a simple piece of code to
> detect ALL characters *pSource > 0x00FF and either throw an
> exception, insert a '?' or SUB, or skip the characters (depending
> on what is most meaningful on the receiving end (for text '?' are
> great, for filenames they can be deadly ;-)))

For C code they can be deadly as well, since the preprocessor may
treat ??x sequences as trigraphs and silently replace them with a
different character.
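
To illustrate the trigraph hazard: on a compiler with trigraphs
enabled, two question marks followed by '!' inside a string literal
are replaced by '|' before the literal is even parsed. The sketch
below shows two common ways to write such text safely; the variable
and function names are made up for the example.

```c
#include <string.h>

/* Escaping the question marks keeps them from forming a trigraph. */
const char *escaped = "Unknown char\?\?!";

/* Splitting the literal also works: trigraph replacement happens
   before adjacent string literals are concatenated. */
const char *split = "Unknown char?" "?!";

/* Returns 1 when both spellings produce the same text. */
int same_text(void)
{
    return strcmp(escaped, split) == 0;
}
```

A converter that emits '?' into generated C source would need to apply
one of these spellings itself wherever two question marks end up
adjacent.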

I have written routines for myself to convert between Unicode and
8859-1, Windows 1252, and MS-DOS CP 437, and to handle characters
that have no equivalent in the 8-bit target set by outputting (1) a
question mark, (2) a 'close' character, such as 'A' instead of
'A tilde', or (3) the string '[U+xxxx]'. None of these solutions is
perfect for all situations; each has its own drawbacks.
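
The three fallbacks might look something like the sketch below. It is
not the actual code of those routines: the names fallback_t,
closest_ascii and utf16_to_ascii are hypothetical, the target is
assumed to be plain 7-bit ASCII for simplicity, and the 'close'
table holds only a handful of entries for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef enum { FALLBACK_QUESTION, FALLBACK_CLOSEST, FALLBACK_ESCAPE } fallback_t;

/* Tiny illustrative table of 'close' characters; far from complete. */
static char closest_ascii(uint16_t cp)
{
    switch (cp) {
    case 0x00C3: return 'A';   /* A tilde */
    case 0x00E3: return 'a';   /* a tilde */
    case 0x00D1: return 'N';   /* N tilde */
    case 0x00F1: return 'n';   /* n tilde */
    default:     return '?';   /* no close match known */
    }
}

/* Convert 16-bit code units to a NUL-terminated ASCII string of at
   most cap bytes (including the terminator). Conversion stops early
   if the next piece would not fit. Returns the string length. */
size_t utf16_to_ascii(const uint16_t *src, size_t len,
                      char *dst, size_t cap, fallback_t mode)
{
    size_t out = 0;
    if (cap == 0)
        return 0;
    for (size_t i = 0; i < len; i++) {
        char tmp[16];
        int n = 1;
        if (src[i] < 0x80)
            tmp[0] = (char)src[i];                     /* representable */
        else if (mode == FALLBACK_ESCAPE)
            n = snprintf(tmp, sizeof tmp, "[U+%04X]", (unsigned)src[i]);
        else if (mode == FALLBACK_CLOSEST)
            tmp[0] = closest_ascii(src[i]);
        else
            tmp[0] = '?';
        if (out + (size_t)n >= cap)
            break;                                     /* keep room for NUL */
        memcpy(dst + out, tmp, (size_t)n);
        out += (size_t)n;
    }
    dst[out] = '\0';
    return out;
}
```

The drawbacks show up directly: the question mark loses information,
the 'close' table is never complete, and the '[U+xxxx]' form makes
the output longer than the input.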

For ASCII or Latin-1 text, the present two routines do provide a
round-trip conversion to and from Unicode (which is a far cry from
saying they provide a *good* conversion). With any luck, however,
they may allow Vik to save his $900.

-Doug
