Re: DOS or Windows ASCII->Unicode converter

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Mon Jun 29 1998 - 14:13:13 EDT


At 10:10 PM 6/28/98 -0700, Doug Ewell wrote:
>This is quick and dirty, and certainly not fully Unicode-compliant,
>so please don't anybody write to tell me so.

I hope you'll let me add a thought or two anyway. ;-)

>Assumptions:

- Input data is 7-bit ASCII or 8-bit Latin-1 (ISO 8859-1). Either works.

>- UINT is a 16-bit unsigned integer.
>- Sufficient space has been allocated for both source and
> destination strings in the calling routine.

Allocating: for ASCII->Unicode the length calculation for pDest is
        size_t len = sizeof(UINT) * (strlen(pSrc) + 1);
plus one more UINT if the routine also writes a byte-order mark, as the one
quoted here does.
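
A minimal sketch of that calculation, in case it helps. The helper name
unicode_buffer_size and the with_bom flag are my own illustration, not part
of the posted code, and UINT is assumed to be uint16_t here:

```c
#include <stdint.h>
#include <string.h>

typedef uint16_t UINT;  /* assumption: the 16-bit unsigned type from the post */

/* Hypothetical helper: bytes needed for pDest given a source string.
   with_bom is nonzero when the converter writes a leading 0xFEFF. */
size_t unicode_buffer_size(const char *pSource, int with_bom)
{
    size_t units = strlen(pSource) + 1;  /* characters plus terminating zero */

    if (with_bom)
        units += 1;                      /* room for the byte-order mark */

    return sizeof(UINT) * units;
}
```

A caller would pass the result to malloc() and check the return value before
converting.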

>-----8<-----cut here-----8<-----cut here-----8<-----
>
>/* converts a null-terminated ASCII string to Unicode */
>void ascii_to_unicode(char *pSource, UINT *pDest)
>{

The byte-order mark inserted on the next line is recommended only at the
beginning of plain text files, not for every string in general. Also, if you
translate a file in sections, you would not want to insert additional FEFFs.

> *pDest++ = 0xFEFF; /* insert byte-order mark */
> for ( ;; )
> {
> *pDest = (UINT) *pSource;
>
> if (*pDest == 0)
> break;
>
> pSource++;
> pDest++;
> }
>}
>
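
One way the points above might be folded into the routine, as a sketch only.
The name latin1_to_unicode and the write_bom parameter are hypothetical, and
UINT is assumed to be uint16_t. The cast through unsigned char is there
because plain char is signed on many compilers, so a Latin-1 byte such as
0xE9 would otherwise sign-extend to 0xFFE9:

```c
#include <stdint.h>

typedef uint16_t UINT;  /* assumption: the 16-bit unsigned type from the post */

/* Converts a null-terminated ASCII or Latin-1 string to 16-bit Unicode.
   write_bom (hypothetical parameter): nonzero only for the first section
   of a plain text file, zero for ordinary strings and later sections. */
void latin1_to_unicode(const char *pSource, UINT *pDest, int write_bom)
{
    if (write_bom)
        *pDest++ = 0xFEFF;               /* byte-order mark */

    for (;;) {
        /* go through unsigned char so 0x80..0xFF do not sign-extend */
        *pDest = (UINT)(unsigned char)*pSource;

        if (*pDest == 0)
            break;

        pSource++;
        pDest++;
    }
}
```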
>/* converts a null-terminated Unicode string to ASCII */
>void unicode_to_ascii(UINT *pSource, char *pDest)
>{
> for ( ;; )
> {
> while (*pSource == 0xFEFF) /* skip byte-order mark */
> pSource++;

This looks incorrect at first, since only the first FEFF in a file would be a
true byte-order mark. (Later ones are treated as ZERO WIDTH NO-BREAK SPACE.)
However, since we explicitly assume that the content of the stream is ASCII,
any stray FEFFs would be errors, possibly caused by someone having naively
appended two files. Since we can't represent them in ASCII, throwing all of
them away is fine.

> {
> *pDest = (char) *pSource;
> pSource++;
> }

It might be worthwhile, even for such a simple piece of code, to detect
ALL characters *pSource > 0x00FF and either throw an exception, insert a '?'
or SUB, or skip the characters, depending on what is most meaningful on the
receiving end. (For text, '?' is great; for filenames it can be deadly ;-))

>
> if (*pDest == 0)
> break;
>
> pDest++;
> }
>}
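
A rough sketch of the reverse direction with both suggestions applied: every
FEFF is dropped, and anything above 0x00FF is replaced or skipped. The name
unicode_to_latin1, the substitute parameter, and the returned count are my
own assumptions for illustration, and UINT is again assumed to be uint16_t:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UINT;  /* assumption: the 16-bit unsigned type from the post */

/* Converts a null-terminated 16-bit Unicode string to ASCII/Latin-1.
   substitute (hypothetical parameter): character written for anything
   above 0x00FF, e.g. '?' or SUB (0x1A); pass 0 to skip such characters.
   Returns how many characters could not be represented. */
size_t unicode_to_latin1(const UINT *pSource, char *pDest, char substitute)
{
    size_t unmappable = 0;

    for (; *pSource != 0; pSource++) {
        if (*pSource == 0xFEFF)
            continue;                    /* drop every byte-order mark */

        if (*pSource > 0x00FF) {
            if (substitute != 0)
                *pDest++ = substitute;   /* replace what cannot be stored */
            unmappable++;
        } else {
            *pDest++ = (char)(unsigned char)*pSource;
        }
    }

    *pDest = 0;                          /* terminate the output string */
    return unmappable;
}
```

Returning the count leaves it to the caller to decide whether a lossy
conversion is acceptable, which matters for things like filenames.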
>
>


