RE: Identifying file encoding scheme

From: Krebs, Mike (MKrebs@bofasecurities.com)
Date: Mon Sep 13 1999 - 22:59:54 EDT


Luckily you told that to me instead of Microsoft. THEY would have charged
you to fix their mistake.

:-)

Seriously, I think Microsoft meant to imply that most people consider the
hex bytes FE FF 41 0D 0A 1A to be the sequence you were talking about. But
since checking for the signature is a separate test in the list you can
choose from, they didn't include it in the statistics-based test. It
seems that some applications are choosing to forgo the
IS_TEXT_UNICODE_SIGNATURE test, which checks for the FE FF, and just use the
statistical analysis. Personally, I think this is a bonehead play, but one
could make the case that just because a file loses the first two bytes off
the front doesn't mean that the entire file should suddenly become garbage.
Of course, my file didn't lose any bytes, and it suddenly became garbage
anyway. Maybe retrofitting Unicode onto an ASCII system was the bonehead
play; by that I mean not choosing one or the other and sticking with it.
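
To make the difference between the two tests concrete, here is a minimal
sketch in Python rather than the Win32 API. It is not how IsTextUnicode()
is implemented; the helper names has_utf16_signature and
as_utf16_code_units are made up for illustration. It only shows that a
signature check rejects the four ASCII bytes, while pairing the same bytes
into 16-bit units yields values that look like Unicode code points.

```python
import codecs


def has_utf16_signature(data: bytes) -> bool:
    """Return True if data starts with a UTF-16 byte order mark.

    Hypothetical helper: accepts either byte order (FE FF or FF FE).
    """
    return data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))


def as_utf16_code_units(data: bytes, byteorder: str) -> list:
    """Pair bytes into 16-bit units and format them as U+XXXX strings.

    byteorder is "big" or "little"; a trailing odd byte is ignored.
    """
    units = [
        int.from_bytes(data[i:i + 2], byteorder)
        for i in range(0, len(data) - 1, 2)
    ]
    return ["U+%04X" % u for u in units]


# The ASCII string from the quoted message: A, CR, LF, Ctrl-Z.
ascii_text = b"A\r\n\x1a"

print(has_utf16_signature(ascii_text))               # False
print(has_utf16_signature(b"\xfe\xff" + ascii_text)) # True
print(as_utf16_code_units(ascii_text, "big"))        # ['U+410D', 'U+0A1A']
print(as_utf16_code_units(ascii_text, "little"))     # ['U+0D41', 'U+1A0A']
```

The big-endian reading gives the same pair of code points discussed in the
quoted message below, which is why a purely statistical test has nothing
structural to reject here.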

On another note, you work for Sybase, so you're in a unique position to
answer the following question: In the NT version of BCP, is there a call to
the function I mentioned, IsTextUnicode(), in the portion that imports
pipe-delimited files? If so, which tests are specified in the arguments to
this function?

Thanks!

Michael Krebs
Bank of America Securities

> -----Original Message-----
> From: kenw@sybase.com [SMTP:kenw@sybase.com]
> Sent: Monday, September 13, 1999 9:21 PM
> To: Unicode List
> Cc: unicode@unicode.org
> Subject: RE: Identifying file encoding scheme
>
>
> > if lpBuffer points to
> > the ASCII string 0x41, 0x0A, 0x0D, 0x1D (A\n\r^Z), the string passes the
> > IS_TEXT_UNICODE_STATISTICS test, though failure would be preferable.
>
> Picking nits, I presume you mean:
>
> 0x41, 0x0D, 0x0A, 0x1A (A\r\n^Z)
>
> (Starting with Unicode 3.0, U+410D U+0A1A actually is a valid sequence of
> two assigned Unicode characters: U+410D is a *very* rare alternate for
> the "common" form U+8721, referring to a year-end festival of the Zhou
> dynasty. U+0A1A is the GURMUKHI LETTER CA. Not exactly a likely
> combination in real text, I warrant.)
>
> --Ken


