Yes, all characters in the file are 8-bit encoded, and all of them fall
in the 7-bit range, with no unprintable characters except for CR/LF.
I've found that I can simply insert a space at the end of the first line
using WordPad, and then BCP will load the data into the database just fine:
it ignores trailing spaces, and the extra space pushes the file outside the
heuristic's idea of what a Unicode file should look like.
That may seem like "problem solved," except that the database gets loaded at
3am, and I'm not ready to be at work at 3am every day to run WordPad! The
same applies to running Native2Ascii: it is a manual process.
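The trailing-space workaround described above is easy to automate so nobody
has to be awake at 3am. Here is a minimal sketch (the function name and the
file name are my own, not from this thread) that appends one space to the
first line, binary-safe, and could be scheduled to run before the BCP load:

```python
def add_trailing_space(path):
    """Append one space to the first line of the file at `path`.

    BCP ignores the trailing space, but it changes the byte layout
    enough to fall outside the Unicode-detection heuristic.
    """
    with open(path, "rb") as f:
        data = f.read()
    # Find the end of the first line (just before its CR/LF terminator).
    eol = data.find(b"\r\n")
    if eol == -1:
        eol = len(data)
    with open(path, "wb") as f:
        f.write(data[:eol] + b" " + data[eol:])
```

Calling `add_trailing_space("export.dat")` (a hypothetical export file name)
from a scheduled job would replace the manual WordPad step.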
My ideal solution to this Unicode puzzle is to try to figure out what NT
might be looking for when deciding the encoding scheme, and export the data
in a way that is guaranteed to not look like Unicode.
So far, I believe the heuristics involve line lengths being all or mostly
even, or all or mostly odd. (All but 52 lines out of 9818 in my export have
even lengths - what's the probability of that!) They also seem to involve
repetitive strings of mostly alternating characters. (My file is pipe
delimited, with lots of columns with 0 values, so there are long stretches
that look like "0|0|0|0.0|0.0|0|09091999|0.0|160|0|0|0.0|", etc.) Beyond
these, I'm not really sure what else NT might be looking for.
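Gathering the line-length parity statistics mentioned above is easy to do
mechanically rather than by hand. This is just a measurement sketch (the
helper name is mine); it says nothing about what NT's detection actually
checks, which remains a guess in this thread:

```python
def parity_stats(path):
    """Count how many lines in the file have even vs. odd lengths.

    Mostly-even line lengths mean the file is largely a sequence of
    byte pairs, which may be what makes it look like 16-bit Unicode.
    """
    with open(path, "rb") as f:
        lines = f.read().splitlines()   # splits on CR/LF
    even = sum(1 for ln in lines if len(ln) % 2 == 0)
    return even, len(lines) - even
```

Running this over the 9818-line export would reproduce the 52-odd-lines
count without manual inspection.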
This file is *strictly* Latin-1 text??
Try dumping it as hex and then inserting \u00 in front of every pair of
hex digits. The Native2Ascii utility in the JDK will then convert it to
*any* other character set installed... one of which is UCS-2, which you can
then read in Notepad or use elsewhere in the system as expected.
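The hex-dump step above can be sketched directly: each Latin-1 byte becomes
a \u00XX escape, which is the form that \uXXXX-aware tools such as
Native2Ascii can then convert onward to UCS-2 or another target encoding.
A rough equivalent (the function name is my own, for illustration):

```python
def to_u_escapes(data):
    """Turn each Latin-1 byte into a \\u00XX escape sequence.

    Iterating over a bytes object yields integers 0-255, so each
    byte maps onto the low range of Unicode code points directly.
    """
    return "".join("\\u%04x" % b for b in data)
```

The round trip back is just `escaped.encode("ascii").decode("unicode_escape")`,
which recovers the original Latin-1 text.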
I just had to do this with an enormous Oracle text dump. It's a sick hack,
it actually works. ::sigh::
Addison P. Phillips
Director, Globalization Consulting
+1 (650) 526-4652 (office phone)
"22 languages. One release date."
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:51 EDT