RE: Identifying file encoding scheme

From: Addison Phillips (AddisonP@simultrans.com)
Date: Thu Sep 09 1999 - 18:50:45 EDT


This file is *strictly* Latin-1 text??

Try dumping it as hex and then inserting \u00 in front of every pair of hex
digits. The native2ascii utility in the JDK (run with its -reverse option) will
then convert it to *any* other character set installed... one of which is UCS-2,
which you can then read in Notepad or elsewhere in the system as expected.
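
For anyone without a JDK at hand, the same hack can be sketched in a few lines of Python. This is only an illustration of the idea, not what native2ascii does internally; the function names are made up for this example, and it assumes the input really is strictly Latin-1.

```python
# Sketch of the "hex dump plus \u00" hack. Function names are hypothetical.

def latin1_to_escapes(data: bytes) -> str:
    # Each Latin-1 byte 0xXX becomes the escape \u00XX, which is the
    # hex dump with \u00 inserted in front of every pair of hex digits.
    return "".join(f"\\u00{b:02x}" for b in data)


def escapes_to_utf16le(escaped: str) -> bytes:
    # Turn the escapes back into characters, then write them as 16-bit
    # little-endian code units with a leading byte order mark (FF FE).
    text = escaped.encode("ascii").decode("unicode_escape")
    return b"\xff\xfe" + text.encode("utf-16-le")


if __name__ == "__main__":
    escaped = latin1_to_escapes(b"caf\xe9")
    print(escaped)
    print(escapes_to_utf16le(escaped).hex())
```

Because Latin-1 code points coincide with the first 256 Unicode code points, prefixing each byte with 00 is all the conversion there is.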

I just had to do this with an enormous Oracle text dump. It's a sick hack, but
it actually works. ::sigh::

Addison

----------------------------------
Addison P. Phillips
Director, Globalization Consulting
SimulTrans, LLC

+1 (650) 526-4652 (office phone)
http://www.simultrans.com (website)
mailto:AddisonP@simultrans.com (e-mail)

"22 languages. One release date."

-----Original Message-----
From: Montgomery Securities
Sent: Thu, 9 Sep 1999 14:11:49 -0700 (PDT)
To: unicode@unicode.org
Subject: RE: Identifying file encoding scheme

The side discussions that I have had with some of the very helpful members
of this list have led to some conclusions:

1) Windows NT uses some extra heuristics in addition to simply looking for
the signature at the beginning of the file to identify a file as
"Unicoded". These heuristics can, in the case of files that have textual
contents that are highly repetitive, lead to misidentification.
Unfortunately for me, database exports can sometimes be highly repetitive.

2) This is a problem with Windows NT, not Notepad. If you create a file
that confuses NT's Unicode detection algorithm and copy it with the command
"type confused.txt > confuse2.txt", the file "confuse2.txt" is half the
length of "confused.txt". For plain text files, the lengths shouldn't
differ at all. Since "type" is not a program but a built-in function of the
command shell ("cmd.exe"), I conclude that the problem lies in NT itself
and not in any particular application.

3) For Perl programmers, this program will generate a file that will
confuse NT:

# Write two identical lines: one "A" followed by 100 copies of "AAAB".
open(OUTFILE, ">c:\\confused.txt") or die("cannot open file: $!\n");
$c1 = "A";
$c2 = "B";
$line = $c1 . ((($c1 x 3) . $c2) x 100) . "\n";
print OUTFILE $line x 2;
close(OUTFILE);

The file looks something like this:
AAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAAB
AAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAABAAAB

Two lines, each 401 characters long (not including the CR/LF),
consisting of one "A" followed by 100 "AAAB"s. There are many other
variations on this that will confuse NT, but this one is fairly easy to
create. You can even type it into Notepad by hand, save it to disk, and
then try to read it right back in. It is a perfectly normal text file to
everyone but Microsoft. In an NT command shell, typing the file shows only
a bunch of ?'s.
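
The halved length and the ?'s are at least consistent with the bytes being read as 16-bit units and then written back out in an 8-bit code page. The Python sketch below illustrates that reading only; it is an assumption about what is happening, not NT's actual code, and the function names are invented for the example.

```python
# Sketch: rebuild the confusing file's bytes and reinterpret them as
# 16-bit little-endian code units. This illustrates one possible
# explanation; it is not taken from NT itself.

def build_confusing_bytes() -> bytes:
    # One "A", then 100 copies of "AAAB", then CR/LF; two such lines.
    line = "A" + "AAAB" * 100 + "\r\n"
    return (line * 2).encode("ascii")


def reinterpret_as_utf16le(data: bytes) -> str:
    # Pair up the bytes as if the file were little-endian UCS-2.
    return data.decode("utf-16-le")


if __name__ == "__main__":
    data = build_confusing_bytes()
    text = reinterpret_as_utf16le(data)
    # Characters with no CP1252 equivalent are replaced by "?".
    back = text.encode("cp1252", errors="replace")
    print(len(data), len(text), len(back))
    print(back[:20])
```

The file is 806 bytes (two lines of 401 characters plus CR/LF), which pairs up into 403 code units. None of them has a CP1252 equivalent, so writing them back in that code page yields 403 question marks, half the original length.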

Does anyone know of any Unicode detection heuristics that are currently in
use by any software packages? This might help me rewrite the program that
exports the data in a way that won't confuse NT.
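
For reference, the signature check mentioned in point 1 is the simple part. The Python sketch below covers only that check and says nothing about NT's extra heuristics, which are unknown here; `sniff_signature` is a hypothetical name used for illustration.

```python
# Sketch of signature (byte order mark) detection only. It does not
# reproduce any additional heuristics NT may apply.
import codecs
from typing import Optional


def sniff_signature(data: bytes) -> Optional[str]:
    # FF FE marks little-endian 16-bit text, FE FF big-endian.
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    # No signature: the bytes alone do not settle the encoding.
    return None


if __name__ == "__main__":
    print(sniff_signature(b"\xff\xfeA\x00"))
    print(sniff_signature(b"AAAABAAAB"))
```

A file like the one above carries no signature, so any "Unicode" verdict on it has to come from guessing at the content.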

Thanks!

Michael Krebs

Michael Everson scripsit:

> Does that mean this e-mail confuses MS software?

Seemingly not: Windows NT 4.0 Notepad treats it resolutely as CP1252,
and I don't know why or how. Conversely, the oddball ASCII file
is treated as UCS-2, and I don't understand that either.

Microsoft folks?

--
John Cowan                                   cowan@ccil.org
       I am a member of a civilization. --David Brin




This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:51 EDT