Re: UTF16 <=> Reuters format?

From: Roman Czyborra (czyborra@cs.tu-berlin.de)
Date: Wed Sep 30 1998 - 09:38:58 EDT


Dear Arnt:

> > ftp://ftp.unicode.org/Public/PROGRAMS/CVTUTF/
> I'm looking for something related: reference code or the algorithm
> for converting between UTF16 and the compact Reuters format, which
> I've heard is either part of Unicode 2.1 or scheduled to become part
> of Unicode. Is that available anywhere?

Go one directory up and take the exit "SCSU".

ftp://ftp.unicode.org/Public/PROGRAMS/SCSU/ contains a sample Java
implementation which can convert text files back and forth between
UTF-16 and SCSU.

I think that initialDynamicOffset[1] in SCSU.java has to be changed
from 0x0100, // Latin Extended A
into 0x00C0, // combined partial Latin-1/-A
to align it with http://www.unicode.org/unicode/reports/tr6.html
but I haven't heard the final word on this.
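
To make concrete what that one value affects: in SCSU single-byte mode, a
byte in the range 0x80..0xFF stands for the character at the active dynamic
window's offset plus (byte - 0x80). Below is a minimal Python sketch of that
mapping. It assumes the initial dynamic window offsets as I read them in
TR6; the names INITIAL_DYNAMIC_OFFSET and window_char are made up for this
illustration and are not taken from SCSU.java.

```python
# Initial dynamic window offsets, assumed from my reading of TR6
# (please check them against the report before relying on them).
INITIAL_DYNAMIC_OFFSET = [
    0x0080,  # Latin-1 Supplement
    0x00C0,  # combined partial Latin-1 / Latin Extended-A
    0x0400,  # Cyrillic
    0x0600,  # Arabic
    0x0900,  # Devanagari
    0x3040,  # Hiragana
    0x30A0,  # Katakana
    0xFF00,  # Fullwidth ASCII
]


def window_char(window: int, byte: int) -> str:
    """Map a single-byte-mode byte 0x80..0xFF through a dynamic window."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF go through a dynamic window")
    return chr(INITIAL_DYNAMIC_OFFSET[window] + (byte - 0x80))


if __name__ == "__main__":
    # With offset 0x00C0, byte 0x80 in window 1 is U+00C0;
    # with the old value 0x0100 it would have been U+0100.
    print(hex(ord(window_char(1, 0x80))))
```

With 0x0100 in slot 1, the same compressed byte would decode to a different
character, so compressor and expander have to agree on this table.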

The SCSU/*.java user interface is a bit object-oriented:

        $ cd SCSU
        $ javac *.java
        $ echo test > test.csu
        $ java CompressMain /expand test.csu < /dev/null
        Expanded test.csu: 5 bytes to test.txt 5 chars. Ratio: 200%.
        Done. Press enter to exit
        $ od -t x1 test.txt
        0000000 fe ff 00 74 00 65 00 73 00 74 00 0a
        0000014
        $ java CompressMain /compress test.txt < /dev/null
        Compressed test.txt: 5 chars to test.csu 5 bytes. Ratio: 50%.
        Done. Press enter to exit
        $ od -t x1 test.csu
        0000000 74 65 73 74 0a
        0000005

If you also take a look at http://czyborra.com/scsu/ you will find a
simpler decoder, http://czyborra.com/scsu/scsu.c, written in C. It
translates SCSU on standard input into UTF-8 on standard output, so it
neither requires you to store the text in files nor depends on
particular file name extensions:

        $ echo schön | scsu
        schÃ¶n
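
The example works because plain Latin-1 text is already valid SCSU: ASCII
bytes pass through unchanged, and bytes 0x80..0xFF are read through the
initially active dynamic window 0 at offset 0x0080. The following Python
sketch covers only that subset and rejects every tag byte; it is not a
port of scsu.c, and the function name expand_simple_scsu is invented here.

```python
# Bytes that stand for themselves in SCSU single-byte mode.
PASS_THROUGH = {0x00, 0x09, 0x0A, 0x0D} | set(range(0x20, 0x80))
WINDOW0_OFFSET = 0x0080  # offset of the initially active dynamic window


def expand_simple_scsu(data: bytes) -> str:
    """Expand SCSU that uses no tags at all (ASCII plus window 0)."""
    out = []
    for b in data:
        if b in PASS_THROUGH:
            out.append(chr(b))
        elif b >= 0x80:
            out.append(chr(WINDOW0_OFFSET + (b - 0x80)))
        else:
            # 0x01..0x1F (except tab, LF, CR) are SCSU tags in this mode.
            raise ValueError(f"SCSU tag byte 0x{b:02X} not handled in this sketch")
    return "".join(out)


if __name__ == "__main__":
    # Latin-1 "schön\n" in, UTF-8 bytes out.
    print(expand_simple_scsu(b"sch\xf6n\n").encode("utf-8"))
```

The byte 0xF6 becomes U+00F6, which UTF-8 encodes as the two bytes C3 B6;
a Latin-1 terminal shows those two bytes as two separate characters.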

You could easily change its putwchar function to output UTF-16 instead
of UTF-8; see http://czyborra.com/utf/#UTF-16
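
As a rough illustration of what such a replacement output routine has to
do, here is a Python sketch that turns one code point into big-endian
UTF-16, including the surrogate pair for code points above U+FFFF. The
name utf16be_units is hypothetical and the code is not derived from the
putwchar in scsu.c.

```python
def utf16be_units(code_point: int) -> bytes:
    """Encode one Unicode code point as big-endian UTF-16 bytes."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        # Basic Multilingual Plane: one 16-bit code unit.
        return code_point.to_bytes(2, "big")
    # Beyond the BMP: split the 20-bit remainder into two surrogates.
    v = code_point - 0x10000
    high = 0xD800 | (v >> 10)
    low = 0xDC00 | (v & 0x3FF)
    return high.to_bytes(2, "big") + low.to_bytes(2, "big")


if __name__ == "__main__":
    # The BOM plus "test\n", as in the od dump of test.txt.
    print((b"\xfe\xff" + b"".join(utf16be_units(ord(c)) for c in "test\n")).hex(" "))
```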

I have not yet written an SCSU compressor because doing that well is
probably an exercise in combinatorial optimization, and I currently
find other questions more important.

A trivial UTF-16 to SCSU converter simply replaces the leading byte order
mark with an SCU tag (0x0F, change to Unicode mode):

        $ sed '1s/^\(\xFE\xFF\)*/\x0F/' test.txt | od -t x1
        0000000 0f 00 74 00 65 00 73 00 74 00 0a
        0000013
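
The same trick can be sketched in Python. This version adds one precaution
that the sed one-liner does not take: as far as I read TR6, bytes 0xE0..0xF2
in the high-byte position are tags in Unicode mode, so code units
U+E000..U+F2FF have to be preceded by the UQU quote tag (0xF0). The function
name utf16be_to_trivial_scsu is invented here, and big-endian input is
assumed.

```python
BOM_BE = b"\xfe\xff"
SCU = 0x0F  # single-byte-mode tag: change to Unicode mode
UQU = 0xF0  # Unicode-mode tag: quote the next 16-bit code unit


def utf16be_to_trivial_scsu(data: bytes) -> bytes:
    """Wrap big-endian UTF-16 as SCSU without compressing anything."""
    if data.startswith(BOM_BE):
        data = data[2:]  # the byte order mark is dropped, as in the sed example
    if len(data) % 2:
        raise ValueError("odd number of bytes in UTF-16 input")
    out = bytearray([SCU])
    for i in range(0, len(data), 2):
        if 0xE0 <= data[i] <= 0xF2:
            # The high byte would otherwise be read as a Unicode-mode tag.
            out.append(UQU)
        out += data[i:i + 2]
    return bytes(out)


if __name__ == "__main__":
    print(utf16be_to_trivial_scsu(b"\xfe\xff\x00t\x00e\x00s\x00t\x00\n").hex(" "))
```

The result is valid SCSU but one byte longer than the UTF-16 text without
its byte order mark, so it is only useful as a fallback.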

Cheers
Roman


