Re: Unicode to UTF-8

From: Glen Perkins (Glen.Perkins@NativeGuide.com)
Date: Thu Mar 16 2000 - 04:06:13 EST


You know, I've solved this problem several times already, and I still end up
reaching for my Unicode book and the nearest piece of scrap paper almost
every time. ;-)

It's almost always when I need to test something. I'll have no real UTF-8
editor, and often no non-Latin-1 input method, so I'll want to type in the
hex digits for a few test cases and generate either the actual UTF-8 bytes,
or the hex representation, or some arrangement of both that I can use to
check against the output of some process. Just something to test a system or
a theory, which usually means a quickie one-way converter that simply
ignores surrogates, so it's nothing fancy.
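
A minimal sketch of that kind of quickie one-way converter, written in
Python purely for illustration (it is not one of the tools mentioned in this
thread). The function name utf8_hex is hypothetical. As described above, it
ignores surrogates: the input is assumed to be the hex digits of a single BMP
scalar value, and no other validation is attempted.

```python
def utf8_hex(codepoint_hex: str) -> str:
    """Convert hex digits for a BMP scalar value into UTF-8 bytes shown as hex."""
    v = int(codepoint_hex, 16)
    if v < 0x80:
        # One byte: 0xxxxxxx
        encoded = [v]
    elif v < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        encoded = [0xC0 | (v >> 6), 0x80 | (v & 0x3F)]
    elif v < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        encoded = [0xE0 | (v >> 12), 0x80 | ((v >> 6) & 0x3F), 0x80 | (v & 0x3F)]
    else:
        # Assumption of this sketch: anything past the BMP is out of scope.
        raise ValueError("beyond the BMP; this sketch ignores surrogates")
    return " ".join(f"{b:02X}" for b in encoded)


if __name__ == "__main__":
    for cp in ("0041", "00E9", "20AC"):
        print(f"U+{cp} -> {utf8_hex(cp)}")
```

The hex output can be pasted next to the output of whatever process is being
tested and compared by eye.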

I was actually hoping to see the UTF-8 values in the cells under the scalar
values in the new Unicode 3 book, because even though I've solved this
problem now several times, in Perl, C++, and Java, for both command line
Unix piping and GUI Windows (and a Java GUI on the Mac once), using my own homemade
code, others' code, and system calls...I STILL never seem to be able to
quickly put my hands on whatever specific converter I need, for the machine
I'm on, when and where I discover I need it, as fast as I can just do it
myself on paper. For a conversion requiring a table lookup, I pretty much
have to take the time to go dig up the right tool, of course, but for a
UTF-16 (or upper Latin-1, just as often these days) to UTF-8 conversion and
back, it's easier to just grab a scrap of paper and play the UTF-8 game, even though
it's about as much fun as long division. It would be easier if the UTF-8
were printed in the book.
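
For the "and back" half of the paper game, the steps are: read the lead byte
to learn how many bytes the sequence has, keep its low payload bits, then
append the low six bits of each trail byte. A rough Python sketch under those
assumptions follows. The name scalar_from_utf8_hex is hypothetical, and the
sketch does not reject overlong forms or malformed trail bytes.

```python
def scalar_from_utf8_hex(utf8_hex_bytes: str) -> str:
    """Turn one UTF-8 sequence written as hex (e.g. "E2 82 AC") into U+XXXX."""
    data = bytes.fromhex(utf8_hex_bytes)
    lead = data[0]
    if lead < 0x80:
        v, expected = lead, 1              # 0xxxxxxx
    elif (lead & 0xE0) == 0xC0:
        v, expected = lead & 0x1F, 2       # 110xxxxx
    elif (lead & 0xF0) == 0xE0:
        v, expected = lead & 0x0F, 3       # 1110xxxx
    elif (lead & 0xF8) == 0xF0:
        v, expected = lead & 0x07, 4       # 11110xxx
    else:
        raise ValueError("not a UTF-8 lead byte")
    if len(data) != expected:
        raise ValueError("wrong number of bytes for this lead byte")
    for trail in data[1:]:
        # Each trail byte contributes its low six bits.
        v = (v << 6) | (trail & 0x3F)
    return f"U+{v:04X}"


if __name__ == "__main__":
    for seq in ("41", "C3 A9", "E2 82 AC"):
        print(f"{seq} -> {scalar_from_utf8_hex(seq)}")
```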

I'd also like to have such a converter on my Palm V. These days my Palm
tends to follow me around. I'd like to have it, but I'm not feeling the urge
strongly enough to inspire me to figure out how to write a Palm app with
gcc, a compiler I don't have much use for these days, and I don't have
CodeWarrior. John, how are you at Palm apps? ;-)

Either that or maybe you could just lend me a scrap of paper....

__Glen Perkins__

----- Original Message -----
From: John Cowan <jcowan@reutershealth.com>
To: Unicode List <unicode@unicode.org>
Sent: Wednesday, March 15, 2000 2:23 PM
Subject: Re: Unicode to UTF-8

> Kenneth Whistler wrote:
>
> > Someday I'll write myself a little command line convertor for this --
> > I spend way too much time hand converting these little examples
> > back and forth!
>
> Oh, very well, here it is:
>
> ---cut here---
> #!/usr/bin/perl
> # This silly script examines its first argument.
> # It converts a U+xxxx or U-xxxxxxxx string into UTF-8.
> # If the argument doesn't look like that, it's assumed
> # to be UTF-8 already, and is converted to UTF-16 and UTF-32 instead.
> # No significant error checking; do not use in production.
> #
> # John Cowan (cowan@ccil.org) wrote this because Ken Whistler and I got
> # tired of doing the job by hand all the time.
> # No copyright, no warranty, use as you will.
>
> unless (($_) = @ARGV) {
> die "usage: utf (U+xxxx | U-xxxxxxxx | xxxx...)\n";
> }
>
> if (/^U\+(....)$/) {
> $v = hex($1);
> if ($v < 0x80) {
> printf "%-2.2X\n", $v;
> }
> elsif ($v < 0x800) { # two-byte form covers U+0080 through U+07FF
> $lead = 0xc0 + (($v >> 6) & 0x1f);
> $t1 = 0x80 + ($v & 0x3f);
> printf "%-2.2X %-2.2X\n", $lead, $t1;
> }
> else {
> $lead = 0xe0 + (($v >> 12) & 0xf);
> $t1 = 0x80 + (($v >> 6) & 0x3f);
> $t2 = 0x80 + ($v & 0x3f);
> printf "%-2.2X %-2.2X %-2.2X\n", $lead, $t1, $t2;
> }
> }
> elsif (/^U-(........)$/) {
> $v = hex($1);
> $lead = 0xf0 + (($v >> 18) & 0x7);
> $t1 = 0x80 + (($v >> 12) & 0x3f);
> $t2 = 0x80 + (($v >> 6) & 0x3f);
> $t3 = 0x80 + ($v & 0x3f);
> printf "%-2.2X %-2.2X %-2.2X %-2.2X\n", $lead, $t1, $t2, $t3;
> }
> else {
> if (/^(..)$/) {
> $lead = hex($1);
> printf "U+%-4.4X\n", $lead;
> }
> elsif (/^(..)(..)$/) {
> $lead = hex($1);
> $t1 = hex($2);
> printf "U+%-4.4X\n", (($lead & 0x1f) << 6) + ($t1 & 0x3f);
> }
> elsif (/^(..)(..)(..)$/) {
> $lead = hex($1);
> $t1 = hex($2);
> $t2 = hex($3);
> printf "U+%-4.4X\n", (($lead & 0xf) << 12) +
> (($t1 & 0x3f) << 6) + ($t2 & 0x3f);
> }
> elsif (/^(..)(..)(..)(..)$/) {
> $lead = hex($1);
> $t1 = hex($2);
> $t2 = hex($3);
> $t3 = hex($4);
> $v = (($lead & 0x7) << 18) + (($t1 & 0x3f) << 12) +
> (($t2 & 0x3f) << 6) + ($t3 & 0x3f);
> $s1 = 0xd800 + ((($v - 0x10000) >> 10) & 0x3ff);
> $s2 = 0xdc00 + ($v & 0x3ff);
> printf "U+%-4.4X U+%-4.4X\n", $s1, $s2;
> printf "U-%-8.8X\n", $v;
> }
> else {
> die "eh?\n";
> }
> }
> ---cut here---
>
> --
>
> Schlingt dreifach einen Kreis vom dies! || John Cowan <jcowan@reutershealth.com>
> Schliesst euer Aug vor heiliger Schau, || http://www.reutershealth.com
> Denn er genoss vom Honig-Tau, || http://www.ccil.org/~cowan
> Und trank die Milch vom Paradies. -- Coleridge (tr. Politzer)
>


