Re: UTF-8: Michael takes the plunge

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Tue Apr 06 1999 - 05:13:30 EDT

Next message: Michael Everson: "Re: Character converter"
Previous message: Frank Sledge: "Re: Viewing Extended Latin characters, was UTF-8..."
Maybe in reply to: Michael Everson: "UTF-8: Michael takes the plunge"
Next in thread: Michael Everson: "Re: UTF-8: Michael takes the plunge"

"Frank Sledge" wrote on 1999-04-06 02:14 UTC:
> My partner is in the same boat. He is planning to write a program
> in Chipmunk Basic that will convert an 8-bit document into UTF-8.
> I imagine the same could be done in Perl, C, Pascal or whatever,
> fairly simple task (at least in theory): suck in one byte from the
> input file, look up the UTF-8 equivalent and send it to the
> output file; repeat until end of input file.

Sounds like an excitingly challenging software-engineering project.
Estimated completion time including testing and documentation:
25 minutes.
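
A minimal sketch of that loop in Perl, assuming the 8-bit input is
ISO 8859-1 (my assumption; Frank did not say which character set). In
that case the byte value equals the code point and no lookup table is
needed; any other 8-bit set would need a 256-entry table instead. The
function name is made up for illustration:

```perl
use strict;
use warnings;

# Convert a string of ISO 8859-1 bytes to UTF-8 bytes.
# Assumption: input is Latin-1, so byte value == Unicode code point.
sub latin1_to_utf8 {
    my ($in) = @_;
    my $out = '';
    for my $byte (unpack('C*', $in)) {
        if ($byte < 0x80) {
            $out .= chr($byte);                  # ASCII passes through
        } else {
            $out .= chr(0xc0 | ($byte >> 6))     # lead byte  110xxxxx
                  . chr(0x80 | ($byte & 0x3f));  # trail byte 10xxxxxx
        }
    }
    return $out;
}

# Show the result as hex so it survives any terminal encoding.
print unpack('H*', latin1_to_utf8("caf\xe9")), "\n";   # 636166c3a9
```

To use it as a filter, put STDIN and STDOUT into binmode and pass each
chunk read from the input through latin1_to_utf8 before printing it.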

Even faster (estimated installation time: 12.5 minutes):

GNU recode already converts between UTF-8 and a large number of other
character sets:

http://www.iro.umontreal.ca/contrib/recode/recode-3.4q.tar.gz

Another alternative is the following Perl program, which replaces HTML/SGML
numeric character references with the corresponding UTF-8 sequences. It is
well suited to quickly entering UTF-8 test documents:

------------------------------------------------------------------
#!/usr/bin/perl
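# Convert HTML numeric character references to UTF-8. M. Kuhn, 1998

# Return the UTF-8 byte sequence for the code point $c.
sub utf8 ($) {
    my $c = shift(@_);

    if ($c < 0x80) {
        return sprintf("%c", $c);
    } elsif ($c < 0x800) {
        return sprintf("%c%c", 0xc0 | ($c >> 6), 0x80 | ($c & 0x3f));
    } elsif ($c >= 0xd800 && $c < 0xe000) {
        # surrogates are not characters: emit U+FFFD REPLACEMENT CHARACTER
        return "\xef\xbf\xbd";
    } elsif ($c < 0x10000) {
        return sprintf("%c%c%c",
                       0xe0 | ($c >> 12),
                       0x80 | (($c >> 6) & 0x3f),
                       0x80 | ($c & 0x3f));
    } elsif ($c < 0x110000) {
        # four-byte form for U+10000..U+10FFFF
        return sprintf("%c%c%c%c",
                       0xf0 | ($c >> 18),
                       0x80 | (($c >> 12) & 0x3f),
                       0x80 | (($c >> 6) & 0x3f),
                       0x80 | ($c & 0x3f));
    } else {
        # beyond the Unicode range: emit U+FFFD REPLACEMENT CHARACTER
        return "\xef\xbf\xbd";
    }
}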

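while (<>) {
    # Handle hex and decimal references in one pass, so that converted
    # output is never rescanned (otherwise "&#38;#65;" would be decoded
    # twice and end up as "A").
    s/&\#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/utf8(defined $1 ? hex($1) : $2)/ge;
    print;
}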
------------------------------------------------------------------
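
As a sanity check on the bit arithmetic, here is a small sketch that
encodes U+20AC (EURO SIGN) by hand with the same three-byte formula as in
the script above and compares the result with Perl's Encode module. This
assumes a Perl that ships Encode (5.8 or later); the helper name is made
up for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Three-byte case only (U+0800..U+FFFF), same formula as the script above.
sub utf8_3byte {
    my ($c) = @_;
    return pack('C3',
                0xe0 | ($c >> 12),            # lead byte   1110xxxx
                0x80 | (($c >> 6) & 0x3f),    # trail byte  10xxxxxx
                0x80 | ($c & 0x3f));          # trail byte  10xxxxxx
}

my $euro      = 0x20ac;
my $by_hand   = utf8_3byte($euro);
my $by_encode = encode('UTF-8', chr($euro));

print unpack('H*', $by_hand), "\n";                        # e282ac
print $by_hand eq $by_encode ? "match\n" : "mismatch\n";   # match
```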

You can get a Perl interpreter from

http://www.perl.com/pace/pub/perldocs/latest.html

and there is even a Mac version at

http://www.iis.ee.ethz.ch/~neeri/macintosh/perl.html

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:45 EDT