"Frank Sledge" wrote on 1999-04-06 02:14 UTC:
> My partner is in the same boat. He is planning to write a program
> in Chipmunk Basic that will convert an 8-bit document into UTF-8.
> I imagine the same could be done in Perl, C, Pascal or whatever,
> fairly simple task (at least in theory): suck in one byte from the
> input file, look up the UTF-8 equivalent and send it to the
> output file; repeat until end of input file.
Sounds like an excitingly challenging software-engineering project.
Estimated completion time including testing and documentation:
25 minutes.
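For what it is worth, the loop described above might look roughly like this in Perl. It is only a sketch, and it assumes the 8-bit input is ISO 8859-1 (Latin-1), where each byte value equals the Unicode code point, so the "lookup" reduces to bit arithmetic. Any other 8-bit character set would need a real 256-entry table. The name latin1_to_utf8 is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: convert a string of ISO 8859-1 bytes to UTF-8 bytes.
# Assumption: the input is Latin-1, so byte value == Unicode code point.
sub latin1_to_utf8 {
    my ($input) = @_;
    my $out = '';
    foreach my $byte (unpack('C*', $input)) {
        if ($byte < 0x80) {
            # ASCII passes through unchanged
            $out .= chr($byte);
        } else {
            # U+0080..U+00FF need two bytes: 110xxxxx 10xxxxxx
            $out .= chr(0xc0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3f));
        }
    }
    return $out;
}

# Demo: "caf\xe9" (e-acute as Latin-1 byte 0xE9) becomes 63 61 66 c3 a9.
my $utf8 = latin1_to_utf8("caf\xe9");
print join(' ', map { sprintf('%02x', $_) } unpack('C*', $utf8)), "\n";
```

To use it as a filter, the two demo lines could be replaced by a loop that reads STDIN in binmode and prints latin1_to_utf8($_) for each line.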
Even faster (estimated installation time: 12.5 minutes):
GNU recode already provides UTF-8 <-> anything conversion:
http://www.iro.umontreal.ca/contrib/recode/recode-3.4q.tar.gz
Another alternative is this Perl program, which replaces HTML/SGML
numeric character references with the corresponding UTF-8 sequences and
is well suited for quickly entering UTF-8 test documents:
------------------------------------------------------------------
#!/usr/bin/perl
# Convert HTML numeric character references to UTF-8. M. Kuhn, 1998

# Return the UTF-8 byte sequence for code point $c (U+0000..U+FFFF only).
sub utf8 ($) {
    my $c = shift(@_);
    if ($c < 0x80) {
        return sprintf("%c", $c);
    } elsif ($c < 0x800) {
        return sprintf("%c%c", 0xc0 | ($c >> 6), 0x80 | ($c & 0x3f));
    } elsif ($c < 0x10000) {
        return sprintf("%c%c%c",
                       0xe0 | ($c >> 12),
                       0x80 | (($c >> 6) & 0x3f),
                       0x80 | ($c & 0x3f));
    } else {
        # Anything above U+FFFF becomes U+FFFD REPLACEMENT CHARACTER.
        return utf8(0xfffd);
    }
}

while (<>) {
    # Replace hexadecimal (&#x...;) and decimal (&#...;) references in a
    # single pass, so that output of one substitution is never rescanned
    # (e.g. "&#38;#65;" yields "&#65;", not "A").
    s/&\#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/utf8(defined($1) ? hex($1) : $2)/ge;
    print;
}
------------------------------------------------------------------
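As a quick check of the bit arithmetic in the script above, here is a small stand-alone sketch that prints the resulting bytes in hex for a few code points. The helper name utf8_hex is made up for this illustration, and the code points were picked arbitrarily to hit each branch (one, two and three bytes, plus the U+FFFD fallback).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same bit arithmetic as the script above, restated so this runs on its own.
# utf8_hex is a hypothetical name; it returns the bytes as a hex string.
sub utf8_hex {
    my ($c) = @_;
    my @bytes;
    if ($c < 0x80) {
        @bytes = ($c);
    } elsif ($c < 0x800) {
        @bytes = (0xc0 | ($c >> 6), 0x80 | ($c & 0x3f));
    } elsif ($c < 0x10000) {
        @bytes = (0xe0 | ($c >> 12),
                  0x80 | (($c >> 6) & 0x3f),
                  0x80 | ($c & 0x3f));
    } else {
        # Out of range for this script: fall back to U+FFFD
        return utf8_hex(0xfffd);
    }
    return join(' ', map { sprintf('%02x', $_) } @bytes);
}

# Prints:
#   U+0041 -> 41
#   U+00A9 -> c2 a9
#   U+20AC -> e2 82 ac
#   U+10000 -> ef bf bd
printf("U+%04X -> %s\n", $_, utf8_hex($_)) for (0x41, 0xa9, 0x20ac, 0x10000);
```

So a test document containing &#x20AC; should come out of the script with the three bytes e2 82 ac in that position.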
You can get a Perl interpreter from
http://www.perl.com/pace/pub/perldocs/latest.html
and there is even a Mac version on
http://www.iis.ee.ethz.ch/~neeri/macintosh/perl.html
Markus
-- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>