Re: Character converter

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Tue Apr 06 1999 - 09:15:39 EDT


Michael Everson wrote on 1999-04-06 12:10 UTC:
> I got one answer, what to paste in the header. The other question, what
> CAPITAL LETTER D WITH DOT ABOVE is, remains unanswered....

Michael,

Is it essential that you use UTF-8?

Without UTF-8 or any special headers, you can always write these
characters in HTML as decimal numeric character references.
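
As a minimal sketch of what such a reference looks like (the helper name
ncr_dec is made up for this illustration), the decimal form is simply the
code point as a decimal number between "&#" and ";":

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the decimal numeric character reference
# for a Unicode code point given as an integer.
sub ncr_dec {
    my ($code_point) = @_;
    return sprintf("&#%d;", $code_point);
}

# U+1E0A is LATIN CAPITAL LETTER D WITH DOT ABOVE.
print ncr_dec(0x1E0A), "\n";    # prints &#7690;
```

Pasting the printed reference into plain ASCII HTML source should give the
dotted D in any browser that has a font containing it.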

The following small table lists, for every Unicode character named
"LATIN * WITH DOT ABOVE", the decimal numeric character reference, the
hexadecimal numeric character reference, and the UTF-8 encoded character:

&#266;    &#x10a;   Ċ LATIN CAPITAL LETTER C WITH DOT ABOVE
&#267;    &#x10b;   ċ LATIN SMALL LETTER C WITH DOT ABOVE
&#278;    &#x116;   Ė LATIN CAPITAL LETTER E WITH DOT ABOVE
&#279;    &#x117;   ė LATIN SMALL LETTER E WITH DOT ABOVE
&#288;    &#x120;   Ġ LATIN CAPITAL LETTER G WITH DOT ABOVE
&#289;    &#x121;   ġ LATIN SMALL LETTER G WITH DOT ABOVE
&#304;    &#x130;   İ LATIN CAPITAL LETTER I WITH DOT ABOVE
&#379;    &#x17b;   Ż LATIN CAPITAL LETTER Z WITH DOT ABOVE
&#380;    &#x17c;   ż LATIN SMALL LETTER Z WITH DOT ABOVE
&#7682;   &#x1e02;  Ḃ LATIN CAPITAL LETTER B WITH DOT ABOVE
&#7683;   &#x1e03;  ḃ LATIN SMALL LETTER B WITH DOT ABOVE
&#7690;   &#x1e0a;  Ḋ LATIN CAPITAL LETTER D WITH DOT ABOVE
&#7691;   &#x1e0b;  ḋ LATIN SMALL LETTER D WITH DOT ABOVE
&#7710;   &#x1e1e;  Ḟ LATIN CAPITAL LETTER F WITH DOT ABOVE
&#7711;   &#x1e1f;  ḟ LATIN SMALL LETTER F WITH DOT ABOVE
&#7714;   &#x1e22;  Ḣ LATIN CAPITAL LETTER H WITH DOT ABOVE
&#7715;   &#x1e23;  ḣ LATIN SMALL LETTER H WITH DOT ABOVE
&#7744;   &#x1e40;  Ṁ LATIN CAPITAL LETTER M WITH DOT ABOVE
&#7745;   &#x1e41;  ṁ LATIN SMALL LETTER M WITH DOT ABOVE
&#7748;   &#x1e44;  Ṅ LATIN CAPITAL LETTER N WITH DOT ABOVE
&#7749;   &#x1e45;  ṅ LATIN SMALL LETTER N WITH DOT ABOVE
&#7766;   &#x1e56;  Ṗ LATIN CAPITAL LETTER P WITH DOT ABOVE
&#7767;   &#x1e57;  ṗ LATIN SMALL LETTER P WITH DOT ABOVE
&#7768;   &#x1e58;  Ṙ LATIN CAPITAL LETTER R WITH DOT ABOVE
&#7769;   &#x1e59;  ṙ LATIN SMALL LETTER R WITH DOT ABOVE
&#7776;   &#x1e60;  Ṡ LATIN CAPITAL LETTER S WITH DOT ABOVE
&#7777;   &#x1e61;  ṡ LATIN SMALL LETTER S WITH DOT ABOVE
&#7786;   &#x1e6a;  Ṫ LATIN CAPITAL LETTER T WITH DOT ABOVE
&#7787;   &#x1e6b;  ṫ LATIN SMALL LETTER T WITH DOT ABOVE
&#7814;   &#x1e86;  Ẇ LATIN CAPITAL LETTER W WITH DOT ABOVE
&#7815;   &#x1e87;  ẇ LATIN SMALL LETTER W WITH DOT ABOVE
&#7818;   &#x1e8a;  Ẋ LATIN CAPITAL LETTER X WITH DOT ABOVE
&#7819;   &#x1e8b;  ẋ LATIN SMALL LETTER X WITH DOT ABOVE
&#7822;   &#x1e8e;  Ẏ LATIN CAPITAL LETTER Y WITH DOT ABOVE
&#7823;   &#x1e8f;  ẏ LATIN SMALL LETTER Y WITH DOT ABOVE
&#7835;   &#x1e9b;  ẛ LATIN SMALL LETTER LONG S WITH DOT ABOVE

You can try to cut and paste the characters from this table, in any of
the three forms, into your raw HTML document with any 8-bit plain-text
editor.

I can easily send you the entire Unicode table in this form if that
would be of any help.

I've just spent the last 3 minutes writing the following tiny Perl
program that produced this table. Perl is extremely useful for
transforming the Unicode database into anything in a few minutes.

#!/usr/bin/perl

# subroutine to convert an integer into a UTF-8 string

sub utf8 ($) {
    my $c = shift(@_);

    if ($c < 0x80) {
        return sprintf("%c", $c);
    } elsif ($c < 0x800) {
        return sprintf("%c%c", 0xc0 | ($c >> 6), 0x80 | ($c & 0x3f));
    } elsif ($c < 0x10000) {
        return sprintf("%c%c%c",
                       0xe0 | ($c >> 12),
                       0x80 | (($c >> 6) & 0x3f),
                       0x80 | ($c & 0x3f));
    } else {
        return utf8(0xfffd);
    }
}

# read list of all Unicode names (UnicodeData-Latest.txt) and
# output a list with NCRs (dec and hex) as well as UTF-8 and the name
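while (<>) {
    # each record has 15 semicolon-separated fields: the code point as
    # four hex digits, then the character name, then 13 further fields
    if (/^([0-9A-F]{4});([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*);([^;]*)$/) {
        next if ($2 eq "<control>");
        my $ncr_dec = sprintf("&#%d;", hex($1));
        my $ncr_hex = sprintf("&#x%x;", hex($1));
        # pad both references to a width of 10 columns
        printf("%s%s%s %s\n",
               $ncr_dec . (" " x (10-length($ncr_dec))),
               $ncr_hex . (" " x (10-length($ncr_hex))),
               utf8(hex($1)), $2);
    } else {
        # $ARGV holds the name of the file currently being read via <>
        die("Syntax error in line '$_' in file '$ARGV'");
    }
}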
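
To check the encoding step on the character in question, here is a small
standalone sketch. It uses the same bit layout as utf8() above, but the name
utf8_bytes and the use of pack() instead of sprintf("%c") are my own choices
for this illustration, so that the result is explicitly a string of bytes:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical variant of utf8(): encode a code point from the Basic
# Multilingual Plane as one to three UTF-8 bytes.
sub utf8_bytes {
    my ($c) = @_;
    return pack("C", $c) if $c < 0x80;
    return pack("C2", 0xc0 | ($c >> 6), 0x80 | ($c & 0x3f)) if $c < 0x800;
    return pack("C3", 0xe0 | ($c >> 12),
                      0x80 | (($c >> 6) & 0x3f),
                      0x80 | ($c & 0x3f)) if $c < 0x10000;
    return utf8_bytes(0xfffd);    # anything larger: substitute U+FFFD
}

# U+1E0A (LATIN CAPITAL LETTER D WITH DOT ABOVE) is 0001 1110 0000 1010
# in binary: top 4 bits, middle 6 bits and low 6 bits go into the 3 bytes.
my $hex = join(" ", map { sprintf("%02X", $_) } unpack("C*", utf8_bytes(0x1E0A)));
print "$hex\n";    # prints E1 B8 8A
```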

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


