Re: UTF-8 code in HTML

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Tue Apr 11 2000 - 18:22:41 EDT


fady, i have good news for you:
the document character set for html is unicode. this means that all numeric character references _are_ unicode code points. it also means that the best solution for you is trivial.

for your questions, see details below.

markus

Fady Elias wrote:
>
> Hello everybody,
> I have a couple of questions about UTF-8
>
> 1- how can I get the UTF-8 code point for each char.
> ex: what is the UTF-8 code point for the Japanese char 'A' with
> the unicode code point 3042
>

you use an html authoring tool and write or copy/paste your text and set the page encoding to utf-8. no magic. if you have pages in other codepages, you can use one of many conversion tools and convert from those encodings to utf-8 (and change your http announcement or meta tag).
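if you would rather script the conversion than use a ready-made tool, it might look roughly like the python sketch below. the function name `transcode_to_utf8` and the choice of shift_jis as the source codepage are only assumptions for illustration; use whatever encoding your existing pages are really in.

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """re-encode page bytes from a legacy codepage to utf-8.

    hypothetical helper, not part of any tool mentioned in this thread.
    """
    # decode the legacy bytes into unicode text first ...
    text = data.decode(source_encoding)
    # ... then write the same text out again as utf-8
    return text.encode("utf-8")


# assumed example: hiragana "a" (U+3042) as stored in a shift_jis page
legacy_bytes = b"\x82\xa0"
utf8_bytes = transcode_to_utf8(legacy_bytes, "shift_jis")
print(utf8_bytes.hex(" "))
```

the text itself does not change, only the bytes on disk, so remember to update the charset in the http header or meta tag to match.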

if you want to do utf-8 by hand, look at the utf-8 algorithm for how you calculate the bytes, or use some tools like http://www.macchiato.com/mark/UnicodeConverter .

by the way, by my manual calculation U+3042 in utf-8 is the three bytes 0xe3 0x81 0x82.
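for the curious, the manual calculation can be sketched in python as follows. this is only an illustration of the standard utf-8 bit layout; the function name `utf8_encode` is made up here, and in practice python's built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(cp: int) -> bytes:
    """encode a single unicode code point as utf-8, by hand."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a unicode scalar value: {cp:#x}")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(utf8_encode(0x3042).hex(" "))  # e3 81 82
```

U+3042 falls in the three-byte range, which is why it comes out as three bytes.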

> 2 - how can I represent these values in HTML file. for example to
> represent the English char 'A' you type the value (&#65), how can i do the
> same using unicode or UTF-8 values .
>

well, you can use hexadecimal numeric character references where U+3042 becomes &#x3042; (see the pattern? :-)

older browsers understand only decimal ncrs, so you need to convert the hexadecimal number to decimal and get &#12354; (i use the windows calculator for it...). this does not have anything to do with utf-8, you can do this with html pages in any encoding, and it is essentially what it means that "the html document character set is unicode".
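if you do not have a calculator at hand, the hex-to-decimal step can be scripted. a minimal python sketch follows; the helper names `ncr_hex` and `ncr_dec` are invented for this example.

```python
import html


def ncr_hex(ch: str) -> str:
    """hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):X};"


def ncr_dec(ch: str) -> str:
    """decimal numeric character reference for one character."""
    return f"&#{ord(ch)};"


ch = "\u3042"  # hiragana letter a, U+3042
print(ncr_hex(ch))  # &#x3042;
print(ncr_dec(ch))  # &#12354;

# both forms name the same unicode code point, whatever the page encoding
assert html.unescape(ncr_hex(ch)) == html.unescape(ncr_dec(ch)) == ch
```

the references themselves are plain ascii, so they survive in a page saved in any encoding.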

really, you should use an authoring tool, set the page encoding to utf-8, and type and paste away.

> Thanks in advance
> Fady

you are welcome!

markus


