Re: character mapping

From: Jukka.Korpela@hut.fi
Date: Thu Dec 14 2000 - 03:15:35 EST


On Wed, 13 Dec 2000, sreekant wrote:

> I have typed some characters in telugu in some language
> editor and stored it as a html page.

This might mean various things, and some of them have nothing to
do with Unicode. I suppose it would in most cases mean that you
have used an editor which displays octets (bytes) as Telugu
characters according to some mechanism which typically treats
octets 0 - 127 according to the ASCII standard and octets
128 - 255 according to some special convention, a "private agreement"
so to speak. Often this just means that a specific _font_ is used,
with glyphs for Telugu characters in those positions. This would
_not_ mean using Unicode in any sense.
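
To illustrate why an octet alone says nothing about which character is
meant, here is a minimal Python sketch. The octet value 0xC1 is an
arbitrary choice for illustration, and the three encodings are just
well-known registered ones, not anything a Telugu font scheme uses.

```python
# One and the same octet, read under three different conventions.
# 0xC1 is an arbitrary example value in the 128 - 255 range.
octet = bytes([0xC1])

as_latin1 = octet.decode("iso-8859-1")  # U+00C1, LATIN CAPITAL LETTER A WITH ACUTE
as_cp1251 = octet.decode("cp1251")      # U+0411, CYRILLIC CAPITAL LETTER BE
as_koi8r = octet.decode("koi8-r")       # U+0430, CYRILLIC SMALL LETTER A

for label, ch in [("iso-8859-1", as_latin1),
                  ("cp1251", as_cp1251),
                  ("koi8-r", as_koi8r)]:
    print(label, "U+%04X" % ord(ch))

# Octets 0 - 127 are the unproblematic part: all three agree with ASCII there.
ascii_octet = bytes([0x41])
same_everywhere = {ascii_octet.decode(e) for e in ("iso-8859-1", "cp1251", "koi8-r")}
```

A font-based scheme adds one more private reading of the same octet,
known only to whoever has that particular font installed.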

For example, http://www.geocities.com/Athens/Delphi/2627/index3.html
seems to refer to using Telugu fonts that way.

Such schemes have been widely used and are still used, for various
scripts, but they suffer from serious problems. Without going into all the
details, I'd just refer to some relevant documents (especially as regards
using such approaches on the Web):
http://www.dantobias.com/webtips/char.html
http://babel.alis.com/web_ml/html/fontface.en.html
http://ppewww.ph.gla.ac.uk/%7eflavell/charset/internat.html
http://www.hut.fi/u/jkorpela/HTML/chars.var

In Unicode, Telugu characters are in the code positions
U+0C00..U+0C7F, see http://charts.unicode.org/Web/U0C00.html
and using them requires something essentially different from the
font technique described above.
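
As a small sketch of what the Unicode side looks like in practice, the
following Python fragment tests whether a character falls in the Telugu
block and shows its UTF-8 octets. The helper name is_telugu is made up
for this illustration; the range itself is the one given above.

```python
import unicodedata

# Telugu block boundaries, as given in the text: U+0C00..U+0C7F.
TELUGU_FIRST, TELUGU_LAST = 0x0C00, 0x0C7F

def is_telugu(ch: str) -> bool:
    """Return True if the single character ch lies in the Telugu block."""
    return TELUGU_FIRST <= ord(ch) <= TELUGU_LAST

letter_a = "\u0C05"                       # TELUGU LETTER A
print(unicodedata.name(letter_a))         # the character's Unicode name
print(is_telugu(letter_a), is_telugu("A"))
print(letter_a.encode("utf-8"))           # three octets in UTF-8
```

Here the identity of the character is carried by the data itself, not
by whichever font happens to be selected.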

> I have NOT given any charset in the <meta tag> but the telugu font is
> INSTALLED on my machine.I am able to see the characters in the browser.

But can others? One of the many reasons for developing Unicode was
to provide a way to use rich (or just different) character repertoires
without needing tricks and without needing to download fonts on an
ad hoc basis and switch between fonts.

> because when u give the charset the character mapping takes place
> according to the charset, how it is working over here when no character
> set is defined?

Trickery. Basically, if I have guessed correctly, your browser doesn't
understand at all what it is doing. It simply operates on octets and
displays them according to the font in use. Technically, it _would_
be possible to indicate the encoding (which tells what different
octets _logically_ mean, as abstract characters) in HTTP headers
or in META tags that are supposed to simulate HTTP headers.
But as far as I know, there is no registered encoding corresponding
to the use discussed here, and even if there were, it would hardly be
understood by browsers.
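
For comparison, this is roughly how an encoding is indicated when a
registered one does exist. The sketch below assumes UTF-8 and a made-up
one-line page; it builds the Content-Type header value, the META tag
that simulates it, and the octets that would actually be sent.

```python
from email.message import Message

# Assumption for this sketch: the page is encoded as UTF-8.
header_value = "text/html; charset=utf-8"

# The META tag repeats the HTTP header value inside the document.
meta_tag = '<meta http-equiv="Content-Type" content="%s">' % header_value

# Parse the header the way a client might, to recover the charset name.
msg = Message()
msg["Content-Type"] = header_value
charset = msg.get_content_charset()

# A made-up page containing one Telugu character (U+0C05).
page = ("<html><head>%s<title>Telugu</title></head>"
        "<body>\u0C05</body></html>" % meta_tag)
octets = page.encode(charset)

print("Content-Type:", header_value)
print(len(octets), "octets")
```

With such a declaration the browser can map octets to abstract
characters first and only then pick a font that has the glyphs.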

It might be a good idea to have two versions of a page, one using
the font technique (or trickery), and one using some Unicode encoding.
It would be relatively easy to generate the "fontistic" version from
the Unicode version; you would probably want to use some Unicode-capable
editor for creating the Unicode version, see e.g.
http://www.hclrss.demon.co.uk/unicode/utilities_editors.html
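
Generating the "fontistic" version could be sketched as below. Note
that the mapping table is entirely hypothetical: every Telugu font
scheme assigns its own octet positions, so FONT_POSITIONS and the
function name to_font_octets are placeholders, and a real conversion
would probably also have to deal with reordering and combined glyphs.

```python
# HYPOTHETICAL table: Unicode character -> octet position in some font.
# The values are invented; a real font would need its own full table.
FONT_POSITIONS = {
    "\u0C05": 0xC1,  # TELUGU LETTER A  -> made-up position
    "\u0C06": 0xC2,  # TELUGU LETTER AA -> made-up position
}

def to_font_octets(text: str) -> bytes:
    """Convert Unicode text to octets for the hypothetical font scheme."""
    out = bytearray()
    for ch in text:
        if ord(ch) < 128:
            out.append(ord(ch))              # ASCII passes through unchanged
        elif ch in FONT_POSITIONS:
            out.append(FONT_POSITIONS[ch])   # font-specific octet
        else:
            raise ValueError("no font position for U+%04X" % ord(ch))
    return bytes(out)

print(to_font_octets("a\u0C05\u0C06"))
```

The reverse direction is harder, since the octets themselves carry no
indication of which font scheme they were written for.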

-- 
Yucca, http://www.hut.fi/u/jkorpela/


