Re: conversion of ascii to utf-8

From: Jukka.Korpela@hut.fi
Date: Mon Dec 11 2000 - 02:32:00 EST


On Sun, 10 Dec 2000, sreekant wrote:

> The 7-bit ASCII text on the web is the only common format
> that is truly supported in all countries.

More or less so, although not even all ASCII characters are "safe",
due to different "national variants of ASCII". Technically, the
variants are not ASCII but variants of ISO 646 differing from
ISO 646 IRV ("International Reference Version"), which is equivalent to
ASCII. In practical terms, however, people often use the word
"ASCII" to refer to _different_ character codes - even to some 8-bit
codes quite often! See
http://www.hut.fi/u/jkorpela/chars.html#ascii

For such reasons, the name US-ASCII is preferred for clarity. Formally
just a synonym for ASCII, it tries to make it clear that it refers to
an American standard which may differ from what people in other countries
think of as ASCII. Note that in the MIME registry too,
http://www.isi.edu/in-notes/iana/assignments/character-sets
US-ASCII is the preferred name.

> For unicode compatibility the utf-8 format is normally preferred.

In a sense yes. See RFC 2277, "IETF Policy on Character Sets and
Languages" (available e.g. from http://www.faqs.org/rfcs/rfc2277.html ).
But this does not exclude the use of US-ASCII where applicable. The policy
might - now I'm adding a lot of my interpretation - be read as
recommending UTF-8 as the encoding to be primarily considered _when_
US-ASCII does not give you a sufficiently rich character repertoire.

> What approach is
> to be followed to convert the ascii representation of characters into
> the utf-8 format to enable the unicode compatibility?

No conversion is needed. A US-ASCII data stream _is_ UTF-8 encoded too.
One of the basic reasons behind recommending UTF-8 is that US-ASCII
characters are invariant in it: every character in the range U+0000 to
U+007F (the US-ASCII range) is represented directly as one octet
with the same value in the UTF-8 encoding.
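
To illustrate the point, here is a minimal sketch in Python (my choice of
language for illustration only; the variable names and the helper
is_us_ascii are made up for this example). It checks that encoding a
pure US-ASCII string as "ascii" and as "utf-8" yields the same octets,
and that each code point from U+0000 to U+007F becomes one octet of the
same value.

```python
# A string that uses only characters from the US-ASCII repertoire.
text = "Plain US-ASCII text, code points U+0000..U+007F only."

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# The two encodings produce identical octet sequences.
same_bytes = ascii_bytes == utf8_bytes

# So US-ASCII data can be read as UTF-8 without any conversion step.
round_trip = ascii_bytes.decode("utf-8") == text

# Each code point in U+0000..U+007F maps to exactly one octet of the same value.
one_octet_each = all(chr(cp).encode("utf-8") == bytes([cp]) for cp in range(0x80))


def is_us_ascii(data: bytes) -> bool:
    """Return True if every octet is in the range 0x00..0x7F (hypothetical helper)."""
    return all(b < 0x80 for b in data)
```

A check like is_us_ascii could be used to confirm that existing data
really is US-ASCII, and not some 8-bit code loosely called "ASCII",
before labelling it as UTF-8.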

-- 
Yucca, http://www.hut.fi/u/jkorpela/


