RE: Benefits of Unicode

From: Peter_Constable@sil.org
Date: Mon Jan 29 2001 - 11:42:20 EST


On 01/29/2001 09:50:48 AM "Richard, Francois M" wrote:

>That would be my next question: Although I might have an HTML file encoded
>in iso-8859-1, the parser has to interpret following the markup AND using
>the Unicode repertoire (CCS).
>Is this flexibility taken into consideration anywhere in Unicode?

The CCS for HTML can be the Unicode repertoire, which means that an HTML
document is in principle capable of containing any character in that
repertoire, and that the characters must be interpreted per the
requirements of Unicode. But the HTML spec can still allow the encoding
to be something other than a Unicode-sanctioned encoding form, and that
particular encoding, e.g. iso-8859-1, might be capable of representing
only a subset of the Unicode repertoire. That is in no way a problem for
Unicode: all this non-Unicode machinery is, in the view of Unicode, a
higher-level protocol. The data has to be processed per the specification
of iso-8859-1 and mapped into the Unicode CCS. Once that level of
interpretation is done, Unicode's specification and conformance
requirements apply. In other words, choosing Unicode as the CCS means
that once you've determined that a byte sequence in the HTML file
represents the Unicode character U+12A2 (say), you can't treat it as
though it were (say) an Arabic letter beh.
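
To make the two levels concrete, here is a rough sketch in Python (my own
illustration, not anything taken from the HTML or Unicode specs). The
function name decode_html_text and the sample bytes are hypothetical. It
assumes the file is declared as iso-8859-1 and uses a numeric character
reference to reach a character that iso-8859-1 itself cannot encode:

```python
import html
import unicodedata


def decode_html_text(raw: bytes, charset: str = "iso-8859-1") -> str:
    """Hypothetical helper: map encoded bytes into Unicode characters."""
    # Level 1 (the "higher-level protocol"): interpret the bytes per the
    # declared encoding and map them into the Unicode CCS.
    text = raw.decode(charset)
    # Numeric character references name Unicode code points directly,
    # whatever the byte encoding of the file happens to be.
    return html.unescape(text)


# Sample data (an assumption for illustration): byte 0xE9 is e-acute in
# iso-8859-1, and &#x12A2; refers to U+12A2, which is outside iso-8859-1.
raw = b"caf\xe9 &#x12A2;"
decoded = decode_html_text(raw)

# Level 2: from here on only Unicode semantics apply to each character.
for ch in decoded:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```

Once the last character has been identified as U+12A2, a conformant
process has to treat it as that character; it is not free to reinterpret
it as U+0628 or anything else.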

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


