RE: Benefits of Unicode

From: Richard, Francois M (
Date: Mon Jan 29 2001 - 11:14:26 EST

More questions...

> More to Francois:
> >> When I create and exchange an HTML file for instance:
> >> <HTML>
> >> <TITLE>bla</TITLE>
> >> </HTML>
> >>
> >> only 'bla' is plain text. To conform to Unicode, does it mean I have to
> >> use the Unicode character set and encoding ONLY for 'bla'? (which would
> >> indicate mixing of character encoding in one single file)
> The HTML spec uses textual markup (as opposed to some binary file format),
> so what constitutes plain text depends on how you're interpreting an HTML
> file. To an HTML parser, in a sense, it's all plain text; that is, the
> parser has to interpret the plain text data to identify tokens like <HTML>.
> After that parsing has occurred, then at a different level the file has
> been analysed into content portions and markup portions, and at this level
> only the content portions are seen as plain text.

That would be my next question: although I might have an HTML file encoded
in iso-8859-1, the parser has to interpret it following the markup AND using
the Unicode repertoire (CCS).
Is this flexibility taken into consideration anywhere in Unicode?
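To make the question concrete, here is a minimal Python sketch. It assumes, as HTML 4 specifies, that numeric character references always denote Unicode code points whatever the file's byte encoding is. The sample bytes and variable names are invented for illustration; `html.unescape` is the standard-library helper that resolves the references.

```python
import html

# Hypothetical HTML file stored on disk as ISO-8859-1 bytes.
# 0xE9 is 'é' in ISO-8859-1; the euro sign has no ISO-8859-1 byte,
# so it is written as the numeric character reference &#x20AC;.
raw = b"<HTML><TITLE>caf\xe9 &#x20AC;5</TITLE></HTML>"

# Step 1: the declared encoding (CES) turns bytes into characters.
text = raw.decode("iso-8859-1")

# Step 2: character references are resolved against Unicode code points (CCS).
resolved = html.unescape(text)

# List the non-ASCII characters the parser ends up with.
code_points = [f"U+{ord(ch):04X}" for ch in resolved if ord(ch) > 0x7F]
print(code_points)
```

So the file is in a single byte encoding throughout, yet every character in it, markup included, ends up identified by a Unicode code point.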

> While it might be possible to create an HTML-like specification in which
> the markup and the content could potentially be in different encodings
> (with some constraints: you need to avoid byte sequences in content that
> can be wrongly interpreted as markup), this is not the case for HTML or
> for XML: the entire file, markup and content, must be in the same encoding.
> >> The second problem I can see with Unicode is the fact that although the
> >> character set is universal, the encoding forms are multiple (UTF-8,
> >> UTF-16 and UTF-32).
> How is that a problem? It is the kind of flexibility that makes Unicode
> very practical for implementers. It may be necessary to translate from one
> encoding form to another on occasion, but that is very simple.
Since there is more than one encoding for the Unicode CCS, the encoding used
has to be declared. I thought the idea behind Unicode was to be unambiguous. I
see an unambiguous CCS, but not an unambiguous CEF or CES.
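A small Python sketch of the distinction I am drawing, assuming Python's built-in codecs as the converter (the sample string is arbitrary). The code points stay the same while the bytes differ per encoding, which is why the encoding has to be declared or signalled, for example with a byte order mark.

```python
import codecs

text = "bla \u00e9 \u20ac"  # arbitrary sample: ASCII, U+00E9, U+20AC

# One CCS (the code points), several byte encodings of it.
encoded = {name: text.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, data in encoded.items():
    # The byte sequences differ, but each decodes back to the same code points.
    assert data.decode(name) == text
    print(f"{name:10} {len(data):2} bytes  {data.hex()}")

# Converting between encodings is a decode followed by an encode.
utf16_from_utf8 = encoded["utf-8"].decode("utf-8").encode("utf-16-be")

# Without an external declaration, a byte order mark can signal the scheme:
# the generic "utf-16" codec reads the BOM to pick the byte order.
with_bom = codecs.BOM_UTF16_BE + encoded["utf-16-be"]
decoded_via_bom = with_bom.decode("utf-16")
```

The conversion is indeed simple, but only once the receiver knows which of the encodings it has been handed.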

And this goes back to my first question:
If a protocol or format allows different CEFs and CESs (not only UTFs) but
provides a way to access the whole Unicode CCS, and agents have to interpret
it following the Unicode CCS, isn't this compliant with Unicode (in a way: the
"CCS" way)?


> - Peter
> ---------------------------------------------------------------------------
> Peter Constable
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <>
