Re: Benefits of Unicode

From: Tex Texin (texin@progress.com)
Date: Sun Jan 28 2001 - 15:31:37 EST


Dear Francois,

1) I think anyone can post to the Unicode list. You don't have to
be a member. I copied your question to the list so you can get
better answers than I have time to give. I am about to hop on a plane.

2) Yes, Unicode is for plain text. As you say, you can wrap markup or
other higher-level protocols around plain text. XML, for example,
defaults to Unicode; this is specified in the standard.

You gave a number of examples of rich text wrapped around plain text;
none of them conflicts with using Unicode inside the wrapping.
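
As a rough sketch of that point (an illustration only, in Python, with a
made-up sample document): the markup and the text it wraps can all be
Unicode characters, so nothing forces a mix of encodings in one file, and
the choice of encoding form does not change the characters.

```python
# Hypothetical sample document: tags and content alike are Unicode characters.
html = "<HTML>\n<TITLE>bla \u00e4 \u4e2d</TITLE>\n</HTML>\n"

# The same character sequence serialized in three encoding forms.
encoded = {name: html.encode(name) for name in ("utf-8", "utf-16-le", "utf-32-le")}

for name, data in encoded.items():
    # Each form decodes back to the identical sequence of code points.
    assert data.decode(name) == html
    print(name, len(data), "bytes")
```

The byte counts differ between the three forms, but every form round-trips
to the same string, which is why the wrapper and the wrapped text can share
one encoding.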

"Richard, Francois M" wrote:
>
> David, Tex,
>
> I am addressing the following questions to you directly, as I believe I
> cannot post questions to the Unicode list (I am not a member) and I believe
> they are related to your discussion of Unicode benefits...
> My question is related to the "plain text" principle of Unicode. I have been
> reading Unicode 3.0 and the unicode.org web site very carefully and found
> that Unicode and its conformance are for "plain text".
>
> But in the real world, very little plain text is exchanged as such.
> Most plain text that is exchanged is actually wrapped in rich text: all markup
> languages (HTML, XML, ...), RTF, PostScript, resource files...
>
> So if Unicode is for plain text only, I assume that Unicode is not
> concerned with (meaning its conformance cannot be applied to) HTML, XML,
> RTF, PostScript,... That reduces its scope a lot.
>
> When I create and exchange an HTML file for instance:
> <HTML>
> <TITLE>bla</TITLE>
> </HTML>
>
> only 'bla' is plain text. To conform to Unicode, does that mean I have to use
> the Unicode character set and encoding ONLY for 'bla'? (That would imply
> mixing character encodings in a single file.)
>
> Can Unicode conformance be applied to RTF (and how)?
>
> What about PostScript, which lets you encode any Unicode character with
> uni<code> syntax?
> What about Java ResourceBundle files, which are always in ASCII but use
> '\uxxxx' notation for Unicode characters?
>
> So, I really would like to understand Unicode conformance:
> When I study a new format, what are the simple questions I should answer to
> find out whether Unicode conformance is achieved?
>
> The second problem I can see with Unicode is the fact that although the
> character set is universal, the encoding forms are multiple (UTF-8, UTF-16
> and UTF-32).
>
> I hope you have some time to help answer these questions or direct me to
> a source that will.
>
> Francois.
>
> > -----Original Message-----
> > From: David Starner [mailto:dstarner98@aasaa.ofe.org]
> > Sent: Saturday, January 27, 2001 7:28 PM
> > To: Unicode List
> > Cc: Unicode List
> > Subject: Re: Benefits of Unicode
> >
> >
> > On Sat, Jan 27, 2001 at 01:36:44PM -0800, Richard Cook wrote:
> > > Has anybody played devil's advocate to this, with a list of "Failings of
> > > Unicode"? Are there any? :-) This question might in fact result in a
> > > longer Benefits list ....
> >
> > Here's a start (and it's true that a lot of these are double-edged,
> > and are the results of good design choices):
> >
> > * Failure to gain the trust of many Japanese and Chinese users
> >
> > * Character placement randomness: frequently the order of characters
> >   is random or appears so; often the only explanations are older standards
> >   or evolution, and even when there is a reason, it differs from block
> >   to block. Characters aren't always found in the rational block (for
> >   example, all the stuff in the Basic Latin block, or the characters in
> >   the Letterlike Symbols block that should be in the Mathematical
> >   Alphanumeric block).
> >
> > * Too much dependence on ASCII and Latin-1
> >   * The first 256 characters are essentially random. They should be
> >     sorted out into blocks like the rest of the characters.
> >   * The Latin-1 block has special exceptions, like U+00B5 not being
> >     canonically equivalent to U+03BC.
> >   * Characters like U+0027 and U+0060 add confusion as to what should be
> >     used for the apostrophe (a large number of Unicode documents use
> >     U+0027 indiscriminately where U+2019 or U+2032 would be more
> >     "correct").
> >
> > * A lot of canonically equivalent or compatibility-equivalent
> >   characters exist that add complexity to Unicode for little
> >   or no gain in expressivity.
> >   * There is no documentation on which compatibility
> >     characters really shouldn't be used, although people on
> >     unicode@unicode claim that various characters shouldn't
> >     be used.
> >
> > * Having precomposed letters adds complexity without additional
> >   expressivity. You have to worry about two forms of ä (a-umlaut)
> >   and the like anywhere you want to do comparisons.
> >
> > * Unicode is too intimately connected with UTF-16. UTF-16 could have
> >   been done by moving U+E000-U+FFFF down 800h spaces and starting the
> >   supplementary characters at U+F800. (So U+F800 has some character value,
> >   but F800 is a surrogate code in UTF-16, and surrogates can be combined
> >   in some way to get U+F800.)
> >
> > * Sort of related to the last one, foo with combining character bar should
> >   be foo with bar, not foo with apostrophe or foo with comma. For example,
> >   U+0165 is SMALL T WITH CARON and equals t plus combining caron, but is
> >   shown as t', making it impossible to represent SMALL T WITH CARON in
> >   Unicode.
> >
> > (One beef that isn't really mine . . .)
> >
> > * A lot of characters are combined in ways that make them harder to use,
> >   so that text requires careful rich-text markup to look decent: the
> >   combining of one Serbian character with a Russian character (the one that
> >   looks different in italics), the mixing of CJK ideographs, the combining
> >   of Coptic and Greek, and the mixing of Fraktur and Latin characters (which
> >   would have been a big deal 75 years ago).
> >
> > As you can probably see, I'm a theorist with a distinct view of what the
> > universal character encoding should look like. Fortunately, unlike the
> > Rosetta guy and the Tron people, I know that Unicode is the right
> > combination of support and 90% solution to win, even against superior
> > opposition, and that I am not a worldwide expert on languages, and any
> > attempt I made would be hopelessly cribbed from Unicode (see the Tron
> > character set) and ineptly designed outside of the Latin block.
> >
> > > > Tex Texin wrote:
> > > > An ISO standard | Standards insure interoperability | Any applications
> > > > reading the same text file will interpret it correctly
> >
> > Depending on your audience, I might cut this back some. First, the whole
> > "Standards insure interoperability" almost makes me laugh. Even with the
> > POSIX and ISO C++ standards out there, moving programs that depend on those
> > standards from one system to another can be difficult. And those are
> > standards that are generally respected - the extended Pascal standard
> > and BASIC standard often don't even get lip service paid to them.
> >
> > Also, "any applications reading the same text file will interpret it
> > correctly" isn't true; almost any program reading Latin/Greek/Chinese
> > text will handle it (provided the right fonts), but many won't handle
> > RTL languages, many won't handle combining characters, many won't handle
> > proper formatting for Arabic and Indic languages, and many won't handle
> > suppchars.
> >
> > --
> > David Starner - dstarner98@aasaa.ofe.org
> > Pointless website: http://dvdeug.dhis.org
> >
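
To make David's points about U+00B5 and precomposed letters concrete, here
is a minimal sketch using Python's standard unicodedata module. It is an
illustration added to the thread, not something either of us ran at the
time, and the sample strings are assumptions chosen for the example.

```python
import unicodedata

# Precomposed a-umlaut versus "a" followed by COMBINING DIAERESIS.
precomposed = "\u00e4"
decomposed = "a\u0308"

# The two spellings differ code point by code point...
raw_equal = precomposed == decomposed

# ...but they are canonically equivalent, so normalizing both sides
# to the same form (NFC here) makes the comparison succeed.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))

# U+00B5 MICRO SIGN is only compatibility-equivalent to U+03BC GREEK SMALL
# LETTER MU: canonical normalization leaves it alone, NFKC maps it to mu.
micro_nfc = unicodedata.normalize("NFC", "\u00b5")
micro_nfkc = unicodedata.normalize("NFKC", "\u00b5")

print(raw_equal, nfc_equal, micro_nfc == "\u03bc", micro_nfkc == "\u03bc")
```

Comparison code that normalizes both operands first does not have to care
which of the two a-umlaut spellings it was handed.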

-- 
According to Murphy, nothing goes according to Hoyle.
--------------------------------------------------------------------------
Tex Texin                      Director, International Business
mailto:Texin@Progress.com      +1-781-280-4271 Fax:+1-781-280-4655
Progress Software Corp.        14 Oak Park, Bedford, MA 01730

http://www.Progress.com #1 Embedded Database

Globalization Program http://www.Progress.com/partners/globalization.htm
---------------------------------------------------------------------------


