RE: FW: Unicode Hangul and Internet

From: Chris Pratley (chrispr@microsoft.com)
Date: Fri Apr 23 1999 - 02:11:41 EDT


Excel97, like Word97, performs a very simple run-length compression on
Unicode in its binary file format. Streams of data above a certain length
that have zero in the high byte are marked as such and saved with just the
low bytes (which you might mistake for ASCII if you are looking in the
binary file with a hex editor). This was done strictly to minimize file size
bloat for users when the applications moved to Unicode from Office95's
code-page based file format. This type of compression has some advantages:
it is simple to code, it is fast, English text and numbers compress very well,
and there is no bloating for Asian languages. It is not the most efficient for
European or other non-English, non-Asian languages, but the focus at the
time was on keeping the file size down for files with numerical data and
English text without bloating Asian text.
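
For anyone who wants to see the idea concretely, here is a minimal sketch in
Python. It is not the actual Excel97/Word97 file layout: the one-byte flag,
the whole-string granularity, and the function names compress_string and
decompress_string are all assumptions made for illustration. It only shows
the principle of dropping the high byte when every UTF-16 code unit has a
zero high byte.

```python
def compress_string(text: str) -> bytes:
    """Store text as a 1-byte flag plus payload (hypothetical layout).

    Flag 0: every UTF-16 code unit has a zero high byte, so only the
            low bytes are stored (one byte per character).
    Flag 1: the full UTF-16LE code units are stored unchanged.
    """
    utf16 = text.encode("utf-16-le")
    high_bytes = utf16[1::2]  # in little-endian order the high byte comes second
    if not any(high_bytes):
        return b"\x00" + utf16[0::2]  # keep only the low bytes
    return b"\x01" + utf16


def decompress_string(data: bytes) -> str:
    """Invert compress_string."""
    flag, payload = data[0], data[1:]
    if flag == 0:
        # Code units 0x0000-0x00FF coincide with Latin-1, so decoding the
        # low bytes as Latin-1 restores the original characters.
        return payload.decode("latin-1")
    return payload.decode("utf-16-le")
```

In this sketch, "Total 42" is stored in 9 bytes instead of 16, a Hangul string
costs only the one extra flag byte, and a string such as "Łódź" is stored
uncompressed because a single character above U+00FF forces the full form.
That last case is the reduced efficiency for some European text mentioned
above.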

Of course, whatever transformation is used internally in these binary file
formats is of no consequence since it is a private data store. All data
exported by these apps is always in one of the accepted forms of Unicode if
Unicode data is specified. The file formats themselves are documented in
MSDN for anyone who needs to read them without Excel/Word.

Chris Pratley
Microsoft Office Program Manager

-----Original Message-----
From: Peck, Jon [mailto:peck@spss.com]
Sent: April 21, 1999 8:59 AM
To: Unicode List
Subject: RE: FW: Unicode Hangul and Internet

Concerning the extract below, it appears that Excel, at least, does not keep
its files exactly in a Unicode format. It seems to use a format that
emphasizes ascii, but it doesn't use UTF-8. I am curious why UTF-8 wasn't
chosen.

- Kim Peck
SPSS Inc
peck@spss.com

-----Original Message-----
From: Chris Pratley [mailto:chrispr@microsoft.com]
Sent: Tuesday, April 20, 1999 11:29 PM
To: Unicode List
Subject: RE: FW: Unicode Hangul and Internet
[snip]
In reality, Unicode *is* being used all over the world. Without trying to
sound too grandiose, it is important to realize that Unicode is incredibly
widely used today - the people using it just don't realize it. Around the
world and in Asia, in Japan and all parts of China, it is a safe bet that
>50% of text being written today on computers is stored as Unicode
(Microsoft Word97, Word98, and JustSystems' Ichitaro). In Asia, over 50% of
Internet content, in particular around 65% of Korean content is now viewed
in Unicode (Internet Explorer), even if the web content itself is not stored
in Unicode. This trend is continuing: Word97 Korean is already Unicode,
Hangul and Computer plans to move their AreA Hangul word processor to
Unicode, and Navigator 5 will be based on Unicode. Besides word processors
and browsers, there are other programs used everyday in Korea like Excel97
and PowerPoint97 that use precomposed Unicode Hangul syllables. And
Office2000 will add Access2000, based on Unicode, using precomposed Hangul.
So, Unicode "non-adoption" is just a myth.

[snip]

Chris Pratley
Microsoft Office Program Manager


