RE: Translated IUC10 Web pages: Experimental Results

From: Chris Pratley (chrispr@MICROSOFT.com)
Date: Tue Feb 04 1997 - 18:04:35 EST


A few comments on these html files and Word97's capabilities.

Word97 supports UCS2 (little-endian) for text files.
Word97 supports UTF-8 for HTML (but not UCS2).

This is why Word opens the true UTF-8 sites such as
http://www.cm.spyglass.com/unicode/iuc10/x-utf8.html
as Web pages, and the UCS2 little-endian pages as plain text.
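
To make the distinction concrete, here is a minimal Python sketch of
telling these encodings apart by the byte-order mark at the start of a
file. It is only an illustration and not a description of how Word97
decides; the function name sniff_bom is hypothetical, and files without
a BOM are simply reported as "unknown".

```python
import codecs

def sniff_bom(data: bytes) -> str:
    """Guess an encoding from a leading byte-order mark (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE: little-endian UCS2
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF: big-endian UCS2
        return "utf-16-be"
    return "unknown"                           # no BOM: cannot tell from this alone
```

A file with no BOM needs some other hint (an HTTP header, a META tag, or
a trial decode), which this sketch does not attempt.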

Our assumption was that UTF-8 was the only Web-safe encoding that was
reasonably likely to be adopted by browsers in the near future. Is that
the consensus, or are raw UCS2 encodings being considered actively by
people on this alias?

Word97 will not open big-endian UCS2:
http://194.75.134.50/unicode/iuc10/x-ucs2.html

These are treated as text files by Word97, since it does not support
parsing UCS2 HTML:
http://www.lang.duke.edu/unichtm/unilang.htm
http://194.75.134.50/unicode/iuc10/x-ucs2l.html
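
As a workaround sketch (an assumption on my part, not a Word97 feature),
a big-endian UCS2 file could be rewritten as little-endian with a
leading FF FE mark before opening it. The helper name ucs2_be_to_le is
hypothetical, and the input is assumed to be well-formed big-endian
UCS2.

```python
import codecs

def ucs2_be_to_le(data: bytes) -> bytes:
    """Re-encode big-endian UCS2 bytes as little-endian with a BOM."""
    # Drop a big-endian BOM (FE FF) if one is present.
    if data.startswith(codecs.BOM_UTF16_BE):
        data = data[len(codecs.BOM_UTF16_BE):]
    # UCS2 is the BMP-only subset of UTF-16, so the UTF-16 codecs apply.
    text = data.decode("utf-16-be")
    # Write the little-endian BOM (FF FE) followed by little-endian code units.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
```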

Also, it is interesting to note that
http://194.75.134.50/unicode/iuc10/x-ucs2l.html
contains a META tag claiming the file is UTF-8, although of course it is
not. This is one of the dangers of using META tags, or of changing
encodings of existing files without handling META tags, depending on
your viewpoint.
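
A rough Python sketch of a check for this kind of mismatch follows. It
assumes the file starts with a UCS2 byte-order mark and that the META
tag sits in the first 2048 bytes; the function names and the regular
expression are hypothetical and far from a complete HTML parser.

```python
import codecs
import re

# Matches charset=... inside a META tag once NUL bytes have been removed.
META_CHARSET = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def declared_charset(data: bytes):
    """Return the charset named in the file's head, lowercased, or None."""
    # In UCS2 every ASCII character is paired with a NUL byte; drop the
    # NULs so the ASCII pattern can match in either byte order.
    head = data[:2048].replace(b"\x00", b"")
    match = META_CHARSET.search(head)
    return match.group(1).decode("ascii").lower() if match else None

def meta_contradicts_bom(data: bytes) -> bool:
    """True if the file has a UCS2 BOM but its META tag names another charset."""
    declared = declared_charset(data)
    if declared is None:
        return False
    has_ucs2_bom = data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
    ucs2_names = ("utf-16", "ucs-2", "iso-10646-ucs-2", "unicode")
    return has_ucs2_bom and not declared.startswith(ucs2_names)
```

A file converted to UCS2 without updating its META tag would be flagged
by meta_contradicts_bom; a plain UTF-8 file declaring UTF-8 would not.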

I'd be interested in the repro steps and environment that led to the
PPT97 paste via clipboard failing. I have no trouble doing this on my
Japanese NT4/Word97 setup.

Chris

                -----Original Message-----
                From: Lori Brownell

                FYI

                -----Original Message-----
                From: [SMTP:becker.osbu_north@xerox.com]
                Sent: 4 February 1997, 11:54
                To: Lori Brownell; charles.wicksteed@reuters.com;
                    misha.wolf@reuters.com
                Cc: unicode@unicode.org; becker.osbu_north@xerox.com
                Subject: Re: Translated IUC10 Web pages: Experimental Results

                Thank you all, we're clearly well on the road though not yet
                arrived. Here are a few observations with NT 4.0 and Office 97,
                using the Bitstream Cyberbit font handed out at IUC9:

                Charles> I have added ...
                Charles> http://194.75.134.50/unicode/iuc10/x-ucs2l.html
                Charles> (UCS-2, least significant byte first, MicrosoFFFE)

                Thank you for going to this trouble. My first experiences with
                this are:

                    o Netscape 3.0 loads the page, shows the first couple dozen
                      characters (as ASCII/garbage); attempting to download it,
                      Netscape similarly truncates the file very early

                    o MS IE 3.0 cannot open the page

                    o Word 97 opens it (via the procedure below) as correct
                      Unicode plaintext HTML source

                        o Word 97 Save As ... Unicode Text correctly writes this
                          as a MicrosoFFFE file that can e.g. be read by NT
                          Notepad

                        o Clipboard copy/paste to NT Notepad also works

                        o Clipboard paste to PowerPoint 97 is rejected ("error")

                Charles> http://194.75.134.50/unicode/iuc10/x-ucs2.html
                Charles> (UCS-2, most significant byte first)

                    o Word 97 opens the first several lines as correct plaintext
                      HTML source, then starts a huge stream of random bytes
                      right in the middle of the first <img> tag, namely after
                      "... <img a" (i.e. it goes bonkers after the "a" in "alt")

                Chris> Select this URL below
                Chris> http://www.cm.spyglass.com/unicode/iuc10/x-utf8.html
                Chris> Edit/Copy
                Chris> File/Open (in Word97)
                Chris> Paste into the filename box
                Chris> OK

                This works beautifully, thank you! Word 97 Save As ... Unicode
                Text also correctly writes this as a MicrosoFFFE text file, thus
                providing perhaps the simplest path to extract all the text back
                out of this page.

                I also tried these Unicode multilingual sample pages:

                http://www.lang.duke.edu/unichtm/unilang8.htm -- presence/absence
                of BOM unknown

                    o Netscape 3.0 (with Registry hack) loads the page fine

                        o Clipboard copy/paste to NT Notepad treats text as
                          ASCII, i.e. high-order characters garbaged

                    o Word 97 opens the page as ASCII, high-order characters
                      garbaged

                http://www.lang.duke.edu/unichtm/unilang.htm -- little-endian
                UCS-2, presence/absence of BOM unknown

                    o Word 97 opens the page correctly

                Joe


