RE: Microsoft Word Query

From: Peter_Constable@sil.org
Date: Mon Mar 19 2001 - 11:43:41 EST


FYI: these are all the items related to text export / import in Word 2000
that I reported to MS back in January 1999. I was testing the corporate
preview beta; I later redid some of these tests with the release, and those
problems still existed.

I've had to remove one of the files that was originally attached since at
19K it exceeds this list's file size limit.

- Peter

----- Forwarded by Peter Constable/IntlAdmin/WCT on 03/19/2001 10:41 AM
-----

[snip]

All of the remaining items arise from tests that I did importing and
exporting text into various formats and encodings. For these tests, I used
a document containing every codepoint from U+0020 to U+FFFF, or (for import
tests) containing every 8-bit codepoint from x20 to xFF.
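
For anyone wanting to repeat tests of this kind, comparable input can be
generated with a short script. This is only a sketch, not the procedure
used for the report: the function names are hypothetical, and skipping the
surrogate range U+D800 to U+DFFF is an assumption made here, since lone
surrogates cannot be written out as well-formed UTF-8 or UTF-16.

```python
def bmp_test_text():
    """Every BMP code point from U+0020 to U+FFFF, one per position.

    Assumption: surrogates (U+D800..U+DFFF) are skipped because they
    cannot be encoded on their own.
    """
    return "".join(
        chr(cp) for cp in range(0x20, 0x10000)
        if not 0xD800 <= cp <= 0xDFFF
    )


def eight_bit_test_bytes():
    """Every 8-bit value from x20 to xFF, for the import tests."""
    return bytes(range(0x20, 0x100))


text = bmp_test_text()
raw = eight_bit_test_bytes()
print(len(text), len(raw))
```

Because character N of the text is a known code point, any substitution
made on export or import can be found by comparing position by position.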

2) Request for different functionality: Re save as text only: On exporting
to (8-bit) text only (on a cp1252 system), the following translations take
place:

All characters map to the 8-bit codepoints for that character in cp1252
except for the following:

201E (x84 in cp1252) > x22 (dbl low-9 quotation mk > straight quotation
mk)
2026 (x85 in cp1252) > x2E x2E x2E (ellipsis > "...")
2018 (x91 in cp1252) > x27 (left single curly quote > single straight
quote)
2019 (x92 in cp1252) > x27 (right single curly quote > single straight
quote)
201C (x93 in cp1252) > x22 (left dbl curly quote > dbl straight quote)
201D (x94 in cp1252) > x22 (right dbl curly quote > dbl straight quote)
2013 (x96 in cp1252) > x2D (en dash > hyphen)
2014 (x97 in cp1252) > x2D (em dash > hyphen)
2122 (x99 in cp1252) > x28 x74 x6D x29 (trademark symbol > "(tm)")
200C (not in cp1252, but this is what I got by entering d157 = x9D from
keypad) > x3F (ZWNJ > "?")
200D (not in cp1252, but this is what I got by entering d158 = x9E from
keypad) > x3F (ZWJ > "?")

00A0 > x20 (non-breaking space > space)
00A9 > x28 x63 x29 (copyright symbol > "(c)")
00AE > x28 x72 x29 (registered symbol > "(r)")
00BC > x31 x2f x34 (fraction one quarter > "1/4")
00BD > x31 x2f x32 (fraction one half > "1/2")
00BE > x33 x2f x34 (fraction three quarters > "3/4")

I did not do any tests with any other codepage active. The same results
were achieved using Save as encoded text, and choosing Western European
(Windows) as the encoding option.

This is essentially the same behaviour as occurred with Word 97 (there are
three new translations in this version). While all of these make
some sense, they are not necessarily what a user would want. In many cases,
people may have these cp1252 characters in their document and want to keep
them intact when they export their document to text only. That is true even
for people using the standard cp1252 characters, but there are also a lot of
people out there who are using fonts with non-standard character sets
masquerading as cp1252. In these situations, such translations probably
would not make any sense and would be very problematic.

It would be really nice if there were a way for users to turn off these
translations, i.e. to have every codepoint that can be entered by
ALT+0nnn on the keypad mapped to dnnn when saving to text.
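
The difference between the observed export and the requested one can be
sketched as follows. This is a hypothetical model, not Word's actual
implementation: the table holds only the substitutions listed under item 2
for characters that cp1252 itself can represent, and both function names
are made up for illustration.

```python
# Substitutions observed in the report when saving as 8-bit text on a
# cp1252 system (only the characters that cp1252 itself can represent).
OBSERVED = {
    "\u201e": '"', "\u2026": "...", "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"', "\u2013": "-", "\u2014": "-",
    "\u2122": "(tm)", "\u00a0": " ", "\u00a9": "(c)", "\u00ae": "(r)",
    "\u00bc": "1/4", "\u00bd": "1/2", "\u00be": "3/4",
}


def export_with_translations(text):
    """Hypothetical model of the reported behaviour."""
    return "".join(OBSERVED.get(ch, ch) for ch in text).encode("cp1252")


def export_verbatim(text):
    """What the request asks for: a plain cp1252 encoding."""
    return text.encode("cp1252")


sample = "\u201cWait\u2026\u201d \u00a9 2001"
print(export_with_translations(sample))
print(export_verbatim(sample))
```

Only the verbatim export can be decoded back to the original string; the
translated one loses the distinction between curly and straight quotes.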

3) Bug report: Re Save as encoded text (Unicode): On exporting to Unicode,
11 non-ASCII characters were translated into ASCII characters:

U+00A0 > ' ' (U+0020)
U+00AB > '"' (U+0022)
U+00BB > '"' (U+0022)
U+2005 > ' ' (U+0020)
U+2013 > '-' (U+002D)
U+2014 > '-' (U+002D)
U+2018 > ''' (U+0027)
U+2019 > ''' (U+0027)
U+201C > '"' (U+0022)
U+201D > '"' (U+0022)
U+201E > '"' (U+0022)

I think it would be better if these translations did not take place. Note
that they constitute a violation of Unicode conformance requirement C10.

4) Bug report: Re Save as encoded text (UTF-8): On exporting to UTF-8,
there were a total of 16 non-ASCII characters that were translated to
ASCII:

U+00A0 > ' ' (U+0020)
U+00A9 > "(c)" (U+0028 U+0063 U+0029)
U+00AB > '"' (U+0022)
U+00AE > "(r)" (U+0028 U+0072 U+0029)
U+00BB > '"' (U+0022)
U+00BC > "1/4" (U+0031 U+002f U+0034)
U+00BD > "1/2" (U+0031 U+002f U+0032)
U+00BE > "3/4" (U+0033 U+002f U+0034)
U+2005 > ' ' (U+0020)
U+2013 > '-' (U+002D)
U+2014 > '-' (U+002D)
U+2018 > ''' (U+0027)
U+2019 > ''' (U+0027)
U+201C > '"' (U+0022)
U+201D > '"' (U+0022)
U+201E > '"' (U+0022)

Again, I don't think you really want these translations taking place, in
violation of Unicode conformance requirements.
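
As a quick check that nothing in the encodings themselves forces these
substitutions, the affected characters can be round-tripped through UTF-8
and UTF-16. This is a minimal sketch using Python's standard codecs; the
names AFFECTED and survives are hypothetical.

```python
# The 16 characters listed under item 4 (item 3 lists 11 of them).
AFFECTED = (
    "\u00a0\u00a9\u00ab\u00ae\u00bb\u00bc\u00bd\u00be"
    "\u2005\u2013\u2014\u2018\u2019\u201c\u201d\u201e"
)


def survives(text, encoding):
    """True if encoding then decoding returns the identical string."""
    return text.encode(encoding).decode(encoding) == text


for enc in ("utf-8", "utf-16-le"):
    print(enc, survives(AFFECTED, enc))
```

The same comparison, applied to a file written by the exporter under test,
would flag every character that was replaced.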

5) Possible bug report (I'm not sure what's going on here, so I don't know
if this is a problem): Re Save as RTF: On exporting to RTF, some characters
that are not (as far as I am aware) part of any MBCS are stored as
\uc2\uN\'xx\'xx where N is the signed integer representation of the Unicode
value and xx are hex digits. For example, U+01D0 (LATIN SMALL LETTER I WITH
CARON) was stored as

\uc2\u464\'a8\'ab

If this is expected behaviour, I'd be interested in knowing what's going
on.
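
One possible reading, offered as an assumption rather than anything
confirmed by Microsoft: in RTF, \ucN states how many fallback bytes follow
each \uN keyword for readers that do not understand Unicode, and \uN is the
character value as a signed 16-bit decimal (464 = x01D0). The bytes A8 AB
are the GB2312 encoding of U+01D0, which would suggest that the character
does belong to a double-byte character set after all. The helper below is
hypothetical and only reproduces the escape seen in the file.

```python
def rtf_unicode_escape(ch, fallback_encoding="gb2312"):
    """Build an RTF Unicode escape for one BMP character.

    Assumption: the fallback bytes come from a legacy code page; gb2312
    is chosen only because it reproduces the bytes seen in the report.
    """
    cp = ord(ch)
    # The numeric argument is a signed 16-bit value, so code points
    # above 0x7FFF are written as negative numbers.
    signed = cp - 0x10000 if cp > 0x7FFF else cp
    fallback = ch.encode(fallback_encoding)
    hex_bytes = "".join("\\'%02x" % b for b in fallback)
    return "\\uc%d\\u%d%s" % (len(fallback), signed, hex_bytes)


print(rtf_unicode_escape("\u01d0"))
```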

6) Bug report: Re problem opening a UTF-8 file: To open a UTF-8 file, I
chose the file type "Encoded text" and expected to see a dialog asking me
what the encoding of the file was. Instead, it proceeded to attempt to read
the file as though it were a text-only file. I tried putting the byte order
mark (in UTF-8) at the beginning of the file, but this made no difference.
I finally added the bytes xFF xFE at the beginning of the file, and this
caused the second dialog to appear with Unicode chosen as the encoding. At
this point, I could specify the encoding as UTF-8, and the entire file was
interpreted correctly. (The first two bytes got translated into a space,
U+0020.)
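
For comparison, recognising the UTF-8 byte order mark takes only a prefix
check. The sketch below is a hypothetical illustration of such a check, not
a description of what Word does; sniff_bom is a made-up name. It also shows
why the workaround file was offered as Unicode: xFF xFE is the
little-endian UTF-16 mark.

```python
import codecs


def sniff_bom(data):
    """Return (encoding, bom_length) guessed from a leading byte order mark.

    Only the UTF-8 and UTF-16 marks are checked; None means no BOM found.
    """
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8", 3
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "utf-16-le", 2
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "utf-16-be", 2
    return None, 0


utf8_file = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")
hacked_file = b"\xff\xfe" + "caf\u00e9".encode("utf-8")
print(sniff_bom(utf8_file))
print(sniff_bom(hacked_file))
```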

7) Bug report: Re opening 8-bit, text only file: This file contained every
character from x20 to xFF. There were too many errors to list them for you
- it was a real mess. Even some ASCII characters were handled incorrectly.
Find attached the file I tested with (All Codepage Characters.txt) and the
DOC file I saved upon opening this file (All Codepage Characters - opened
from txt.doc).

8) Bug report: Re opening RTF file: I had saved my document containing
every character from U+0020 to U+FFFF as RTF, and the RTF file appeared to
keep every character intact (though I had the behaviour described in item 5
above which I couldn't explain - that behaviour seemed to have no
relationship to the problems referred to here). So, I tried opening this
file. I was surprised to find 55 errors, all but two in the range U+00A0 to
U+00FF. Most of these came across as Arabic! Here is a log of the errors I
found:

Report on errors in Unicode characters
Problem with character U+00A1: is '?', value = x060C
Problem with character U+00AA: is '?', value = xF897
Problem with character U+00BA: is '?', value = x061B
Problem with character U+00BF: is '?', value = x061F
Problem with character U+00C0: is '?', value = xF898
Problem with character U+00C1: is '?', value = x0621
Problem with character U+00C2: is '?', value = x0622
Problem with character U+00C3: is '?', value = x0623
Problem with character U+00C4: is '?', value = x0624
Problem with character U+00C5: is '?', value = x0625
Problem with character U+00C6: is '?', value = x0626
Problem with character U+00C7: is '?', value = x0627
Problem with character U+00C8: is '?', value = x0628
Problem with character U+00C9: is '?', value = x0629
Problem with character U+00CA: is '?', value = x062A
Problem with character U+00CB: is '?', value = x062B
Problem with character U+00CC: is '?', value = x062C
Problem with character U+00CD: is '?', value = x062D
Problem with character U+00CE: is '?', value = x062E
Problem with character U+00CF: is '?', value = x062F
Problem with character U+00D0: is '?', value = x0630
Problem with character U+00D1: is '?', value = x0631
Problem with character U+00D2: is '?', value = x0632
Problem with character U+00D3: is '?', value = x0633
Problem with character U+00D4: is '?', value = x0634
Problem with character U+00D5: is '?', value = x0635
Problem with character U+00D6: is '?', value = x0636
Problem with character U+00D8: is '?', value = x0637
Problem with character U+00D9: is '?', value = x0638
Problem with character U+00DA: is '?', value = x0639
Problem with character U+00DB: is '?', value = x063A
Problem with character U+00DC: is '?', value = x0640
Problem with character U+00DD: is '?', value = x0641
Problem with character U+00DE: is '?', value = x0642
Problem with character U+00DF: is '?', value = x0643
Problem with character U+00E1: is '?', value = x0644
Problem with character U+00E3: is '?', value = x0645
Problem with character U+00E4: is '?', value = x0646
Problem with character U+00E5: is '?', value = x0647
Problem with character U+00E6: is '?', value = x0648
Problem with character U+00EC: is '?', value = x0649
Problem with character U+00ED: is '?', value = x064A
Problem with character U+00F0: is '?', value = x064B
Problem with character U+00F1: is '?', value = x064C
Problem with character U+00F2: is '?', value = x064D
Problem with character U+00F3: is '?', value = x064E
Problem with character U+00F5: is '?', value = x064F
Problem with character U+00F6: is '?', value = x0650
Problem with character U+00F8: is '?', value = x0651
Problem with character U+00FA: is '?', value = x0652
Problem with character U+00FD: is '?', value = x200E
Problem with character U+00FE: is '?', value = x200F
Problem with character U+00FF: is '?', value = xF899
Problem with character U+0178: is '?', value = x009F
Problem with character U+02DC: is '?', value = x0098
55 errors found.
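
The pattern in this log looks consistent with the 8-bit fallback bytes in
the RTF being decoded as Windows code page 1256 (Arabic) rather than
cp1252. That is a hypothesis, not something established by these tests.
The sketch below checks it against a sample of the log using Python's
cp1256 table; that table is the current version of the code page, so it is
not expected to reproduce the xF897 to xF899, x009F and x0098 values. The
names LOG_SAMPLE and via_cp1256 are hypothetical.

```python
# A few (original code point, value seen after import) pairs from the log.
LOG_SAMPLE = {
    0x00A1: 0x060C, 0x00BA: 0x061B, 0x00BF: 0x061F, 0x00C1: 0x0621,
    0x00D8: 0x0637, 0x00DC: 0x0640, 0x00E1: 0x0644, 0x00FA: 0x0652,
    0x00FD: 0x200E, 0x00FE: 0x200F,
}


def via_cp1256(code_point):
    """Encode as cp1252, then misread the byte as cp1256 (Arabic)."""
    byte = chr(code_point).encode("cp1252")
    return ord(byte.decode("cp1256"))


matches = {cp: via_cp1256(cp) == seen for cp, seen in LOG_SAMPLE.items()}
print(all(matches.values()))

# Characters absent from the log, such as U+00D7 and U+00E0, sit at the
# same byte in both code pages, so they would come through unchanged.
print(via_cp1256(0x00D7) == 0x00D7, via_cp1256(0x00E0) == 0x00E0)
```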

These are all the concerns I wanted to bring to your attention at this
time. I hope this is useful for you. If you are interested in copies of any
of the other files involved in my testing, I'd be glad to send them to you.

Regards,
Peter Constable
Non-Roman Script Initiative, SIL

(See attached file: All Codepage Characters.txt)




