RE: Effect on file size when using extended fonts in Word 2000

From: Keber, Wolfgang (Wolfgang.Keber@dialogika.de)
Date: Tue Dec 19 2000 - 10:29:02 EST


Word's binary format (at least Word 97's) was described on Microsoft's DevNet
(e.g. MSDN Library - July 1999, search for: Microsoft Word 97 Binary File
Format).

Concerning the storage of code page/Unicode characters in the text area of
a Word document, we encountered the following peculiarity:

The Microsoft specification of the Word binary file format says that "...
The text in a Word document is ASCII text ...". That's obviously not true,
since Word is able to handle Unicode characters internally. Word applies the
following rule when storing Unicode characters in a file (both modes are
distinguished via the fExtChar flag in the FIB, cf. Binary File Format
specs):

- If the document contains Windows CP 1252 code page characters only (on an
English Windows), the 8-bit values of the characters are stored in the file.
Example: "abcäöü" would be stored as

...
(0400) 61 62 63 E4 F6 FC 00 00 00 00 00 00 00 00 00 00 abcäöü..........
...

- If the document contains true extended characters (e.g. Greek characters)
the entire text is stored as Unicode characters (UCS-2).
Example: "abc<alpha><beta><gamma>äöü" would be stored as

...
(0400) 61 00 62 00 63 00 B1 03 B2 03 B3 03 E4 00 F6 00 a.b.c.......ä.ö.
(0410) FC 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ü...............
...
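
To make the rule above concrete, here is a minimal Python sketch that
reproduces the byte sequences from the two dumps. It only illustrates the
rule as we observed it; it is not Word's actual code. The helper names
(encode_word_text, hex_bytes) are made up for this example, and using the
"cp1252" and "utf-16-le" codecs as stand-ins for the two storage modes is
an assumption.

```python
def encode_word_text(text: str) -> bytes:
    """Hypothetical helper: mimic the storage rule described above.

    Assumption: text that fits entirely in Windows CP 1252 is stored as
    8-bit values; otherwise the whole text is stored as 16-bit
    little-endian code units (UCS-2 for BMP characters).
    """
    try:
        return text.encode("cp1252")
    except UnicodeEncodeError:
        return text.encode("utf-16-le")


def hex_bytes(data: bytes) -> str:
    """Format bytes the way the dumps above show them."""
    return " ".join(f"{b:02X}" for b in data)


latin_only = "abc\u00e4\u00f6\u00fc"                    # "abcäöü"
with_greek = "abc\u03b1\u03b2\u03b3\u00e4\u00f6\u00fc"  # with alpha, beta, gamma

print(hex_bytes(encode_word_text(latin_only)))
# 61 62 63 E4 F6 FC
print(hex_bytes(encode_word_text(with_greek)))
# 61 00 62 00 63 00 B1 03 B2 03 B3 03 E4 00 F6 00 FC 00
```

Under this rule a single character outside CP 1252 is enough to switch the
whole text to two bytes per character, which would be consistent with the
doubling Ray describes.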

Best regards,

wk

-----Original Message-----
From: Michael (michka) Kaplan [mailto:michka@trigeminal.com]
Sent: Tuesday, December 19, 2000 3:02 PM
To: Unicode List
Cc: Winkler, Arnold F
Subject: Re: Effect on file size when using extended fonts in Word 2000

Word 2000 uses Unicode, and a somewhat bloated format for RTF as it always
has (extra tags around even the smallest pieces of text). To see more of
it, save your doc to HTML some time and look at the tags.... :-)

I believe the problem you are seeing has to do with limitations to
compression techniques more than anything else (text that is not all in the
same range will not have the kinds of similarities that make compression
effective, perhaps?).

But the actual internal storage is not documented and of course is entirely
subject to change. :-)

No matter what, if space is your main concern, then Word is really not your
ideal tool. I mean, I love Word to death (and am writing another book, 100%
in Word 2000 SA edition!) but not for its ability to be small in memory or
on disk.

michka

a new book on internationalization in VB at
http://www.i18nWithVB.com/

----- Original Message -----
From: "Dembek, Raymond F" <Ray.Dembek@unisys.com>
To: "Unicode List" <unicode@unicode.org>
Cc: "Winkler, Arnold F" <Arnold.Winkler@unisys.com>
Sent: Tuesday, December 19, 2000 5:48 AM
Subject: Effect on file size when using extended fonts in Word 2000

> Does anyone know the implications for file size when you add characters
> above 255 to a Word 2000 document on Windows 95/98/ME? Will this double
> the size of the paragraphs that contain these characters?
>
> I am primarily concerned with adding linedraw characters to paragraphs
> done in Courier New.
>
> We are getting some disproportionate increases in file size.
>
> For example, when one of the characters in each 1000-character paragraph
> is replaced with a character outside the lower 255, the file size doubles.
>
> Does Word 2000 store a paragraph as two-byte characters if one of the
> characters in it is a double-byte character?
>
> When I look at the RTF version of such a file it seems that only the
> characters that need two bytes get special coding and at least the lower
> 127 are all coded as ASCII.
>
> Please forgive the imprecise terminology in the above. This is still a
> very confusing area for me.
>
> Regards and thanks,
>
> Ray Dembek
>
> Raymond F. Dembek
> Unisys Corp. - Michigan
> mailto:Ray.Dembek@Unisys.com
> Voice: 1-248-661-9302
>
> "We e-eat, e-sleep and e-drink this e-stuff."
>


