Word's binary format (at least Word 97's) was described on Microsoft's DevNet
(e.g. MSDN Library - July 1999, search for: Microsoft Word 97 Binary File
Format).
Concerning the storage of code page/Unicode characters in the text area of
a Word document, we encountered the following peculiarity:
The Microsoft specification of the Word binary file format says that "... The
text in a Word document is ASCII text ...". That's obviously not true since
Word is able to handle Unicode characters. Word applies the following rule
when storing characters in a file (both modes are distinguished via the
fExtChar flag in the FIB, cf. Binary File Format specs):
- If the document contains Windows CP 1252 code page characters only (on an
the 8-bit values of the characters are stored in the file.
Example: "abcäöü" would be stored as
(0400) 61 62 63 E4 F6 FC 00 00 00 00 00 00 00 00 00 00 abcäöü..........
- If the document contains true extended characters (e.g. Greek characters)
the entire text is stored as Unicode characters (UCS-2).
Example: "abc<alpha><beta><gamma>äöü" would be stored as
(0400) 61 00 62 00 63 00 B1 03 B2 03 B3 03 E4 00 F6 00 a.b.c.......ä.ö.
(0410) FC 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ü...............
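The rule above can be sketched in a few lines of Python. This is only an illustration of the behaviour we observed, not Word's actual implementation: the names encode_text_area and hex_bytes are made up for this example, the returned boolean merely stands in for the fExtChar flag, and "utf-16-le" is our assumption for the UCS-2 byte order, based on the low-byte-first order in the dump above.

```python
from typing import Tuple


def encode_text_area(text: str) -> Tuple[bytes, bool]:
    """Hypothetical helper: return (stored bytes, wide flag) for a text run.

    The boolean stands in for the fExtChar flag; it is an illustration only.
    """
    try:
        # Mode 1: every character exists in Windows code page 1252,
        # so one byte per character is enough.
        return text.encode("cp1252"), False
    except UnicodeEncodeError:
        # Mode 2: at least one character is outside CP 1252, so the
        # entire text is stored as 16-bit little-endian code units.
        return text.encode("utf-16-le"), True


def hex_bytes(data: bytes) -> str:
    """Format bytes the way the dumps above show them."""
    return " ".join(f"{b:02X}" for b in data)


# "abc" followed by a-umlaut, o-umlaut, u-umlaut
narrow, narrow_wide = encode_text_area("abc\u00e4\u00f6\u00fc")
# the same text with Greek alpha, beta, gamma inserted after "abc"
wide, wide_wide = encode_text_area("abc\u03b1\u03b2\u03b3\u00e4\u00f6\u00fc")

print(hex_bytes(narrow))  # 61 62 63 E4 F6 FC
print(hex_bytes(wide))    # 61 00 62 00 63 00 B1 03 B2 03 B3 03 E4 00 F6 00 FC 00
```

Under this sketch, replacing a single character in a 1000-character run with a box-drawing character such as U+2500 takes the stored text from 1000 to 2000 bytes, which would be consistent with the doubling reported in the original question below.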
From: Michael (michka) Kaplan [mailto:firstname.lastname@example.org]
Sent: Tuesday, December 19, 2000 3:02 PM
To: Unicode List
Cc: Winkler, Arnold F
Subject: Re: Effect on file size when using extended fonts in Word 2000
Word 2000 uses Unicode, and a somewhat bloated format for RTF as it always
has (extra tags around even the smallest pieces of text). To see more of
it, save your doc to HTML some time and look at the tags.... :-)
I believe the problem you are seeing has to do with limitations to
compression techniques more than anything else (text that is not all in the
same range will not have the kinds of similarities that make compression
But the actual internal storage is not documented and of course is entirely
subject to change. :-)
No matter what, if space is your main concern, then Word is really not your
ideal tool. I mean, I love Word to death (and am writing another book, 100%
in Word 2000 SA edition!) but not for its ability to be small in memory or
a new book on internationalization in VB at
----- Original Message -----
From: "Dembek, Raymond F" <Ray.Dembek@unisys.com>
To: "Unicode List" <email@example.com>
Cc: "Winkler, Arnold F" <Arnold.Winkler@unisys.com>
Sent: Tuesday, December 19, 2000 5:48 AM
Subject: Effect on file size when using extended fonts in Word 2000
> Does anyone know the implications on file size when you add characters
> above 255 to a Word 2000 document on Windows 95/98/ME? Will this double the
> size of the paragraphs that contain these characters?
> I am primarily concerned with adding linedraw characters to paragraphs
> in Courier New.
> We are getting some disproportionate increases in file size.
> For example, when one of the characters in each 1000-character paragraph is
> replaced with a character outside the lower 255, the file size doubles.
> Does Word 2000 store a paragraph as two-byte characters if one of the
> characters in it is a double-byte character?
> When I look at the RTF version of such a file it seems that only the
> characters that need two bytes get special coding and at least the lower
> 255 are all coded as ASCII.
> Please forgive the imprecise terminology in the above. This is still a
> confusing area for me.
> Regards and thanks,
> Ray Dembek
> Raymond F. Dembek
> Unisys Corp. - Michigan
> Voice: 1-248-661-9302
> "We e-eat, e-sleep and e-drink this e-stuff."