> As far as encoding goes (not considering input or complex
> rendering issues), Word 97 uses Unicode. That is the encoding
> it uses in its internal memory representation, regardless of OS
> it is running on. Ditto for later versions. The encoding form
> it uses is UCS-2; the next version of Word will support UTF-16.
> It can also input and output UTF-8.
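The practical difference between UCS-2 and UTF-16 only shows up for characters outside the Basic Multilingual Plane, which UTF-16 encodes as surrogate pairs and UCS-2 simply cannot represent. A small illustrative sketch (Python here purely for demonstration; this has nothing to do with Word's internals):

```python
# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so UTF-16
# needs two 16-bit code units (a surrogate pair) to encode it.
# UCS-2 has no way to express this character at all.
clef = "\U0001D11E"
encoded = clef.encode("utf-16-le")
print(len(encoded))       # 4 bytes = two 16-bit code units
print(len(encoded) // 2)  # 2 code units, i.e. a surrogate pair
```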
So (what I meant was) suppose there is a plain-text file, foo.txt (or foo.doc)
and I open it from the File menu. How does the application
know whether it's Unicode (and if Unicode, whether it's UTF-8 or UTF-16) or a
Windows Code Page? By inspection / statistical analysis? I can see how this
might work for telling the difference between a Windows Code Page and
UCS-2/UTF-16 (since the latter would include a lot of NULs), but once you
allow UTF-8 into the picture, it gets muddier, no?
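For what it's worth, here is the sort of inspection heuristic I have in mind, sketched in Python. The function name, the 30% NUL threshold, and the cp1252 fallback are my own inventions for illustration, not anything Word actually does:

```python
def guess_encoding(data: bytes) -> str:
    """Guess among UTF-8, UTF-16, and a Windows code page by inspection."""
    # Byte-order marks are the cheapest and most reliable signal.
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    # UCS-2/UTF-16 stores ASCII as pairs like b'A\x00', so a high
    # proportion of NUL bytes strongly suggests a 16-bit encoding.
    if data and data.count(0) / len(data) > 0.3:
        return "utf-16-le"
    # UTF-8 has a strict byte pattern; non-trivial text that decodes
    # cleanly as UTF-8 is rarely anything else -- this is where the
    # picture gets muddier, since pure ASCII passes this test too.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "cp1252"  # fall back to a Windows code page
```

Pure ASCII input is the ambiguous case: it is valid UTF-8 *and* valid in every Windows code page, which is exactly why a guess can be wrong.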
What if the application guesses wrong? Can the user specify the encoding? A
short tour through the menus of Word 97 and WordPad didn't show any way to do
this; maybe I missed it, or there is a way in later versions?
Windows has "three worlds" of character sets: OEM code pages, Windows
code pages, and Unicode. I'm trying to get an idea of the degree to which
they overlap, and the degree to which they are distinct and separate:
. Across applications (Edit, NotePad, WordPad, Word)
. Across OS's (Win 95, 98, NT, 2000)
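To make the "three worlds" concrete: the same byte value can name different characters in an OEM code page and a Windows code page, while Unicode names each character unambiguously. A small Python illustration (the particular code pages are just examples):

```python
import unicodedata

# One byte, three worlds: 0x82 is a low quotation mark in the
# Windows code page cp1252, but 'é' in the OEM code page cp437.
raw = b"\x82"
print(raw.decode("cp1252"))  # '‚' (U+201A, Windows code page)
print(raw.decode("cp437"))   # 'é' (U+00E9, OEM code page)

# Unicode resolves the ambiguity with a unique name per character:
print(unicodedata.name(raw.decode("cp437")))  # LATIN SMALL LETTER E WITH ACUTE
```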
For example, does Edit in NT use OEM code pages? Which versions of
WordPad and/or NotePad can read a Unicode file? And so on. My motivation for
asking this is to write a clear statement of how plain text should be
imported into Windows from non-Microsoft platforms such as Unix, VMS, etc.
How much does it depend on the application that will be using it, the
version of that application, and which Windows OS it is running on?
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:58 EDT