RE: Developing multilingual web sites

From: Chris Pratley (chrispr@MICROSOFT.com)
Date: Tue Mar 21 2000 - 23:13:11 EST


Aaron, as a software designer the idea that you are changing the system
locale and rebooting on a regular basis makes me cringe. Let me know if what
I describe below doesn't help.

First, I want to clarify that the "native encoding" of Windows2000 is
Unicode. What you are referring to as "native encoding" is actually the
emulated encoding, usually called the ANSI code page of the system. In Hong
Kong, this should be "Big5" encoding.

Next, when you copy to clipboard in Win2000, the clipboard actually contains
the same data in many formats: HTML (in Unicode UTF-8), ANSI plain text
(i.e. the emulated encoding), Unicode (UCS-2) plain text, RTF, etc. When you
paste, the receiving application chooses the format it prefers. Some
applications cannot handle all formats. Specifically, older or
non-multilingual applications may not handle the Unicode formats. I think
this is what is happening when you paste into your non-Microsoft
applications (if the data is truly destroyed). When you paste into Notepad
(a Unicode application), you have a different problem - the font in Notepad
is not set correctly for viewing your text. Change the Notepad font to one
that works.
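
As a rough illustration of why an ANSI-only paste can destroy text, here is
a minimal Python sketch. It is not the Windows clipboard API: the function
name and the dictionary keys are hypothetical, and it assumes the system
ANSI code page is Big5 (Python's "cp950" codec).

```python
import html

# Hypothetical model of a clipboard that holds one text in several formats.
# The keys are illustrative names, not real clipboard format identifiers.
def render_clipboard_formats(text: str, ansi_codepage: str = "cp950") -> dict:
    """Encode the same text in the formats described above."""
    return {
        "html_utf8": ("<p>" + html.escape(text) + "</p>").encode("utf-8"),
        # Characters missing from the ANSI code page are replaced with "?".
        "ansi_text": text.encode(ansi_codepage, errors="replace"),
        "unicode_text": text.encode("utf-16-le"),
    }

formats = render_clipboard_formats("한글")  # Korean text on a Big5 system

# A Unicode-aware application picks the Unicode format and loses nothing.
print(formats["unicode_text"].decode("utf-16-le"))

# An ANSI-only application picks the emulated encoding; the Hangul is lost.
print(formats["ansi_text"])
```

Chinese text would survive in the "ansi_text" entry here, because Big5 can
represent it; only characters outside the system code page are damaged.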

In any case, if your goal is to *paste* ANSI (non-Unicode) text into your
applications, and they really do only handle ANSI plain text and not even
HTML, then you need to change the system locale as you mentioned. In my
opinion, it would be much easier to use a tool that supports Unicode.

However, if your tools support import of HTML or plain text, you do have
some other options:
One of the easiest ways is to paste or open the file in IE5, then do a Save
As, then pick the encoding you want. IE handles re-encoding the file the way
you want.
Another option is to open the file in Word2000, set the encoding property in
Tools/Options/General/Web Options/encoding, then save again as the new
encoding. This will add additional Word-generated HTML to your file, though,
so it may not work for your purposes. If you just want the plain text and
not HTML, you can try File/Save As.../Encoded text, then pick Big5. This
saves out plain text in the chosen "ANSI" encoding which you can then
import.
FrontPage 2000 can also do what you want, and one feature of FrontPage 2000
intended for site developers is that it does not modify the HTML you load
into it unless you edit it. So you can load the file into FrontPage, then
save as another encoding (set the encoding in File/Properties/Language).
You can also read more in my articles on Office2000 on www.multilingual.com
issues 24 and 26.
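
The re-encoding that these Save As options perform can also be approximated
in a script. The following is a minimal Python sketch, not what IE, Word, or
FrontPage do internally; the function name and the default encodings are
assumptions chosen for illustration.

```python
def reencode_file(src: str, dst: str,
                  src_encoding: str = "utf-8",
                  dst_encoding: str = "big5") -> None:
    """Read src as src_encoding and write the same text to dst as dst_encoding."""
    # newline="" keeps the original line endings untouched in both directions.
    with open(src, encoding=src_encoding, newline="") as infile:
        text = infile.read()
    # errors="strict" raises UnicodeEncodeError when a character has no
    # equivalent in the target encoding, instead of silently writing "?".
    with open(dst, "w", encoding=dst_encoding, errors="strict", newline="") as outfile:
        outfile.write(text)
```

For an HTML file, any charset declaration inside the document would still
need to be updated by hand to match the new encoding; this sketch converts
only the bytes.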

Chris Pratley
Group Program Manager
Microsoft Word

Sent with Office2000 SR1 wordmail

-----Original Message-----
From: Aaron Delwiche [mailto:aaron@lemon.com.hk]
Sent: Tuesday, March 21, 2000 6:12 PM
To: Unicode List
Subject: Re: Developing multilingual web sites

Hello,

I'd just like to thank Suzanne and everyone else on this list who took the
time to reply to my message.

Several people explained that it is difficult to preserve native code
formatting when storing multiple languages in a single database. We will
ultimately be using a content management system to drive the site, but my
question to the UNICODE list was motivated by the difficulties I faced in
producing a handful of static pages.

The real problems arose when I attempted to cut and paste text from one
application (Outlook Express or Microsoft Word) into another application
(Notepad, Homesite, or Adobe Photoshop). Korean and Japanese characters that
looked great in Unicode format immediately lost all of their formatting when
pasted into these other applications. I flirted with Windows 2000, Union Way
and RichWin, but nothing seemed to do the trick.

Nadine Kano's article "Multilingual Setup for Windows 2000 Professional" on
http://www.multilingual.com contained the solution to my problem. In order
to cut and paste in native code format, it is necessary to change the
"default system locale" in the regional settings section of the control
panel. According to Ms. Kano:

"Windows 2000 can emulate this [local] environment, but it can only emulate
one such environment at a time. In the vernacular, we label this environment
the default ANSI code page or the default system locale. In either case, you
indicate which ANSI-based environment Windows can emulate during this part
of setup. The default system locale also determines which character tables
Windows uses to map strings between ANSI-based character encodings and
Unicode, for example strings that an application might send to Windows
through a wide-character application program interface."
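
The effect of having only one active code page can be sketched in a few
lines of Python. This is an illustration under stated assumptions, not a
description of Windows internals: it uses Python's "shift_jis" and "cp950"
codecs to stand in for a Japanese and a Traditional Chinese system locale.

```python
text = "日本語"
raw = text.encode("shift_jis")  # bytes as a Japanese ANSI application stores them

# Decoded with the matching code page, the text round-trips.
as_japanese = raw.decode("shift_jis")

# Decoded with a different code page (Big5 here), the same bytes are misread.
as_big5 = raw.decode("cp950", errors="replace")

print(as_japanese == text)
print(as_big5 == text)
```

The bytes themselves never change; only the table used to interpret them
does, which is why the system locale setting matters for ANSI applications.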

In retrospect, this seems like such a simple solution. Once the default
system locale was set correctly, I was able to cut and paste from Unicode to
native code without any further difficulties. The only drawback is that I
need to reboot my computer each time in order to change the default system
locale. I think it is possible to circumvent this limitation by creating
multiple accounts with different language settings (a process described in
Kano's article).

Thanks once again to everyone on this list for your help. The problems that
we are encountering in multilingual computing environments are an exciting
indicator of how globalization is transforming our world. I'm eager to learn
more about these issues, and am heartened to have discovered a community of
individuals who are helping to make the Internet a truly global
communications tool.

Best regards,

Aaron Delwiche
Director of Interface Development
___________________________________________
L E M O N

12/F Tin On Sing Commercial Bldg
41-43 Graham Street, Central, Hong Kong
tel (852) 2537-2313 fax (852) 2537-5678
__________________________________________

----- Original Message -----
From: "Suzanne Topping" <stopping@rochester.rr.com>
To: "Unicode List" <unicode@unicode.org>
Sent: Tuesday, March 21, 2000 11:04 PM
Subject: Re: Developing multilingual web sites

> Hi Aaron,
>
> It doesn't look like you received very many responses on list, so I'm going
> to take a stab at it... I'm far from a web or Unicode expert, and so I hope
> that the learned folks out there will correct any misinformation, or add to
> my response.
>
> > I am currently working on a multilingual web site that includes content in
> > five different languages (English, Traditional Chinese, Simplified Chinese,
> > Korean, and Japanese). I am developing these pages with a text editor (e.g.,
> > UltraEdit and Homesite) and graphics applications such as Adobe Photoshop.
> > My operating system is Windows 2000.
>
> From your general description, I am assuming that you are not storing
> content in a database, is that correct?
>
> As someone pointed out to you, you'll need to make a decision about how you
> want to store the data; as Unicode in a single file, or in multiple files
> using individual encodings. My understanding is that if you want to keep all
> of the content in one file or database, you'll really need to use Unicode. I
> don't believe you can store chunks of data within a single file/database in
> multiple encodings. So that is issue number one; how are you going to
> structure and store data.
>
> > During the past week, I have received a batch of multilingual content that
> > is mostly in Unicode format. I can easily view the content in Outlook
> > Express, and I can also view it in Wordpad. However, when I export the files
> > to text/HTML format, all of the unique character information is lost.
>
> I am assuming that this problem may have to do with the encoding settings in
> the application that is doing the exporting, or the encoding settings in the
> application in which you later view the HTML. If the content is in Unicode,
> then all of these settings should also be set to Unicode.
>
> > I am searching for reliable tools that will export Unicode format to the
> > appropriate native code format (e.g. Big 5, Shift JIS).
>
> At what point do you plan to use these transforms? In the process that you
> have described, there is no mention of transforming the data. Again, if you
> have all the data in a single file, there is probably not a good way to get
> only the Chinese content to be transformed to Big 5, etc. How is your data
> segregated?
>
> If you do indeed want to serve data in native encodings, then you could
> consider using multiple URLs and serving each language as its own site.
> (This would be the simplest solution.) Otherwise, you will have to find a
> way to dish up only the language that you need, transform it to the proper
> encoding, and specify the native encoding in the HTTP Content-Type header
> or an HTML meta tag.
>
> > 2. Can anyone point me to reliable software that will make it possible to
> > create multilingual graphics with applications such as Adobe Photoshop or
> > Macromedia Fireworks?
>
> Photoshop is useful because the designer can place the original text in a
> layer, and the localization process can then create a new layer with that
> text translated. You end up with a single Photoshop file containing all of
> the languages, and you can simply turn layers on and off to get the versions
> that you need.
>
> > 3. Does anyone know of useful on-line articles that provide a detailed
> > explanation of the differences between Unicode and the native character
> > sets?
>
> I don't know about on-line articles, but Andrea Vine has a great article
> called "Demystifying Character Sets" in the August/September 1999 issue of
> MultiLingual Computing and Technology magazine (#26, Volume 10, Issue 4.)
> (www.multilingual.com)
>
> > 4. What happens to native code text when it is copied and pasted to the
> > clipboard under Windows 2000? Does it preserve the native code format?
>
> Given Windows 2000 language support, one would think it would be preserved,
> and proper handling of it would be more dictated by the settings of the
> application into which you later paste it. But once again I'll say that I'm
> no expert.
>
> The same issue of MultiLingual Computing and Technology contains an article
> by Chris Pratley of Microsoft called "Taking Advantage of Office 2000". A
> section in that article describes features for "Plain Text Open & Save". It
> primarily discussed Word, and said that the encoding method of a file will
> be detected when you try to open it, and you can then change settings to
> save it to the desired encoding method using the Save As feature. It didn't
> seem to mention pasted text.
>
> Unfortunately, this doesn't really answer your question, but I'm hoping
> perhaps Chris will see this response and provide a better answer.
>
> Good luck with your work!
>
> --++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Suzanne Topping
> Localization Unlimited
> (Globalization Process Improvement Consulting and Training)
>
> In association with BizWonk (TM)
>
> Phone: 716-473-0791
> Fax: 716-231-2013
> Email: stopping@rochester.rr.com
>
> (Send me an email to join the North East Localization Special Interest
> Group, an email distribution list which acts as a discussion forum for
> localization issues.)
>


