1. "Unicode" is not always 16 bits. It has several encodings. The one
used on the web is usually UTF-8, a variable-width encoding built from
8-bit code units. You should NOT send UTF-16/UCS-2 (the 16-bit variety
of Unicode) to a browser, because none of the major browsers will
understand it (actually, that's a simplification...)
2. If you set the charset for a page to a specific encoding, say UTF-8,
then that's the encoding you get back from the browser, unless the user
manually changes the encoding using their View menu (which renders the
display illegible unless the page is in English).
You don't have to do anything. The browser handles the text conversion
from the user's input character set to Unicode for you.
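
As a rough illustration of what arrives on the server side, here is a
minimal Python sketch. It assumes the page was served as UTF-8 and the
form was submitted with the usual URL encoding; the field name "name"
and the helper decode_form() are made up for the example.

```python
from urllib.parse import parse_qs, urlencode


def decode_form(query_string, charset="utf-8"):
    """Decode a URL-encoded form body using the charset the page declared."""
    # errors="strict" makes a charset mismatch fail loudly instead of
    # silently producing replacement characters.
    return parse_qs(query_string, encoding=charset, errors="strict")


# Simulate what a browser would send for a UTF-8 page: the text is
# converted to UTF-8 bytes, then each non-ASCII byte is percent-escaped.
original = "Zoë 日本"
submitted = urlencode({"name": original}, encoding="utf-8")

fields = decode_form(submitted)
name = fields["name"][0]
print(submitted)
print(name)
```

The server never sees the user's local character set, only the bytes in
the page's declared encoding.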
This is also how Japanese users, for example, can use a web site like
Yahoo Japan (which is encoded as EUC-JP, whereas most Japanese PCs and
Macs use the character set commonly referred to as Shift-JIS...).
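
To make the difference between these encodings concrete, the following
Python sketch encodes one short Japanese string several ways and then
converts between two of them. The transcode() helper is hypothetical
and only mimics, very loosely, the kind of conversion a browser does;
it is not how any particular browser is implemented.

```python
def transcode(data, source, target):
    """Convert bytes from one character encoding to another via Unicode."""
    return data.decode(source).encode(target)


text = "日本語"

sjis = text.encode("shift_jis")    # common on Japanese PCs and Macs
euc = text.encode("euc_jp")        # the encoding mentioned for the site
utf8 = text.encode("utf-8")        # variable-width, 8-bit code units
utf16 = text.encode("utf-16-le")   # 16-bit code units, no byte order mark

for label, data in [("Shift-JIS", sjis), ("EUC-JP", euc),
                    ("UTF-8", utf8), ("UTF-16", utf16)]:
    print(label, len(data), data.hex())

# Same characters, different bytes: convert the local Shift-JIS input
# into the encoding the page asked for.
converted = transcode(sjis, "shift_jis", "euc_jp")
```

Note that the three characters take more bytes in UTF-8 than in UTF-16,
which is one way to see that "Unicode" does not imply a fixed 16 bits.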
In other words: you don't have to do anything. The browser and operating
system do it all for you. The user will never be aware that their input is
being converted to Unicode unless they look at the source of the HTML page
and see the META tag. All you have to do is pick up the results on the
Note that, unless you are using a Unicode encoding, you will have to
change your character parsing algorithm for each and every character set
(and thus language) that you intend to support. And you won't be able to
store that data in the same database with data created in another language
(with some obvious exceptions to that rule). Unicode solves a whole bunch
of problems on the server side.
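
A small sketch of the storage point, using Python's built-in sqlite3
module purely as a stand-in for whatever database you run (SQLite
stores text as UTF-8 by default). The table and column names are
invented for the example.

```python
import sqlite3

# One in-memory database, one table, text in several languages.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (lang TEXT, body TEXT)")

rows = [
    ("en", "Hello"),
    ("ja", "こんにちは"),
    ("ru", "Привет"),
    ("el", "Γειά σου"),
]
conn.executemany("INSERT INTO messages VALUES (?, ?)", rows)

# Everything comes back as ordinary Unicode strings, so the same
# string handling code works for every row regardless of language.
stored = conn.execute(
    "SELECT lang, body FROM messages ORDER BY lang"
).fetchall()

for lang, body in stored:
    print(lang, len(body), body)

conn.close()
```

With a per-language legacy encoding, each of those rows would need its
own decoding rules before the text could be parsed or compared.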
I urge you to get a copy of the "Unicode 3.0" book and Ken Lunde's
excellent "CJKV Information Processing", both of which explain goodly
chunks of this. And check out the internationalization section of the
Addison P. Phillips Principal Consultant
Inter-Locale LLC http://www.inter-locale.com
Los Gatos, CA, USA mailto:email@example.com
+1 408.210.3569 (mobile) +1 408.904.4762 (fax)
Globalization Engineering & Consulting Services
On Tue, 3 Oct 2000, George Zeigler wrote:
> I would like to understand something. If I do have a site in unicode, how
> do I get people to enter data in unicode? We can test for the 16 bits, but I
> would not know how to instruct someone to enter data specifically in unicode
> character set.