Re: UTF-8 code in HTML

From: Andrew Cunningham (andjc@ozemail.com.au)
Date: Sun Apr 16 2000 - 08:52:38 EDT


Hi

Both Addison and Mark raise interesting issues.

Actually creating utf-8 web pages can be a complicated process, swapping
between a suite of programs to create one page. Currently, there is no one
ideal solution for creating them.

But then again, there are some languages out there, where utf-8 web pages
are the only viable solution.

I've been trying to grapple with a schema describing what is going on with
multilingual web sites. Nearly all the discussion of web sites I've seen
involves corporate/business sites, with the focus on localising or
internationalising sites. And a lot of the solutions out there are based on
this. Even the discussion of legacy character encodings versus utf-8 is an
example of this.

As a rule, when designing web pages for a foreign market, Hong Kong for
instance, it is safe to assume that the people in Hong Kong have their
browsers configured to display Chinese characters using legacy systems. So
the argument goes: is it safe to assume utf-8 support as well? With the
current state of development .. maybe .. maybe not ...

Mark raised the idea of translated instructions, and Addison pointed out the
problems with such a solution. I found it interesting: the current biases in
multilingual web design are showing through.

The multilingual web pages I deal with are written in Australia for an
Australian audience whose primary language is not English. By default the
software distributed here is English only, without multilingual support
enabled.

With local multilingual community information sites, it cannot be assumed
that the end user has their computer properly configured for a specific
language, regardless of whether you are using utf-8 or a legacy character
encoding.

A well designed multilingual community/government site focusing on the local
multilingual communities must have instructions on configuring browsers to
use the site. I can think of a number of government sites here in Australia
that have been designed by web designers without an awareness of local NESB
communities' information usage patterns.

The other factor involved here is the range of languages. Community
language sites can be very diverse in the languages that need to be
supported. Actually, community language sites can be more challenging than
commercial sites, due to the range of languages. And for these sites, utf-8
makes more sense, despite the lack of software and OS tools for effectively
utilising utf-8.
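
To make the point about the range of languages concrete, here is a minimal
Python sketch. The sample phrases and the list of legacy encodings are
illustrative assumptions, not a survey of what community sites actually use.
It checks which encodings can represent a single page that mixes several
scripts.

```python
# Hypothetical sample phrases; any text in these scripts would do.
SAMPLES = {
    "Vietnamese": "Ti\u1ebfng Vi\u1ec7t",
    "Chinese": "\u4e2d\u6587",
    "Arabic": "\u0627\u0644\u0639\u0631\u0628\u064a\u0629",
    "Greek": "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac",
}

# A few legacy encodings, chosen only for illustration.
LEGACY = ["iso-8859-1", "big5", "iso-8859-6", "iso-8859-7"]


def encodable(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def codecs_covering_all(samples, codecs):
    """Return the codecs able to encode one page holding every sample."""
    page = " / ".join(samples.values())
    return [c for c in codecs if encodable(page, c)]


if __name__ == "__main__":
    print(codecs_covering_all(SAMPLES, LEGACY + ["utf-8"]))
```

Each legacy encoding handles its own script, but only utf-8 covers the
combined page, which is the situation a community language site is in.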

Just my two cents worth, not that that's legal tender here anymore.

Ciao

Andrew Cunningham

----- Original Message -----
From: Addison Phillips [GSC] <addison@globalsight.com>
To: Unicode List <unicode@unicode.org>
Cc: Unicode List <unicode@unicode.org>; Bill Hall <billh@simultrans.com>
Sent: Sunday, 16 April 2000 9:25
Subject: Re: UTF-8 code in HTML

> Those are some useful ideas, Mark, but I doubt that it'll sway anyone to
> using UTF-8 sooner.
>
> The real issue here, if I understand correctly, is when to switch the
> Received Wisdom from "serve legacy" to "serve UTF-8". The relative ability
> of browsers is not really at issue here, since it is a problem that is a)
> correctable by the user and b) going away anyways.
>
> I actually see this as a server architecture problem. We have *plenty* of
> reasons left on the server side not to use UTF-8, even though Unicode
> encodings simplify our lives tremendously.
>
> Some examples:
>
> 1. Content creators ("page designers"----> not programmers) need to have
> tools that invisibly support UTF-8, including boring old text editors.
> This is actually a serious obstacle: if the site is created and
> maintained in Latin-1, then we have to maintain a whole infrastructure
> for just that purpose. Do NOT tell me that translating 3000 HTML snippets
> into UTF-8 "automagically" is the answer!
> 2. Our scripting and cgi languages need fixing. Perl 5.6 has the requisite
> support (but it is *brand* new). So many other Web technologies do not.
> For example, I've got a guy busy next week lobotomizing PHP using ICU...
> 3. The JVMs need to be updated. Even NN4.7 still carries around a less
> than completely recent JVM. Also next week I have a guy making a chat
> client that sends everything to the server as an unprocessed stream (plus
> a locale tag) because the server can run J2 and the client has to be able
> to run J1.0 (basically). One significant issue is that I have to transmit
> the JVM version to the server so that more modern JVMs that can do
> real-honest-to-betsy-UTF8 can actually do the conversion themselves. You
> never know when the JRE is going to be installed...
> 4. Template language processors need to be updated. Yes, UTF-8 can "sneak
> by" the processor in most cases, but what about things like toUpper( )?
> Awareness of Unicode is valuable here too.
> 5. URL encodings, storage, and the like. Some web servers are darned
> cranky if the characters encoded in the %hh don't actually match the file
> system byte values. Universal use of UTF-8 here would *really* help.
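
Points 4 and 5 in the list above can be illustrated with a short Python
sketch. The file name is a made-up example, and matches_disk is a
hypothetical helper rather than the behaviour of any particular web server.
The first part shows UTF-8 "sneaking by" a byte-oriented toUpper while
getting the wrong answer; the second shows how the %hh bytes in a URL depend
on the encoding used to build it.

```python
from urllib.parse import quote, unquote_to_bytes

name = "caf\u00e9.html"  # hypothetical file name containing one non-ASCII letter

# Point 4: a byte-oriented toUpper only touches ASCII letters, so the UTF-8
# bytes pass through undamaged but the accented letter is never uppercased.
bytewise_upper = name.encode("utf-8").upper().decode("utf-8")
unicode_upper = name.upper()

# Point 5: the same name percent-encodes differently per character encoding.
url_utf8 = quote(name, encoding="utf-8")
url_latin1 = quote(name, encoding="latin-1")


def matches_disk(url_path, disk_bytes):
    """Hypothetical check: do the decoded %hh bytes equal the stored name?"""
    return unquote_to_bytes(url_path) == disk_bytes


if __name__ == "__main__":
    print(bytewise_upper, unicode_upper)
    print(url_utf8, url_latin1)
    # A file stored under its UTF-8 name is only found via the UTF-8 URL.
    print(matches_disk(url_utf8, name.encode("utf-8")))
    print(matches_disk(url_latin1, name.encode("utf-8")))
```
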
>
> What I've actually been telling my customers for the last while is that
> UTF-8 is coming: it's a matter of time now. Legacy encodings are all very
> nice for specific applications, but use Unicode internally to build pages,
> interact with your database, etc. One area where I've been unsure has been
> whether to convert actual File System Assets (files stored on disk) to
> UTF-8, if I later intend to serve them as legacy encoding. For now I've
> defaulted to leaving them in code page and converting everything to match
> the container (file). But it is time to actually say "*everything* as
> UTF-8" and convert it based on the browser version string as necessary.
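
The "store everything as UTF-8, convert on the way out" idea might look
roughly like the Python sketch below. The marker strings, the function names
and the fallback to numeric character references are all assumptions made
for illustration; a real site would need a properly researched browser
capability table.

```python
# Illustrative only: substrings assumed to identify UTF-8-capable browsers.
UTF8_CAPABLE_MARKERS = ("MSIE 5", "Netscape6")


def pick_charset(user_agent, legacy_charset):
    """Choose utf-8 for browsers on the assumed list, else the legacy charset."""
    if any(marker in user_agent for marker in UTF8_CAPABLE_MARKERS):
        return "utf-8"
    return legacy_charset


def render(stored_utf8, user_agent, legacy_charset):
    """Transcode a UTF-8 asset for one request; return (header, body bytes)."""
    charset = pick_charset(user_agent, legacy_charset)
    text = stored_utf8.decode("utf-8")
    # Characters missing from the legacy charset become &#NNNN; references,
    # which is a reasonable fallback for HTML but not for other content.
    body = text.encode(charset, errors="xmlcharrefreplace")
    header = "Content-Type: text/html; charset=" + charset
    return header, body


if __name__ == "__main__":
    page = "Gr\u00fc\u00dfe \u4e2d".encode("utf-8")
    print(render(page, "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)", "iso-8859-1"))
    print(render(page, "Mozilla/4.7 [en] (Win98; I)", "iso-8859-1"))
```
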
>
> Let's say that the figures I posted earlier in the week are accurate. If
> 2/3 of the Mac users run Netscape and 35% of the population runs IE5, then
> that is awfully close to half of the eyeballs out there able to view UTF-8
> without interruption (in their own language). Another six months plus NN6
> should make this case compelling, no?
>
> I think restraint should be used, in the meantime, in telling people
> simply that "UTF-8 solves all your problems"... like most I18N issues, it
> substitutes one set of problems for another. It's just (as Bill Hall
> always says) that "one set of problems is *much* more interesting than the
> other." ;-)
>
> Oh, a fly in Mark's ointment: how many versions of how many browsers would
> we have instructions for? IE is maddeningly different from version to
> version. Netscape is at least somewhat consistent, but it has several
> versions (and more limited localization. Do we show the English or the
> Japanese versions of the browser in Korea?)
>
> thanks,
>
> Addison
>
> Addison P. Phillips
> Senior Globalization Consultant
> Global Sight Corporation
> mailto:addison@globalsight.com
> ================================
> (+1) 408.350.3600 - Telephone
> http://www.globalsight.com
> ================================
> Going global with your web site? Global Sight provides Web-based software
> solutions that simplify the process, cut costs, and save time.
>
> ----- Original Message -----
> From: Mark Davis <markdavis@ispchannel.com>
> To: Glen Perkins <Glen.Perkins@NativeGuide.com>
> Cc: Unicode List <unicode@unicode.org>; Addison Phillips [GSC]
> <addison@globalsight.com>
> Sent: Saturday, April 15, 2000 1:51 PM
> Subject: Re: UTF-8 code in HTML
>
>
> > One thought:
> >
> > 1. Make a simple web page explaining how to set up different browsers
> > with the right fonts to read UTF-8.
> > 2. Make a button-like GIF that says something like "Display Problems?"
> > with a link to the page.
> > 3. Get volunteers to translate this page and the text in the GIF into
> > multiple languages.
> > 4. Post the pages and GIFs on the Unicode site in an accessible area.
> > 5. Encourage people to use the linked GIFs on their own sites, and/or
> > copy them and modify as they see fit.
> >
> > Do you think this kind of thing would help?
> >
> > Mark
>
>
>



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:02 EDT