Re: questions on implementing an embeded system that supports unicode

From: Addison Phillips (addison@yahoo-inc.com)
Date: Wed Aug 01 2007 - 12:42:28 CDT

Next message: Philippe Verdy: "RE: questions on implementing an embeded system that supports unicode"

Previous message: Magda Danish (Unicode): "Subj: Unicode form field validation in javascript"
In reply to: de Brebisson, Cyrille (Calculator Division): "questions on implementing an embeded system that supports unicode"
Next in thread: Philippe Verdy: "RE: questions on implementing an embeded system that supports unicode"

Hi Cyrille,

> I am working on an embedded system that supports the UTF-16
> subset of Unicode.

UTF-16 is a character encoding of the entire Unicode character set. It's
not a subset.
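
To make that concrete, here is a rough sketch of how code points outside the Basic Multilingual Plane are read from UTF-16: each one is stored as a pair of 16-bit "surrogate" code units. The function name utf16_decode and the choice to return U+FFFD for malformed input are my own assumptions for illustration, not something from your system or from any particular library.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu  /* U+FFFD, returned here for malformed input */

/* Decode one code point from a UTF-16 buffer.
 * s:    pointer to the next code unit
 * len:  number of code units available (must be >= 1)
 * used: receives the number of code units consumed (1 or 2) */
static uint32_t utf16_decode(const uint16_t *s, size_t len, size_t *used)
{
    uint16_t hi = s[0];
    *used = 1;

    if (hi >= 0xD800 && hi <= 0xDBFF) {            /* lead (high) surrogate */
        if (len >= 2 && s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
            *used = 2;
            return 0x10000u
                 + (((uint32_t)(hi - 0xD800) << 10)
                 |  (uint32_t)(s[1] - 0xDC00));
        }
        return REPLACEMENT_CHAR;                   /* unpaired lead surrogate */
    }
    if (hi >= 0xDC00 && hi <= 0xDFFF)
        return REPLACEMENT_CHAR;                   /* stray trail surrogate */

    return hi;                                     /* ordinary BMP character */
}
```

With this, the code units D808 DC00 decode to U+12000 (a Cuneiform sign in Plane 1), while 0041 decodes to U+0041 and consumes a single unit. A device that only ever draws BMP characters still needs to recognize surrogates so that it can skip or substitute them cleanly.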

> - assuming that I use bitmap fonts (6*8 or so for Latin letters
> and 12*12~14*14 for Asian, and I do not know how big Arabic and
> similar letters need to be), how much memory will I need to
> dedicate to the fonts?

"A lot"?

The real question is how much of Unicode your device is going to render.
Total coverage presents several problems, not the least of which are
right-to-left languages.
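
For a back-of-the-envelope figure, the arithmetic can be sketched as below. I am assuming 1 bit per pixel and no compression; the helper names are made up for this illustration, and the glyph counts you plug in depend entirely on which characters you decide to cover.

```c
#include <stddef.h>

/* Bytes for one 1-bit-per-pixel glyph with all pixels packed back to back. */
static size_t glyph_bytes_packed(size_t width, size_t height)
{
    return (width * height + 7) / 8;
}

/* Bytes for one 1-bit-per-pixel glyph when every row starts on a byte
 * boundary (simpler to blit, but wastes the padding bits). */
static size_t glyph_bytes_row_aligned(size_t width, size_t height)
{
    return ((width + 7) / 8) * height;
}

/* Total bitmap storage for a font of glyph_count glyphs. */
static size_t font_bytes(size_t glyph_count, size_t bytes_per_glyph)
{
    return glyph_count * bytes_per_glyph;
}
```

As an example, the 95 printable ASCII characters at 6*8 packed take 95 * 6 = 570 bytes. The CJK Unified Ideographs block as originally encoded (U+4E00..U+9FA5) has 20,902 characters: at 12*12 packed that is 20,902 * 18 = 376,236 bytes, and at 14*14 with byte-aligned rows it is 20,902 * 28 = 585,256 bytes. Index tables, and any extra glyphs needed for contextual forms, come on top of that.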

Some languages use what we call "complex" scripts, where the sequence of
characters influences how they are displayed. For example, most Arabic
letters take a different shape depending on how they join to their
neighbors: roughly, whether they begin a word, end it, fall in the
middle, or stand alone. This means storing up to four "glyphs"
(bitmaps), one per contextual form, for each Arabic letter (some
letters have only two forms). In addition, the vowels
(when used) are rendered as "combining marks" (above or below the base
character). This is quite apart from the fact that Arabic is also a
right-to-left language. Rendering of complex scripts (and combining
marks, etc.) is not simply an issue of having a monospaced "cell" of
bits to display for each individual character in sequence (as it is for
most English or Far East Asian texts). And note that complex scripts are
not limited to right-to-left languages. The Devanagari script (used to
write Hindi, among other languages), for example, is both a
left-to-right script and a complex script.
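
To give a feel for the contextual shaping described above, here is a deliberately simplified sketch of choosing among the four Arabic forms. The type and function names are hypothetical, the joining table covers only ten letters, and the sketch ignores combining marks (which a real shaper must skip over) as well as ligatures such as lam-alef. A production implementation would take joining types from the Unicode Character Database instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified joining classes: a right-joining letter connects only to the
 * letter before it; a dual-joining letter connects on both sides. */
typedef enum { JOIN_NONE, JOIN_RIGHT, JOIN_DUAL } JoinType;
typedef enum { FORM_ISOLATED, FORM_INITIAL, FORM_MEDIAL, FORM_FINAL } GlyphForm;

/* Tiny illustrative table; anything not listed is treated as non-joining. */
static JoinType join_type(uint32_t cp)
{
    switch (cp) {
    case 0x0627: /* ALEF */
    case 0x062F: /* DAL  */
    case 0x0631: /* REH  */
    case 0x0648: /* WAW  */
        return JOIN_RIGHT;
    case 0x0628: /* BEH  */
    case 0x062A: /* TEH  */
    case 0x0633: /* SEEN */
    case 0x0644: /* LAM  */
    case 0x0645: /* MEEM */
    case 0x0646: /* NOON */
        return JOIN_DUAL;
    default:
        return JOIN_NONE;
    }
}

/* Pick the contextual form of text[i], with text in logical (typing) order. */
static GlyphForm glyph_form(const uint32_t *text, size_t len, size_t i)
{
    JoinType cur = join_type(text[i]);

    /* Connects backward if this letter can join at all and the previous
     * letter is able to join forward (only dual-joining letters can). */
    int joins_prev = i > 0 && cur != JOIN_NONE
                  && join_type(text[i - 1]) == JOIN_DUAL;

    /* Connects forward if this letter is dual-joining and the next letter
     * is any joining letter. */
    int joins_next = i + 1 < len && cur == JOIN_DUAL
                  && join_type(text[i + 1]) != JOIN_NONE;

    if (joins_prev && joins_next) return FORM_MEDIAL;
    if (joins_prev)               return FORM_FINAL;
    if (joins_next)               return FORM_INITIAL;
    return FORM_ISOLATED;
}
```

For the word BEH NOON TEH this yields initial, medial, final. For BEH ALEF BEH it yields initial, final, isolated: the last BEH stands alone even though it ends the word, because ALEF does not join forward. That is why the choice depends on the neighbors and not simply on position in the string.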

In terms of font support, note that you probably don't need the full
range of Unicode. For example, many of the scripts in Plane 1 are of
historical interest only. I doubt whether your embedded application
really needs support for Cuneiform or Gothic. Similarly there are
scripts in the Basic Multilingual Plane that might not be of commercial
interest to you, as much as supporting them would be a Good Thing. For
example, the Tifinagh or Ethiopic scripts might not be something you need
right away. If you figure out which languages and scripts to support
(perhaps based on what rendering capabilities you can provide), you can
probably limit your font support to those characters.
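
One common way to limit font support like this is a sorted table of code point ranges mapped onto a compact glyph index, with a "missing glyph" box for everything else. The sketch below is only an illustration: the struct layout, the names, and the three ranges (Basic Latin, Latin-1 Supplement, and the basic Russian Cyrillic letters) are my assumptions, not a recommendation of what your product should cover.

```c
#include <stddef.h>
#include <stdint.h>

#define MISSING_GLYPH 0u  /* glyph 0 is reserved for a "not available" box */

typedef struct {
    uint32_t first;       /* first code point covered */
    uint32_t last;        /* last code point covered (inclusive) */
    uint16_t glyph_base;  /* glyph index assigned to `first` */
} GlyphRange;

/* Must stay sorted by code point and must not overlap. */
static const GlyphRange k_ranges[] = {
    { 0x0020, 0x007E,   1 },  /* Basic Latin printable: glyphs 1..95    */
    { 0x00A0, 0x00FF,  96 },  /* Latin-1 Supplement:    glyphs 96..191  */
    { 0x0410, 0x044F, 192 },  /* Cyrillic A..ya:        glyphs 192..255 */
};

/* Binary search for the range containing cp; returns its glyph index,
 * or MISSING_GLYPH if the font does not cover cp. */
static uint16_t glyph_index(uint32_t cp)
{
    size_t lo = 0;
    size_t hi = sizeof k_ranges / sizeof k_ranges[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < k_ranges[mid].first)
            hi = mid;
        else if (cp > k_ranges[mid].last)
            lo = mid + 1;
        else
            return (uint16_t)(k_ranges[mid].glyph_base
                              + (cp - k_ranges[mid].first));
    }
    return MISSING_GLYPH;
}
```

Here glyph_index(0x0041) gives 34 and glyph_index(0x0410) gives 192, while an uncovered character such as U+2D30 (Tifinagh) or U+12000 (Cuneiform) falls through to the missing-glyph box. Adding a script later means adding ranges and bitmaps, not changing the lookup.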

> - how critical is the implementation of the RTL languages? It
> seems to add quite a lot of complexity to the system and, once
> again, in a low power embedded system might not be worth it?

The answer to that question depends on your needs. If you're going to
sell your products in these markets, it's kind of important. If not, you
can probably avoid it. Complex script markets do represent a growing
area of importance, so you probably should take a hard look at the
tradeoffs before deciding not to support one or another capability.

> - is there any existing software package that I could start
> with/use as a basis in my system so that I do not have to
> rewrite everything?

That's difficult to answer. You don't say what your programming language
or environment are like, although I assume that you have at least a C
compiler...

Many people start with the ICU open source library, as it implements
many of the features of Unicode, including the bidirectional algorithm
and Arabic shaping. See:

http://www.icu-project.org

Hope that helps,

Addison

-- 
Addison Phillips
Globalization Architect -- Yahoo! Inc.
Chair -- W3C Internationalization Core WG
Internationalization is an architecture.
It is not a feature.
de Brebisson, Cyrille (Calculator Division) wrote:
> Hello,
> 
> I have a couple of questions, which will probably make it obvious that I am a newbie :)
> 
> I am working on an embedded system that supports the UTF-16 subset of Unicode. I have of course read the FAQ and lots of "high level articles". However, I still have a couple of questions, mainly related to displaying Unicode strings:
> - assuming that I use bitmap fonts (6*8 or so for Latin letters and 12*12~14*14 for Asian, and I do not know how big Arabic and similar letters need to be), how much memory will I need to dedicate to the fonts?
> - how critical is the implementation of the RTL languages? It seems to add quite a lot of complexity to the system and, once again, in a low power embedded system might not be worth it?
> - is there any existing software package that I could start with/use as a basis in my system so that I do not have to rewrite everything?
> 
> Thanks, Cyrille
>


This archive was generated by hypermail 2.1.5 : Wed Aug 01 2007 - 12:45:44 CDT