Warning - long article.
This article deals with other issues too, but the proper implementation of
Unicode is one of the main concerns, so I am copying the Unicode list too.
---------- Forwarded Message ----------
From: Jonathan Rosenne, 100320,1303
TO: HTML_WG, INTERNET:email@example.com
CC: Mati ALLOUCHE, INTERNET:G01@TELVM1.VNET.IBM.COM
Arik Ben-Dov, INTERNET:firstname.lastname@example.org
Dani Dolev, INTERNET:email@example.com
Mike Feldman, INTERNET:firstname.lastname@example.org
Stefan Fuchs, INTERNET:email@example.com
Israel Gidali, INTERNET:firstname.lastname@example.org
Nati Guedalia, INTERNET:email@example.com
Zvi Ilani, INTERNET:firstname.lastname@example.org
Dan Kalish, INTERNET:email@example.com
Jakob Kielmanzon, INTERNET:firstname.lastname@example.org
ILAN Hebrew List, INTERNET:ilan-h@VM.TAU.AC.IL
Eli Marmor, INTERNET:marmor@Elmar.co.il
Gil Mor, INTERNET:email@example.com
Yevgenia Palanker, INTERNET:firstname.lastname@example.org
Ella Pinski, INTERNET:STANDARD@NETVISION.NET.IL
Khaled Sherif, INTERNET:email@example.com
Elisha Yelin-Mor, INTERNET:firstname.lastname@example.org
DATE: 30/04/96 20:57
RE: Copy of: I18N of HTML - Hebrew
This article is a comment on "Internationalization of Hypertext
Markup Language", <draft-ietf-html-i18n-03.txt> dated 13 February
1996, with respect to the requirements and implementation of
bidirectionality (bidi, in short).
My comments address bidi from the point of view of Hebrew. I
expect my Arab colleagues to address their point of view.
In appendix A I briefly discuss bidi concepts, the implementation
of Hebrew and bidirectionality in Unicode and in SGML, and
provide some background information to those readers who are not
familiar with Hebrew and bidi.
I am posting this to the html-wg list, and comments, praise and
criticism may be posted there or sent to me.
I am sending this also to other people, but please respond
through the above mentioned channels.
Jonathan (Jony) Rosenne
P O Box 33641, Tel Aviv, Israel
The implementation of bidi should conform to the Unicode
specification. As proposed, there are some deviations.
Markup should provide higher level parameters rather than replace
low level functions.
Additional requirements are discussed.
Appendix A provides background information and some explanation
of the requirements and terminology.
Appendix B includes the entity names for the Hebrew characters
from SI 1680.
2. Formatting codes
The subject document, in clause 4.2. Markup for language-
dependent presentation, proposes removing several Unicode
formatting codes and replacing them with markup.
"On block-type elements, the DIR attribute indicates the base
directionality of the text in the block; if omitted it is
inherited from the parent element. On inline elements, it makes
the element start a new embedding level (to be explained below);
if omitted the inline element does not start a new embedding
level."
The proposed attribute is ambiguous: it has two meanings
depending on whether the element it is attached to is considered
"inline" or "block-type".
"A set of named character entities is added that allows partial
support of the Unicode bidirectional algorithm [UNICODE], plus
some help with languages requiring contextual analysis for
rendering."
The Unicode characters which are more or less equivalent to the
"inline element attribute" are apparently not allowed.
The full set of Unicode bidi formatting codes is required, in the
form of named character entities. These are characters in the
Unicode (UCS-2) character set, just as any other character, and
there is no way to exclude them and still claim compatibility
with those standards. Neither is it justified: they were included
in the standards because they were considered necessary.
Nor is the proposed change useful: since an equivalent function is
provided, it is not a simplification. It is just different. And
it raises the question: What do we do with text that does include
these formatting codes?
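To make the list of formatting codes concrete, here is a minimal sketch in Python that tabulates them with their code points and the bidi class reported by the standard unicodedata module. The entity names (lrm, rlm and so on) are only an assumption made for this illustration; the names actually adopted may differ.

```python
import unicodedata

# Unicode bidi formatting codes. The keys are hypothetical entity names,
# chosen for this sketch only.
BIDI_FORMATTING_CODES = {
    "lrm": 0x200E,  # LEFT-TO-RIGHT MARK
    "rlm": 0x200F,  # RIGHT-TO-LEFT MARK
    "lre": 0x202A,  # LEFT-TO-RIGHT EMBEDDING
    "rle": 0x202B,  # RIGHT-TO-LEFT EMBEDDING
    "pdf": 0x202C,  # POP DIRECTIONAL FORMATTING
    "lro": 0x202D,  # LEFT-TO-RIGHT OVERRIDE
    "rlo": 0x202E,  # RIGHT-TO-LEFT OVERRIDE
}


def describe(entity_name):
    """Return one descriptive line for a formatting code."""
    code_point = BIDI_FORMATTING_CODES[entity_name]
    ch = chr(code_point)
    return "&%s; U+%04X %s (bidi class %s)" % (
        entity_name,
        code_point,
        unicodedata.name(ch),
        unicodedata.bidirectional(ch),
    )


if __name__ == "__main__":
    for name in BIDI_FORMATTING_CODES:
        print(describe(name))
```

The point of the table is that each code is an ordinary character with a code point, so a named entity is enough to represent it in any character coding.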
The SII TC 1109 wish to state clearly: "We will not support
another Hebrew implicit algorithm".
The whole idea of standards is to avoid a "tower of Babel"
The specification should be modified as follows:
The complete Unicode/UCS-2 character set is supported.
Formatting code characters may be represented by named character
entities.
Bidi "inline element attributes" should be deleted. Bidi
attributes apply only to block-level elements, and provide base
directionality, as proposed in the subject document.
HTML tags provide the global direction for the document and the
base direction of each element.
In my opinion, these suggestions not only bring the proposal for
HTML I18N in line with Unicode, but they also simplify it
3. Language attribute
There is no one-to-one correspondence between the language and
the formatting codes.
ZWJ and ZWNJ could also be of use in English - for example, in
"proper" English print adjacent f and l are joined to form an fl
ligature, and ZWNJ could be used to override this if separate
letters were desired in a system that implements typographical
In my experience with multi-lingual writing, the user does not
bother tagging short phrases embedded within another language.
The language attribute should not interact with the bidi
attributes, with one exception: the document level language
attribute (which the subject document suggests will be specified
in the HTTP Content-Language header) provides the global
direction. If the language is a right to left language, based on
the primary tag (such as iw or ar) the global direction is right
to left, otherwise it is left to right.
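The rule above can be sketched in a few lines of Python. The function name and the handling of a comma separated list (only the first language is considered) are assumptions of this sketch; the set of right to left tags contains just the two examples given in the text.

```python
# Primary language tags treated as right to left. "iw" and "ar" are the
# examples from the text; a real list would probably be longer.
RTL_PRIMARY_TAGS = {"iw", "ar"}


def global_direction(content_language):
    """Derive the global direction from a Content-Language value.

    Only the primary tag of the first listed language is examined,
    which is an assumption of this sketch.
    """
    first_language = content_language.split(",")[0].strip()
    primary_tag = first_language.split("-")[0].lower()
    return "RTL" if primary_tag in RTL_PRIMARY_TAGS else "LTR"


if __name__ == "__main__":
    for value in ("iw", "ar-EG", "en-US", "fr"):
        print(value, "->", global_direction(value))
```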
4. The DIR attribute
The proposed DIR attribute confuses the "block" level direction
(the base direction of the piece of text) with the character
The DIR attribute should be defined for "block-type elements" as
proposed, with the values LTR and RTL. Whether it is best to
attach it to the SPAN element or to each element of the document
I don't know. The DIR attribute would specify the base
directionality of an element of text.
For the lower level elements, the Unicode formatting codes as
defined in the Unicode standard provide a standard, extensively
discussed and well understood solution. Their implementation as
named entities makes them available with any character coding
Note: The DTD includes another DIR and I suggest it would be
better to choose another name for the directionality DIR.
5. Justification - the ALIGN attribute
When a paragraph has right to left directionality, its
justification is normally on the right. This applies to all lines
of a right justified paragraph, and to the last line of a fully
justified paragraph.
Thus in a right to left element the roles of the right and left
margins are interchanged, unless specifically overridden. If a
right to left document that is right justified includes a left to
right paragraph without explicit alignment, the expected
interpretation would be to left justify this paragraph.
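As a small illustration of this default, the following Python sketch picks the justification of an element from its base direction unless an explicit ALIGN value overrides it. The function and parameter names are hypothetical.

```python
def default_alignment(base_direction, align=None):
    """Return the effective justification of an element.

    base_direction is "RTL" or "LTR"; align is the explicit ALIGN value,
    or None when the author did not specify one.
    """
    if align is not None:
        return align  # an explicit ALIGN always wins
    return "right" if base_direction == "RTL" else "left"


if __name__ == "__main__":
    # A left to right paragraph inside a right to left document,
    # without explicit alignment, is left justified.
    print(default_alignment("LTR"))
    print(default_alignment("RTL"))
    print(default_alignment("RTL", align="center"))
```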
6. Form fields
When Hebrew or Arabic text input is expected it should be right
justified within the window. The same applies when the field was
left to right but the user elected to type right to left text.
This means that fields should have a direction attribute, but if
the user chooses a language other than the one expected, the
display should follow the user's input, provided of course that
the other language is allowed.
The same applies, of course, to default data that is displayed in
a field from the VALUE attribute.
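One possible way for a user agent to let the display follow the input is sketched below in Python: the direction is taken from the first character with a strong direction, and the field's own direction attribute is used when there is none. This "first strong character" rule is an assumption of the sketch, not something the subject document specifies.

```python
import unicodedata


def typed_direction(text, field_direction):
    """Guess the direction of typed or default text in a form field.

    Returns "RTL" or "LTR" according to the first strongly directional
    character, falling back to the field's declared direction.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):  # Hebrew and Arabic letters
            return "RTL"
        if bidi_class == "L":  # Latin, Greek, Cyrillic letters, etc.
            return "LTR"
    return field_direction  # empty, digits or punctuation only


if __name__ == "__main__":
    print(typed_direction("\u05e9\u05dc\u05d5\u05dd", "LTR"))  # Hebrew word
    print(typed_direction("hello", "RTL"))
    print(typed_direction("123", "RTL"))
```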
I would also expect the browser to set the keyboard language to
match the base direction attribute of the field when the cursor
moves into the field. In a Hebrew system, which is normally
bilingual, Hebrew and English, when the field has right to left
direction the keyboard language should be set to Hebrew, and when
the direction is left to right the keyboard language should be
set to English.
With the data of each field, the form should return the direction
attribute actually selected by the user in addition to the
As proposed, the form should be able to restrict the user's input
to a specific character set, according to the requirements of the
The caveat in paragraph 5.2 should be expanded: In the case of
certain characters, their representation may be changed. For
example, there are two valid codings in UCS-2 for "e grave": a
composed character, and a base character followed by a diacritic.
Some people think the canonical form should be the composed form,
others prefer the decomposed form, but in any case it is possible
that the user agent will convert from one form to the other.
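The "e grave" case can be shown with Python's standard unicodedata module. The sketch assumes that the receiving side compares strings only after bringing both to one form; which form to prefer is left open, as in the text.

```python
import unicodedata

composed = "\u00e8"     # LATIN SMALL LETTER E WITH GRAVE
decomposed = "e\u0300"  # "e" followed by COMBINING GRAVE ACCENT


def same_text(a, b):
    """Compare two strings after converting both to the composed form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(composed == decomposed)           # the raw codings differ
    print(same_text(composed, decomposed))  # the text is the same
```

A server that stores or matches form data would need some such step, since it cannot know which of the two codings the user agent will send.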
In a right to left form with selections, the check boxes and
radio buttons are on the right and the VALUE on the left, right
Tables with the right to left attribute should be arranged from
right to left, i.e. the first column is the rightmost column. In
the absence of specific attributes, each cell should by default
be justified according to its base direction.
Currently, the Hebrew standards for MIME are defined in:
RFC 1555 Hebrew Character Encoding for Internet Messages
RFC 1556 Handling of Bi-directional Texts in MIME
The following specification is compatible with HTML I18N:
Content-type: text/plain; charset=ISO-8859-8-i
The following specifications are not compatible and should not be
Content-type: text/plain; charset=ISO-8859-8-e
Content-type: text/plain; charset=ISO-8859-8
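A receiving program could apply this distinction with a check on the charset parameter, sketched here in Python using the standard email package. The function names and the set COMPATIBLE_CHARSETS are assumptions of the sketch; it lists only the one charset named above as compatible.

```python
from email.message import Message

# Charset named above as compatible with HTML I18N (compared in lower case).
COMPATIBLE_CHARSETS = {"iso-8859-8-i"}


def charset_of(content_type):
    """Extract the charset parameter from a Content-type header value."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()  # lower-cased, or None if absent


def is_compatible(content_type):
    """Tell whether the declared charset is in the compatible set."""
    return charset_of(content_type) in COMPATIBLE_CHARSETS


if __name__ == "__main__":
    for header in (
        "text/plain; charset=ISO-8859-8-i",
        "text/plain; charset=ISO-8859-8-e",
        "text/plain; charset=ISO-8859-8",
    ):
        print(header, "->", is_compatible(header))
```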
The display geometry in HTML assumes left to right direction.
In a right to left direction there are two possibilities:
- the geometry of the screen is not changed. It is up to the
author to lay out his text, images, forms and frames based on the
existing mechanisms to obtain the desired layout.
- the geometry is "mirrored". An application (for example, a CGI
application) designed for left to right will thus function
correctly in a right to left environment without requiring
An attribute is required on the document level to specify the
10. ALT text
Since Unicode will be the basic character set underlying HTML,
there is no reason to restrict the character set of the ALT text.
Of course, it is up to the author to design his page so that he
does not send the user text he cannot see. But if the text is in
Hebrew, there is no reason not to allow ALT text in Hebrew.
Although not really belonging to this document, I would like to
mention the need for UCS-2 coded URLs, at least the part after
the path. Since it is really of no interest to the intermediary
nodes, the only requirement should be that it be understood by
the server. If the server supports file names in the local
language, be it French, Hebrew or Japanese, why not?
12. Preformatted text
Text under the influence of a <PRE> tag and other tags indicating
preformatting should be processed on a line by line basis. It
should be considered preformatted only as far as HTML is
concerned, not on the character level.
Appendix - Background information
1. Who am I?
I am an independent consultant, working in Israel and involved
with national and international Hebrew standards for many years.
I was a member of the ECMA and ISO working groups that worked on
ECMA TR/53 and ISO/IEC 6429, and contributed to the formation of
the bidi and Hebrew elements in Unicode and ISO 10646. Currently
I am a member of the Standards Institution of Israel (SII)
technical committee 1109, which is responsible for basic
standards regarding Hebrew in computing.
SII TC 1109 has discussed the subject, and agrees in principle
with these comments. Due to the short time we had not reviewed
There is one point, however, that we did formally agree on: the
committee will not agree to yet another Hebrew character level
standard, which means that the implementation of Unicode in HTML
should be 100% conformant to the Unicode specification.
2. Bidi Concepts
Hebrew and Arabic are written from right to left, whereas the
European languages are written in the opposite direction.
Moreover, numbers are written from left to right, and so are
phrases in European languages embedded in the right to left text.
This is why this characteristic of our languages is known as
"bidirectionality" - to properly support them it is necessary to
support both directions, which may be mixed from the character
Modern systems, such as Unicode and MS Windows, are based on
"implicit directionality", that is they know which characters are
Hebrew or Arabic and therefore are right to left, which characters
are numeric or Latin or Cyrillic or Greek and therefore are left
to right, and which characters, for example punctuation, have no
inherent directionality and are therefore neutral.
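The three groups can be sketched with the bidi classes that Python's unicodedata module exposes. Treating digits (class EN) simply as left to right follows the description above and is a simplification; the Unicode algorithm handles numbers as a class of their own.

```python
import unicodedata


def inherent_direction(ch):
    """Classify one character as "RTL", "LTR" or "neutral"."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in ("R", "AL"):  # Hebrew and Arabic letters
        return "RTL"
    if bidi_class in ("L", "EN"):  # letters of LTR scripts, European digits
        return "LTR"
    return "neutral"               # punctuation, spaces and the rest


if __name__ == "__main__":
    samples = {
        "Hebrew alef": "\u05d0",
        "Arabic alef": "\u0627",
        "Latin A": "A",
        "Cyrillic De": "\u0414",
        "Greek Omega": "\u03a9",
        "digit 7": "7",
        "exclamation mark": "!",
    }
    for label, ch in samples.items():
        print(label, "->", inherent_direction(ch))
```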
I have not mentioned the far east, languages such as Chinese,
Japanese and Korean, because as far as I am aware no one has done
any widely used implementation of multi-language processing
involving bidi and these languages, especially considering that
they may be written top to bottom. When they are written left to
right they are similar to the European languages.
The implicit directionality requires an algorithm to decide the
direction of each character and on the basis of these directions
render a line of output. A different algorithm will produce
different results and may cause the output to be unusable. The
Unicode specification includes a suitable algorithm, and we
strongly believe that everybody should stick to it. If there are
problems with this algorithm, let us discuss them, and if need be
propose a revision to Unicode.
The implicit directionality requires two further inputs to
correctly interpret the text: the "global direction" of the
document, and the "base direction" of the paragraph (or other
element of the document). The Unicode standard expects these to
be provided by higher level protocols (HTML is such a protocol)
although it does provide defaults when they are not.
The global direction affects the global aspects of the document,
such as the margins, the layout of the title page and of the
table of contents, etc. and provides a default base direction for
the elements of the document.
The base direction affects the implicit algorithm and the
justification of the paragraph.
To give an example: A Hebrew document has a global direction of
right to left. All its paragraphs and other elements would have a
base direction of right to left. The text may include phrases in
other languages, such as English, which are left to right.
If the document were to include a whole paragraph in English,
that paragraph would have a base direction of left to right, even
if it in turn included a right to left phrase in Hebrew.
The elements of a document may have a more elaborate hierarchy,
and in this case it would be expected that the base direction be
inherited by lower level elements unless explicitly specified.
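The inheritance just described can be sketched as a walk up the element hierarchy. The Element class and its attribute names are hypothetical and stand in for whatever structure a real parser would build.

```python
class Element:
    """A document element with an optional explicit base direction."""

    def __init__(self, name, direction=None, parent=None):
        self.name = name
        self.direction = direction  # "RTL", "LTR" or None (not specified)
        self.parent = parent

    def base_direction(self, global_direction):
        """Return the explicit direction of the nearest ancestor that
        specifies one, or the global direction of the document."""
        node = self
        while node is not None:
            if node.direction is not None:
                return node.direction
            node = node.parent
        return global_direction


if __name__ == "__main__":
    # A Hebrew document containing one whole paragraph in English.
    body = Element("body")
    hebrew_paragraph = Element("p", parent=body)
    english_paragraph = Element("p", direction="LTR", parent=body)
    list_item = Element("li", parent=english_paragraph)
    for element in (hebrew_paragraph, english_paragraph, list_item):
        print(element.name, element.base_direction("RTL"))
```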
In the past, Hebrew was implemented using "visual
directionality", in which the application is responsible for
handling the directionality of the text and providing a physical
representation of the text. This was necessary at the time, as
the logic capability of the I/O devices (such as DEC VT 100 or
IBM 3270) was limited and it was difficult even to implement
appropriate character generators. The visual implementation is no
longer appropriate and barely usable in modern systems, which
must accommodate the freedom of the user to select window size
and fonts according to his needs, and "cut and paste" operations
using the mouse. In these systems, bidi must be handled by the
system, where these functions are located, rather than by the
application.
Visual processing of bidi is like an omelet: You can easily
convert an egg to an omelet, but it is not so easy to convert it
back. Once the text has gone through the process and been
rendered physically, it is very difficult to get back to the
The bidi override formatting codes should only be used in short
phrases when the implicit rendering does not produce satisfactory
results, for example with "part numbers", not to disguise
"physical ordering". Logical ordering is necessary to support
display device independence, cut & paste, and to allow the usual
(English oriented) search tools to be used.
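The effect on search tools can be seen in a toy Python example. It models visual ordering of a purely Hebrew phrase as a plain reversal of the stored characters, which is a deliberate simplification; real visual ordering of mixed text is more involved.

```python
# "shalom olam" in logical order, i.e. in the order it is typed and read.
logical = "\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd"

# The same phrase as a visual-order system would store it for a left to
# right device (simplified here to a reversal of the whole line).
visual = logical[::-1]

word = "\u05e9\u05dc\u05d5\u05dd"  # the search term, typed normally

found_in_logical = word in logical
found_in_visual = word in visual

if __name__ == "__main__":
    print("found in logically ordered text:", found_in_logical)
    print("found in visually ordered text:", found_in_visual)
```

An ordinary substring search finds the word only in the logically ordered text.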
3. Unicode and ISO 10646 (UCS-2)
Unicode is a character set standard. It includes the Hebrew
letters and "points", and the next release will add the Hebrew
cantillation marks.
Hebrew includes 27 letters - 22 plus 5 final forms. The Hebrew
"points" (Niqud) are additional signs, used only on special
occasions, indicating the precise pronunciation of some letters
that represent two different consonants, the vowels and stress.
In everyday Hebrew text they are used only rarely.
The cantillation marks are used in Biblical texts and serve two
purposes: they specify punctuation according to a scheme that is
more detailed than the scheme in use today (comma and full stop,
etc.), and indicate the tune for the ritual reading of the Bible.
In addition to these Hebrew characters, and of course the Arabic
characters, Unicode includes some directional formatting codes
"to control the ordering of characters when rendered". In
principle, Unicode specifies implicit ordering, based on the
properties of the characters, i.e. Hebrew and Arabic letters are
implicitly right to left, while digits and Latin, Greek and
Cyrillic letters are left to right. "However, ... there are
certain circumstances where an implicit bidirectional ordering is
not sufficient to produce comprehensible text". The quotes are
from the Unicode Version 1.0, appendix A. Similar words appear in
the draft I have of Unicode 2.0, clause 3.11.
Unicode incorporates a precisely defined implicit algorithm.
According to Unicode and ISO 10646, these formatting codes are
not control characters.
Arabic uses additional formatting codes which I do not feel
qualified to discuss.
ISO 10646 includes the same codes as Unicode, although it does
not specify the implicit algorithm.
However, and this is the main point that should concern the
implementation of Unicode bidirectionality, the Unicode
formatting codes deal with the bidirectional properties only at
the character level, as Unicode is a character code. Higher
levels are expected to be dealt with by higher level protocols,
such as SGML or a word processor.
The Unicode bidi algorithm starts with the base level, which is
"the default horizontal orientation of the text in the current
In the context of a document, such as an SGML general document or
an HTML document, the base direction should be a hierarchical
property of the individual elements.
The SII has recently approved Israeli Standard SI 1680, "Hebrew
Implementation in SGML". The SII is preparing an English summary
and a request to include our extensions in the relevant ISO
SI 1680 includes the following elements:
- additional tags to support bidi
- the specification of the global direction
- entity names for the Hebrew characters
- Hebrew tags for Hebrew general documents
The global direction of the document is derived from the language
specification in the DTD (Hebrew being "iw").
SI 1680 proposes Hebrew tags, corresponding to the English tags
for general documents in ISO/IEC/TR 9573-1988. The use of a
Hebrew tag implies right to left base directionality for the
relevant document element.
A new tag, <ph>, indicates a Hebrew (right to left) paragraph,
whereas <p> indicates a left to right paragraph. The document
elements in the DTD include <ph> or <p> as appropriate to
indicate their base direction.
Since SGML is not restricted to Unicode, and other character
codes, such as ISO 8859-8, do not require implicit directionality
and do not include formatting codes, additional tags were
introduced to specify character level directionality, basically
in parallel with the Unicode formatting codes: RTL, LTR and IMD.
The default is, as in Unicode, implicit directionality. RTL and
LTR specify right to left and left to right directionality, the
same as Unicode RLO and LRO, </RTL> and </LTR> are equivalent to
PDF, and IMD (Implicit Directionality) specifies implicit
directionality. These tags may be nested.
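The stated correspondence between the tags and the Unicode codes can be sketched as a simple text substitution in Python. The function name is hypothetical, and the IMD tag is left untouched because no Unicode equivalent is given for it above.

```python
# Correspondence stated above: RTL = RLO, LTR = LRO, end tags = PDF.
TAG_TO_CODE = {
    "<RTL>": "\u202e",   # RIGHT-TO-LEFT OVERRIDE (RLO)
    "<LTR>": "\u202d",   # LEFT-TO-RIGHT OVERRIDE (LRO)
    "</RTL>": "\u202c",  # POP DIRECTIONAL FORMATTING (PDF)
    "</LTR>": "\u202c",  # POP DIRECTIONAL FORMATTING (PDF)
}


def tags_to_unicode(text):
    """Replace the directional tags by Unicode formatting codes."""
    for tag, code in TAG_TO_CODE.items():
        text = text.replace(tag, code)
    return text


if __name__ == "__main__":
    marked_up = "part <LTR>AB-<RTL>12</RTL>-CD</LTR> end"
    print(ascii(tags_to_unicode(marked_up)))
```

Because the tags nest in the same way as the codes they stand for, a flat substitution is enough for this direction of conversion.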
SI 1680 also includes entity identifiers for the Hebrew
Appendix B. Entity names for the Hebrew Characters
The following names are taken from SI 1680:
<!ENTITY ALEF SDATA ALEF --=HEBREW LETTER ALEF-->
<!ENTITY BET SDATA BET --=HEBREW LETTER BET-->
<!ENTITY GIMEL SDATA GIMEL --=HEBREW LETTER GIMEL-->
<!ENTITY DALET SDATA DALET --=HEBREW LETTER DALET-->
<!ENTITY HE SDATA HE --=HEBREW LETTER HE-->
<!ENTITY VAV SDATA VAV --=HEBREW LETTER VAV-->
<!ENTITY ZAYIN SDATA ZAYIN --=HEBREW LETTER ZAYIN-->
<!ENTITY HET SDATA HET --=HEBREW LETTER HET-->
<!ENTITY TET SDATA TET --=HEBREW LETTER TET-->
<!ENTITY YOD SDATA YOD --=HEBREW LETTER YOD-->
<!ENTITY FINALKAF SDATA FINALKAF --=HEBREW LETTER FINALKAF-->
<!ENTITY KAF SDATA KAF --=HEBREW LETTER KAF-->
<!ENTITY LAMED SDATA LAMED --=HEBREW LETTER LAMED-->
<!ENTITY FINALMEM SDATA FINALMEM --=HEBREW LETTER FINALMEM-->
<!ENTITY MEM SDATA MEM --=HEBREW LETTER MEM-->
<!ENTITY FINALNUN SDATA FINALNUN --=HEBREW LETTER FINALNUN-->
<!ENTITY NUN SDATA NUN --=HEBREW LETTER NUN-->
<!ENTITY SAMEKH SDATA SAMEKH --=HEBREW LETTER SAMEKH-->
<!ENTITY AYIN SDATA AYIN --=HEBREW LETTER AYIN-->
<!ENTITY FINALPE SDATA FINALPE --=HEBREW LETTER FINALPE-->
<!ENTITY PE SDATA PE --=HEBREW LETTER PE-->
<!ENTITY FINALTSADI SDATA FINALTSADI --=HEBREW LETTER FINALTSADI-->
<!ENTITY TSADI SDATA TSADI --=HEBREW LETTER TSADI-->
<!ENTITY QOF SDATA QOF --=HEBREW LETTER QOF-->
<!ENTITY RESH SDATA RESH --=HEBREW LETTER RESH-->
<!ENTITY SHIN SDATA SHIN --=HEBREW LETTER SHIN-->
<!ENTITY TAV SDATA TAV --=HEBREW LETTER TAV-->
<!ENTITY SHINDOT SDATA [SHINDOT] --=HEBREW POINT SHIN DOT-->
<!ENTITY SINDOT SDATA [SINDOT] --=HEBREW POINT SIN DOT-->
<!ENTITY DAGESH SDATA [DAGESH] --=HEBREW POINT DAGESH
OR MAPIQ OR SHURUQ-->
<!ENTITY RAFE SDATA [RAFE] --=HEBREW POINT RAFE-->
<!ENTITY SHEVA SDATA [SHEVA] --=HEBREW POINT SHEVA-->
<!ENTITY HATAFP SDATA [HATAFPATAH] --=HEBREW POINT HATAF PATAH-->
<!ENTITY HATAFS SDATA [HATAFSEGOL] --=HEBREW POINT HATAF SEGOL-->
<!ENTITY HATAFQ SDATA [HATAFQAMATS] --=HEBREW POINT HATAF QAMATS-->
<!ENTITY PATAH SDATA [PATAH] --=HEBREW POINT PATAH-->
<!ENTITY QAMATS SDATA [QAMATS] --=HEBREW POINT QAMATS-->
<!ENTITY SEGOL SDATA [SEGOL] --=HEBREW POINT SEGOL-->
<!ENTITY TSERE SDATA [TSERE] --=HEBREW POINT TSERE-->
<!ENTITY HIRIQ SDATA [HIRIQ] --=HEBREW POINT HIRIQ-->
<!ENTITY HOLAM SDATA [HOLAM] --=HEBREW POINT HOLAM-->
<!ENTITY QUBUTS SDATA [QUBUTS] --=HEBREW POINT QUBUTS-->
<!ENTITY METEG SDATA [METEG] --=HEBREW POINT METEG-->
<!ENTITY SOFP SDATA [SOFPASUQ] --=HEBREW PUNCTUATION SOF PASUQ-->
<!ENTITY MAQAF SDATA [MAQAF] --=HEBREW PUNCTUATION MAQAF-->
<!ENTITY ETNAHTA SDATA [ETNAHTA] --=HEBREW ACCENT ETNAHTA-->
<!ENTITY SEGOLA SDATA [SEGOLA] --=HEBREW ACCENT SEGOL-->
<!ENTITY SHALSHELET SDATA [SHALSHELET] --=HEBREW ACCENT SHALSHELET-->
<!ENTITY ZAQEFQ SDATA [ZAQEFQATAN] --=HEBREW ACCENT ZAQEF QATAN-->
<!ENTITY ZAQEFG SDATA [ZAQEFGADOL] --=HEBREW ACCENT ZAQEF GADOL-->
<!ENTITY TIPEHA SDATA [TIPEHA] --=HEBREW ACCENT TIPEHA-->
<!ENTITY REVIA SDATA [REVIA] --=HEBREW ACCENT REVIA-->
<!ENTITY ZARQA SDATA [ZARQA] --=HEBREW ACCENT ZARQA-->
<!ENTITY PASHTA SDATA [PASHTA] --=HEBREW ACCENT PASHTA-->
<!ENTITY YETIV SDATA [YETIV] --=HEBREW ACCENT YETIV-->
<!ENTITY TEVIR SDATA [TEVIR] --=HEBREW ACCENT TEVIR-->
<!ENTITY GERESH SDATA [GERESH] --=HEBREW ACCENT GERESH-->
<!ENTITY GERESHM SDATA [GERESHMUQDAM] --=HEBREW ACCENT GERESH MUQDAM-->
<!ENTITY GERSHAYIM SDATA [GERSHAYIM] --=HEBREW ACCENT GERSHAYIM-->
<!ENTITY QARNEY SDATA [QARNEY] --=HEBREW ACCENT QARNEY-PARA-->
<!ENTITY TELISHAG SDATA [TELISHAGEDOLA] --=HEBREW ACCENT TELISHA GEDOLA-->
<!ENTITY PAZER SDATA [PAZER] --=HEBREW ACCENT PAZER-->
<!ENTITY MUNAH SDATA [MUNAH] --=HEBREW ACCENT MUNAH-->
<!ENTITY MAHAPAKH SDATA [MAHAPAKH] --=HEBREW ACCENT MAHAPAKH-->
<!ENTITY MERKHA SDATA [MERKHA] --=HEBREW ACCENT MERKHA-->
<!ENTITY MERKHAK SDATA [MERKHAKEFULA] --=HEBREW ACCENT MERKHA KEFULA-->
<!ENTITY DARGA SDATA [DARGA] --=HEBREW ACCENT DARGA-->
<!ENTITY QADMA SDATA [QADMA] --=HEBREW ACCENT QADMA-->
<!ENTITY TELISHAQ SDATA [TELISHAQETANA] --=HEBREW ACCENT TELISHA QETANA-->
<!ENTITY YERAHB SDATA [YERAHBENYOMO] --=HEBREW ACCENT YERAH BEN YOMO-->
<!ENTITY OLE SDATA [OLE] --=HEBREW ACCENT OLE-->
<!ENTITY ILUY SDATA [ILUY] --=HEBREW ACCENT ILUY-->
<!ENTITY DEHI SDATA [DEHI] --=HEBREW ACCENT DEHI-->
<!ENTITY ZINOR SDATA [ZINOR] --=HEBREW ACCENT ZINOR-->
<!ENTITY MASORAC SDATA [MASORACIRCLE] --=HEBREW ACCENT MASORA CIRCLE-->
<!ENTITY PASEQ SDATA [PASEQ] --=HEBREW PUNCTUATION PASEQ-->
<!ENTITY UPPERDOT SDATA [UPPERDOT] --=HEBREW MARK UPPER DOT-->
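Many of the descriptions in the comments above read like Unicode character names. As a rough aid for checking the table, the following Python sketch parses declarations of this shape and looks each description up with unicodedata.lookup. Treating the descriptions as Unicode names is an assumption of the sketch, and descriptions that are not exact Unicode names (for example "HEBREW LETTER FINALKAF") simply come back as None.

```python
import re
import unicodedata

# Matches: <!ENTITY name SDATA text --=description-->
DECLARATION = re.compile(
    r"<!ENTITY\s+(\S+)\s+SDATA\s+\S+\s+--=(.*?)-->", re.DOTALL
)


def parse_entities(dtd_text):
    """Map each entity name to a character, or to None when the
    description is not an exact Unicode character name."""
    result = {}
    for name, description in DECLARATION.findall(dtd_text):
        description = " ".join(description.split())  # join wrapped lines
        try:
            result[name] = unicodedata.lookup(description)
        except KeyError:
            result[name] = None
    return result


SAMPLE = """<!ENTITY ALEF SDATA ALEF --=HEBREW LETTER ALEF-->
<!ENTITY FINALKAF SDATA FINALKAF --=HEBREW LETTER FINALKAF-->
<!ENTITY PATAH SDATA [PATAH] --=HEBREW POINT PATAH-->"""

if __name__ == "__main__":
    for name, ch in parse_entities(SAMPLE).items():
        print(name, "U+%04X" % ord(ch) if ch else "no exact Unicode name")
```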
Jonathan (Jony) Rosenne
P O Box 33641, Tel Aviv, Israel