Re: Unicode Bidi Algorithm – Java reference implementation

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Mon, 19 Sep 2016 08:29:12 +0200

I note that there's a confusion in the introduction of UAX#9:

"On web pages, the explicit directional formatting characters (of all types
– embedding, override, and isolate) should be replaced by using the dir
attribute and the elements BDI and BDO."

The suggested replacements do not match the order of the listed types.
- embedding (with LRE/PDF or RLE/PDF) just uses the dir="ltr/rtl" attribute
on any element (except BDI and BDO)
- override (with LRO/PDF or RLO/PDF) uses BDO with
the dir="ltr/rtl" attribute
- explicit isolate (with LRI/PDI or RLI/PDI) uses BDI with
the dir="ltr/rtl" attribute
- "automatic" isolate (with FSI/PDI) uses BDI without any dir attribute
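
The mapping listed above can be sketched as a small converter. This is only
an illustration under my own assumptions: the function name controls_to_html
is hypothetical, "span" is an arbitrary stand-in for "any element" in the
embedding case, and the handling of unmatched terminators is simplified
compared to the full rules of UAX #9.

```python
from html import escape

# Opening control -> (HTML element, dir value or None), following the list above.
# "span" is an arbitrary stand-in for "any element" in the embedding case.
OPENERS = {
    "\u202A": ("span", "ltr"),  # LRE
    "\u202B": ("span", "rtl"),  # RLE
    "\u202D": ("bdo", "ltr"),   # LRO
    "\u202E": ("bdo", "rtl"),   # RLO
    "\u2066": ("bdi", "ltr"),   # LRI
    "\u2067": ("bdi", "rtl"),   # RLI
    "\u2068": ("bdi", None),    # FSI: BDI without any dir attribute
}
PDF = "\u202C"
PDI = "\u2069"


def controls_to_html(text):
    """Hypothetical helper: turn explicit bidi controls into HTML markup."""
    out = []
    stack = []  # element names of the currently open tags

    def close_top():
        out.append(f"</{stack.pop()}>")

    for ch in text:
        if ch in OPENERS:
            tag, direction = OPENERS[ch]
            attr = f' dir="{direction}"' if direction else ""
            out.append(f"<{tag}{attr}>")
            stack.append(tag)
        elif ch == PDF:
            # PDF ends the innermost embedding or override, never an isolate.
            if stack and stack[-1] != "bdi":
                close_top()
        elif ch == PDI:
            # PDI ends the innermost isolate and anything still open inside it.
            if "bdi" in stack:
                while stack[-1] != "bdi":
                    close_top()
                close_top()
        else:
            out.append(escape(ch))
    while stack:  # close whatever the text left unterminated
        close_top()
    return "".join(out)
```

A stray PDF or PDI with nothing to close is simply dropped here, which is one
possible choice among several.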

Two implicit directional characters (LRM or RLM) are also convertible to
overrides as an empty BDO element with dir="ltr/rtl". Only ALM has no
equivalent.

----
But in most cases, HTML documents should simply not use embeddings or
overrides at all. Isolates with BDI are much preferred and are in fact
simpler to manage than what section 6.4 suggests. That suggestion, placing
RLM or LRM before the separating punctuation, does not work reliably,
because it implies that you can predict the implicit reading direction of
the whole list, whose ordering normally depends on the context or on the
document containing the list. It is much simpler to isolate each list
element and then pack the list using unmarked punctuation.

An example of this is found on international wikis that must display an
inter-language bar to navigate to other translated versions of the same
page. The same template is used on all pages, and the list of languages
cannot be predicted and may evolve over time: it contains LTR or RTL
language names at unpredictable positions anywhere in the list, formatted
with the same separator within a single inline span, in a paragraph
starting with a translatable introductory heading, and you cannot predict
which language name will occur after that separator. Using BDI (without
even needing any dir="ltr/rtl" attribute) or FSI/PDI to isolate each
language name works much better than unconditionally using some static RLM
or LRM before the separating punctuation. Note that there is no such
punctuation at the start of the list, so the ordering of the first element
is not set correctly unless there is also an RLM or LRM before that first
element, which may then render incorrectly.
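
As a rough sketch of the inter-language bar case, each name can be isolated
on its own and the separator left unmarked. The function names, the sample
language list and the " | " separator are my own assumptions, not taken from
any actual wiki template.

```python
from html import escape

FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def join_isolated_text(names, separator=" | "):
    """Plain-text variant: wrap every item in FSI ... PDI."""
    return separator.join(f"{FSI}{name}{PDI}" for name in names)


def join_isolated_html(names, separator=" | "):
    """HTML variant: wrap every item in a BDI element without a dir attribute."""
    return separator.join(f"<bdi>{escape(name)}</bdi>" for name in names)


# The direction of each name does not need to be known in advance.
languages = ["English", "עברית", "Français", "العربية"]
print(join_isolated_html(languages))
```

The first item needs no special treatment, and the order of the items follows
the direction of the surrounding paragraph, whatever that turns out to be.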

The best and most flexible solution is to use "automatic" isolates for each
list item (FSI/PDI in plain-text documents, or BDI elements without any dir
attribute in HTML documents). The same is true when inserting quotations
(including the title of another document or the name of an author), or when
formatting translatable text containing "placeholder variables" whose
content is generated separately. BDI elements without any dir attribute can
efficiently replace SPAN elements, and can still have their own optional
formatting styles (colors, font families, font size, line height, font
styles and weight, visual effects...), title attributes (to give hints to
readers about what the isolated value is used for), or identifiers (useful
to generate stable anchors that work across all translations of the
document).
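
The "placeholder variables" case could look like the following sketch. The
helper name format_isolated and the sample message are hypothetical, and the
translated template itself is assumed to be trusted markup, so only the
inserted values are escaped.

```python
from html import escape


def format_isolated(template, **values):
    """Hypothetical helper: fill {name} placeholders, isolating each value."""
    wrapped = {
        name: f"<bdi>{escape(str(value))}</bdi>"
        for name, value in values.items()
    }
    return template.format(**wrapped)


# The translator writes the template; the values are generated separately.
print(format_isolated("Quoted from {title} by {author}",
                      title="مقدمة", author="J. Smith"))
```

Because every value is isolated, the same template works whether the title or
the author name is LTR or RTL, without the translator adding any RLM or LRM.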

There are also CSS styles using the unicode-bidi property, but they should
be completely avoided in HTML (these styles are better inferred from BDI
elements).

2016-09-19 2:16 GMT+02:00 Ken Whistler <kenwhistler_at_att.net>:
>
> On 9/17/2016 10:26 AM, Deepak Jois wrote:
>
>> I now need to make the updates to support the changes in Unicode 8.0,
>> and I am finding it a bit hard to grok the changes in C at a glance.
>>
>>
> The UBA 7.0 --> UBA 8.0 changes were rather subtle. They did not change
> much about the gross behavior of the algorithm, but there were some fixes
> for edge cases in a couple rules. Also, the specification of behavior on
> stack overflow became exact, rather than implementation-defined.
>
> The C bidi reference code is a bit complicated, because it supports *all*
> UBA versions from 6.2 through 8.0, which means it has to special case rule
> processing by versions when the specification itself changes.
>
> If you diff the 7.0 version of brrule.c and the 8.0 version of brrule.c
> you'll find the heart of the differences there, along with explanations in
> comments for the changes. The new function br_SetBracketPairBC handles an
> edge case for combining marks following a bracket. The code using a new
> flag testONisNotRequired deals with an edge case for the current Bidi_Class
> of brackets being tested for pairing. Changes in br_PushBracketStack are
> involved in the need to keep the pre-8.0 behavior as it was for earlier
> versions of bidiref, but allowing for explicit behavior for stack overflow
> for 8.0.
>
> It may also help to compare the 7.0 and 8.0 versions of UAX #9 itself, so
> you can see the textual changes in the specification of the rules. Try
> diffing:
>
> http://www.unicode.org/reports/tr9/tr9-31.html (7.0)
> http://www.unicode.org/reports/tr9/tr9-33.html (8.0)
>
> The significant changes there are in BD11, BD14, BD15, BD16, and in rules
> X5a, X5b, X6a, and N0. (The rest of the changes in the updated document are
> cosmetic.)
>
> --Ken
>
>
Received on Mon Sep 19 2016 - 01:29:59 CDT
