L2/10-427

Comments on Public Review Issues
(August 3, 2010 - October 27, 2010)

The sections below contain comments received on the open Public Review Issues and other feedback as of October 27, 2010, since the previous cumulative document was issued prior to UTC #124 (August 2010).

Contents:

172 Proposed Update UTS #46: Unicode IDNA Compatibility Processing
174 Proposed Draft UTR #49: Unicode Character Categories
175 CLDR 1.9 Collation Changes
Feedback on TUS 6.0 Beta and Charts
Feedback on Encoding Proposals
Closed Public Review Issues
Other Reports


172 Proposed Update UTS #46: Unicode IDNA Compatibility Processing

Date/Time: Wed Sep 15 07:40:54 CDT 2010
Contact: kent.karlsson14@telia.com
Name: Kent Karlsson
Report Type: Public Review Issue
Opt Subject: pri 172 comment

The mapping file has mappings for Hangul letters (abbreviated here):

3131 ; mapped ; 1100 # 1.1 HANGUL LETTER KIYEOK
3132 ; mapped ; 1101 # 1.1 HANGUL LETTER SSANGKIYEOK
...
318C ; mapped ; 1194 # 1.1 HANGUL LETTER YU-I
318D ; mapped ; 119E # 1.1 HANGUL LETTER ARAEA
318E ; mapped ; 11A1 # 1.1 HANGUL LETTER ARAEAE

These mappings are inappropriate (and I'd say useless and surprising). If anything, mappings to full Hangul syllable blocks with Hangul Jamo fillers would have been appropriate (e.g. 3131 -> <1100,1160>). However, the Jamo fillers are disallowed in IDN.

I would therefore suggest that the Hangul letters (a subset of the ranges U+3131..U+318E and U+FFA1..U+FFDC) be disallowed (even though they were mapped in IDNA2003).

Date/Time: Fri Oct 22 05:13:05 CDT 2010
Contact: taliskermoon@hotmail.co.uk
Name: John Daw
Report Type: Other Question, Problem, or Feedback
Opt Subject: procedural question re: IDNA 2008 implementation

NOTE: I believe the Editorial Committee has answered this query.

Dear Sir or Madam,

I was looking at the Unicode.org website, and am a little unclear about what impact the IDNA 2008 policy will have on domain name registrants who have a domain that does not fall under the policy's permissible code points, particularly symbol-based domain names.

The page http://icann.org/en/topics/idn/fast-track/idna-protocol-2003.txt explains what code points are permissible, but what, in practice, will it mean if someone types e.g. €.com into their browser under IDNA 2008? Will it not even allow the domain registrant to satisfy the query by, perhaps, forwarding the user on to a different domain name?

I'm curious to know how such domain names will be disabled and prevented from being shown or resolving.

I will hope to hear your reply in due course.

Regards,

John Daw

Date/Time: Tue Oct 26 12:37:04 CDT 2010
Contact: steffen@earthlingsoft.net
Name: Steffen Kamp
Report Type: Public Review Issue
Opt Subject: 172 Proposed Update UTS #46: Unicode IDNA Compatibility Processing

Hi,

I am not sure whether the review period is still open; however, I have some questions regarding Public Review Issue 172, Proposed Update UTS #46: Unicode IDNA Compatibility Processing.

The current draft at http://www.unicode.org/reports/tr46/tr46-4.html (and also earlier versions) does not clearly state what the output of the ToASCII process should be. In Section 4.2, ToASCII, step 3 states: "Convert each label with non-ASCII characters into Punycode [RFC3492]. This may record an error." What should then be done with the labels (if no error occurred)? Should Punycode-encoded labels be prefixed with the ACE prefix "xn--"? Should the individual labels be concatenated using U+002E FULL STOP?
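For illustration, here is a minimal sketch of one plausible reading of step 3, using Python's standard "punycode" codec (RFC 3492). It is an illustration of the question raised above, not a normative implementation: it omits the mapping and validity steps of UTS #46 and simply shows what ACE prefixing and label concatenation would look like.

```python
def to_ascii_label(label: str) -> str:
    """Encode a single, already-processed label.

    Labels containing non-ASCII characters are Punycode-encoded
    (Python's "punycode" codec implements RFC 3492) and prefixed
    with the ACE prefix "xn--"; ASCII labels pass through unchanged.
    """
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

def to_ascii(domain: str) -> str:
    # Rejoin the individual labels using U+002E FULL STOP.
    return ".".join(to_ascii_label(label) for label in domain.split("."))

print(to_ascii("bücher.de"))   # xn--bcher-kva.de
print(to_ascii("fass.de"))     # fass.de
```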

I also have some doubts regarding the conformance test file idnaTest.txt:

Test case 2 in this file is as follows:

B; FASS.DE; fass.de;

The fourth column here is empty, implying that the result of the ToASCII conversion should be identical to the source, in this case "FASS.DE". However, to my understanding, the first step in the ToASCII conversion is the processing from Section 4 (according to the idnaMappingTable.txt file), which maps uppercase ASCII to lowercase, resulting in "fass.de". I therefore do not understand how the ToASCII conversion of "FASS.DE" can result in the source string. (The same applies to several other test cases.)

The test file states:

"# Column 3: toUnicode - the result of applying toUnicode to the source, using the specified type
# Column 4: toASCII - the result of applying toASCII to the source, using nontransitional"

However, the specification of ToUnicode in section 4.3 of UTS46 states "Apply the Nontransitional Processing" while ToASCII in section 4.2 may use either transitional or nontransitional processing. So it seems to me that it should be the other way round: column 3 (toUnicode) should always use nontransitional processing, while column 4 (toASCII) should use the type (transitional/nontransitional) given in column 1.

Best regards,
Steffen

174 Proposed Draft UTR #49: Unicode Character Categories

No feedback was received via the reporting form this period.

175 CLDR 1.9 Collation Changes

Date/Time: Wed Sep 15 08:31:13 CDT 2010
Contact: kent.karlsson14@telia.com
Name: Kent Karlsson
Report Type: Public Review Issue
Opt Subject: pri 175 comment

NOTE: Kent reported that this was fixed in CLDR 1.9, Changeset [5130] by mdavis http://www.unicode.org/cldr/trac/changeset/5130 so the report is included here only for completeness.

In http://www.unicode.org/review/pr-175/pinyinCollation.txt
(picking one example, there are many more instances):

<*一弌伊衣医吚𠰄壱𢨮依祎咿𠲔𠲖㛄𡜬㳖洢䧇𣐿𣢷𧉅悘猗䚷郼铱壹㥋揖欹䒾蛜㾨禕㙠嫛𢊘漪稦銥嬄𣘦噫𠿣夁𢣉瑿䃜𧜤鹥繄䫑檹毉䉗䔱𧫦醫𪁚黟譩𡄵𩥯𩮵䪰鷖𩕲黳𧮒𪈨⼀⾐ #yī

There is a primary difference between 一 and ⼀, and they are not
next to each other either.

In http://www.unicode.org/review/pr-175/strokeCollation.txt
they have just a tertiary difference:

&一<<<⼀

Likewise for 衣 and ⾐: &衣<<<⾐.

<*一弌伊衣医吚𠰄壱𢨮依祎咿𠲔𠲖㛄𡜬㳖洢䧇𣐿𣢷𧉅悘猗䚷郼铱壹㥋揖欹䒾蛜㾨禕㙠嫛𢊘漪稦銥嬄𣘦噫𠿣夁𢣉瑿䃜𧜤鹥繄䫑檹毉䉗䔱𧫦醫𪁚黟譩𡄵𩥯𩮵䪰鷖𩕲黳𧮒𪈨   #yī

would suffice, dealing with ⼀ and ⾐ later on at the tertiary level.

For the radical/stroke collation, there are lines like (just
quoting one example here)

<*⼀一 #'1.0'

Later on there are added tertiary tailorings like

&一<<<㆒
&一<<<⼀

That makes

<*⼀一 #'1.0'

(with a primary difference between those two characters)
superfluous and confusing.

<*一    #'1.0'

would suffice, dealing with ⼀ later on at the tertiary level.

However, it seems that the tertiary tailorings in
pinyinCollation.txt and radicalStrokeCollation.txt are subsets
of the tertiary tailorings in strokeCollation.txt. In particular
(using ... here since the list is rather long):

&母<<<⺟
&龟<<<⻳
&一<<<⼀
&丨<<<⼁
&丶<<<⼂
&丿<<<⼃
&乙<<<⼄
&亅<<<⼅
&二<<<⼆
&亠<<<⼇
&人<<<⼈
&儿<<<⼉
&入<<<⼊
&八<<<⼋
&冂<<<⼌
...
&齒<<<⿒
&龍<<<⿓
&龜<<<⿔
&龠<<<⿕

seem to be missing from the first two files mentioned.
They should be present in all three files.
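For readers unfamiliar with the pairs being tailored above: the Kangxi radical characters are compatibility characters that normalize (NFKC) to their corresponding unified ideographs, which is why a tertiary-only difference is the expected tailoring. A minimal check using Python's unicodedata module (which reflects the interpreter's bundled UCD version, not necessarily Unicode 6.0):

```python
import unicodedata

# U+2F00 KANGXI RADICAL ONE is a distinct code point from the unified
# ideograph U+4E00, but its compatibility (NFKC) normalization is that
# ideograph; collation tailorings typically give such pairs only a
# tertiary difference, as the report above argues.
radical, ideograph = "\u2F00", "\u4E00"

assert radical != ideograph                                 # distinct code points
assert unicodedata.normalize("NFKC", radical) == ideograph  # compat-equivalent

print(unicodedata.name(radical))    # KANGXI RADICAL ONE
print(unicodedata.name(ideograph))  # CJK UNIFIED IDEOGRAPH-4E00
```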

Feedback on TUS 6.0 Beta and Charts

Date/Time: Wed Sep 1 11:05:16 CDT 2010
Contact: andrewcwest@gmail.com
Name: Andrew West
Report Type: Public Review Issue
Opt Subject: Names List currency symbol characters

In NamesList-6.0.0d10.txt

0024 DOLLAR SIGN
	* other currency symbol characters: 20A0-20B8

00A4 CURRENCY SIGN
	* other currency symbol characters: 20A0-20B5

Both should be updated to:
	* other currency symbol characters: 20A0-20B9

Date/Time: Sun Sep 5 17:02:29 CDT 2010
Contact: cewcathar@hotmail.com
Name:
Report Type: Public Review Issue
Opt Subject: 172 Proposed Update UTS #46: Unicode IDNA Compatibility Processing

Hi, I only have one very trivial comment on grammar:

1.3.2 Deviations, first paragraph: "There are a few situations where the use of IDNA2008 without compatibility mapping will result in the resolution of IDNs to different IP addresses than in IDNA2003, unless the registry or registrant takes special action."

{ COMMENT: word order -- it's better not to separate "different" and the accompanying preposition (in technical writing, for clarity; this does not apply to literary writing); also we previously discussed "different from" "different than" and "different to" on the ltru list I think; "different than" was pretty much used in American English, "different to" in British/U.K. English, and "different from" thus was universal; so change "different than" to "different from"; finally insert "those" to refer back to "IP addresses" for technical clarity }

=>

"There are a few situations where the use of IDNA2008 without compatibility mapping will result in the resolution of IDNs IP addresses different from those in IDNA2003, unless the registry or registrant takes special action."

* * *

COMMENTS ABOUT CONTENT

{ I'll have a few more of these I hope when I check #39 . . . }

* * *

6, Mapping Table Derivation, Step 3, par 3, Characters that are disallowed in IDNA2003 (Step 3.3 above)

"Bidi Control characters ◦U+200E LEFT-TO-RIGHT MARK..U+200F RIGHT-TO-LEFT MARK ◦U+202A LEFT-TO-RIGHT EMBEDDING..U+202E RIGHT-TO-LEFT OVERRIDE"

{ COMMENT: Hope these will continue to be disallowed . . . I guess they are through 2008 . . . }

* * *

Also, 8, Format, par 3, 4th item

"Bn for Bidi (in IDNA2008)"

{ COMMENT: Are the bidi errors defined anywhere? Should these be defined? }

Best,

C. E. Whitehead
cewcathar@hotmail.com

Date/Time: Tue Sep 7 15:00:36 CDT 2010
Contact: liancu@microsoft.com
Name: Laurentiu Iancu
Report Type: Error Report
Opt Subject: Minor names-list issue in Playing Cards block

NamesList-6.0.0d10.txt lists the jokers as part of the diamonds and clubs suits, respectively. As jokers do not normally belong to any particular suit, they should be listed by themselves in separate sections. This is minor and can wait until a future revision of the names list.

Date/Time: Wed Sep 8 15:42:22 CDT 2010
Contact: liancu@microsoft.com
Name: Laurentiu Iancu
Report Type: Error Report
Opt Subject: Currency Symbols code chart: inconsistent casing

All existing currency names in the textual annotations in the Unicode 6.0 Currency Symbols code chart are spelled in lowercase. Rupee should follow suit.

Date/Time: Thu Sep 9 04:42:46 CDT 2010
Contact: bpjonsson@gmail com
Name: Benct Philip Jonsson
Report Type: Error Report
Opt Subject: Swedish spellings for 0267

I found a factual error in NamesList.txt.

http://unicode.org/Public/UNIDATA/NamesList.txt

> 0267 LATIN SMALL LETTER HENG WITH HOOK
>     * voiceless coarticulated velar and palatoalveolar fricative
>     * "tj" or "kj" or "sj" in some Swedish dialects

While there are many spellings in Swedish that can be pronounced as 0267, "sj" being considered the most typical, and there are many socially and regionally distributed allophones of this phoneme, the spellings "tj" and "kj" are *never* pronounced as 0267. They are pronounced as 0255 (voiceless alveolo-palatal fricative), which is a distinct phoneme.

Confusion may have arisen because there are spellings "stj" and "skj" which indeed are pronounced as 0267, but these are distinct graphies. Also, there is a perhaps growing minority of people who have a voiceless velar fricative 0078 or uvular fricative 03C7, without coarticulation, for the "sj" phoneme and a voiceless postalveolar fricative 0283 for the "tj" phoneme, while more generally the voiceless postalveolar fricative 0283 is heard as an allophone of the "sj" phoneme. These things can be hard to sort out even for native speakers with little knowledge of phonetics!

If you need professional confirmation of what I say, please follow the Google link below. The top hit will lead to a list of addresses of phoneticians at Stockholm University. (The direct URL has a very long and strange parameter string, so I opted for this indirect method instead!)

http://www.google.com/search?q=fonetik+site%3Ahttp%3A%2F%2Fwww2.su.se%2Fsukat

Yours,

/bpj

Date/Time: Fri Oct 22 07:14:00 CDT 2010
Contact: steffen@earthlingsoft.net
Name: Steffen Kamp
Report Type: Problems / Feedback about website
Opt Subject: Character counts wrong?

NOTE: This has been taken care of by the Editorial Committee.

Hi,

I am wondering if the character counts in the "Character Assignment Overview" of section D. of the Unicode 6.0 page are wrong: http://unicode.org/versions/Unicode6.0.0/#Character_Additions

Apart from an obvious error (the sum of BMP + Supplementary in the "Graphic" row differs from the "Total" value), I tried to reproduce the individual numbers in this table; while I got the same values for the "Totals" column, I came up with the following differences:

Graphic - BMP: 54494, Supplementary: 54748
Format - BMP: 36, Supplementary: 106
Reserved - BMP: 2459, Supplementary: 862622

Best regards, Steffen
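The counts in that table can be re-derived mechanically from the UCD. A minimal sketch in Python follows; note two assumptions: the unicodedata module reflects the interpreter's bundled UCD version (so the totals will match the Unicode 6.0 table only under an interpreter bundling that version), and the grouping of General_Category values into "Graphic"/"Format"/"Reserved" is the editor's reading of what the table counts (in particular, gc=Cn here includes the 66 noncharacters, which the table's "Reserved" row excludes).

```python
import unicodedata

def tally() -> dict:
    """Tally all code points by rough Character Assignment Overview groups.

    Graphic: gc in L*, M*, N*, P*, S*, Zs
    Format:  gc in Cf, Zl, Zp
    Cn:      unassigned (includes noncharacters, unlike the table)
    """
    counts = {"graphic": 0, "format": 0, "unassigned": 0}
    for cp in range(0x110000):
        cat = unicodedata.category(chr(cp))
        if cat[0] in "LMNPS" or cat == "Zs":
            counts["graphic"] += 1
        elif cat in ("Cf", "Zl", "Zp"):
            counts["format"] += 1
        elif cat == "Cn":
            counts["unassigned"] += 1
    return counts

print(unicodedata.unidata_version, tally())
```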

Feedback on Encoding Proposals

Date/Time: Mon Oct 4 11:57:56 CDT 2010
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/10-353 Combining Triple Diacritics

Fig. 2 plainly shows a bow above *four* letters, not three, as is confirmed by the remark "Roman capitals, lightly cut, with three ligatures, including one of *four* letters [emphasis mine]". The caption of Fig. 2 does not point this out explicitly, though it may be implied.

If the evidence for triple ligatures is scarce, the evidence for quadruple ligatures must be as rare as hens' teeth. Nevertheless, the existence of such things would tend to promote Solution B rather than adding yet more characters to Solution A to handle wider and wider bows.

Date/Time: Fri Oct 22 16:03:52 CDT 2010
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: Mahajani virama L2/10-377

I see no point in encoding an invisible Mahajani virama just to handle the SH+R+I ligature. It makes more sense to me to encode SRA instead.

Since the script is really alphabetic rather than an abugida, without vowel marks, I think the A's should be omitted from the names: MAHAJANI LETTER S, MAHAJANI LETTER A, etc.

Date/Time: Fri Oct 22 16:32:48 CDT 2010
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: Naxi symbols don't represent text L2/10-396

Naxi pictograms should be added to Unicode, if at all, as *symbols* used as mnemonics for important stories in Naxi culture. They do not represent words in running text; one "reads" a pictogram text by remembering the story and using the pictograms as aides-memoires.

Date/Time: Fri Oct 22 16:39:38 CDT 2010
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: English Phonotypic Alphabet L2/10-403

The fact that some EPA letters are not case-paired does not make them second-class: an EPA font must support them, whereas no font needs to support ligature characters. The rationale for encoding such letters with the ligatures is therefore unsatisfactory: they should be encoded with the rest of the EPA letters.

Closed Public Review Issues

Date/Time: Fri Sep 24 02:54:47 CDT 2010
Contact: ernestvandenboogaard@hotmail.com
Name: Ernest van den Boogaard
Report Type: Error Report
Opt Subject: UAX #9 BiDi: Summary vs content

RE UAX #9, http://www.unicode.org/reports/tr9/

Current version: version 5.2.0 (published).

Proposed version: "Unicode 6.0.0 draft2", Revision 22, Proposed Update (This revision is not published formally, and public review is closed, but it is technically available on the site as http://www.unicode.org/reports/tr9/tr9-22.html).

Statement: The Summary does not reflect the content. Severity: Low. Documentation quality affected only.

The Summary of UAX #9, in both versions, reads: "This annex describes specifications for the positioning of characters flowing from right to left, such as Arabic or Hebrew."

In my reading of the UAX, it is about combining R-to-L and L-to-R characters in one text. Above all, that is where the "Bi" in the title comes from. This way the summary does not point to the essence of the matter.

On a practical level, the summary is trivial: an implementer of Unicode working purely in Hebrew or Arabic will already do the positioning correctly natively, probably without looking at this Annex. She may even ask: "So I already know how to do that, but please tell me where to look when I want to introduce Latin text into my Hebrew text".

I suggest changing the summary into something along the lines of: "This annex describes specifications for the positioning of characters when both directions right-to-left and left-to-right are present in one text, such as when combining Arabic and Latin scripts." Of course the 'combining' or 'both' is the essence.

Best regards, and have a nice 6.0 Publication Day
Ernest van den Boogaard

Other Reports

Date/Time: Mon Sep 20 16:55:42 CDT 2010
Contact: allanb@thinkcomputer.com
Name: Allan Bonadio
Report Type: Error Report
Opt Subject: OCR dash vs customer acct num

There seems to be a transposition of characters U+2448 and U+2449 (see http://www.unicode.org/charts/PDF/U2440.pdf). In fact the '=' lines seem to hint at the right answer. (I've been doing ACH and reading checks so I know.)

The glyph listed for 2448 is the 'on-us' ⑈ indicator for an account number within a particular bank. Its name, however, is listed as 'OCR Dash'. If you look at any check in your checkbook you can see that symbol, and also U+2446 for the routing/transit number (which is correct).

The glyph listed for 2449 is the 'OCR Dash' ⑉ character - it looks like a dash. But the name is 'OCR Customer Account Number'.

Some corroboration: http://www.barcodesoft.com/e13bmapping.htm
http://en.wikipedia.org/wiki/Magnetic_ink_character_recognition
http://www.printerm.com/fonts2C.htm
http://mindprod.com/jgloss/micr.html

FEEDBACK FROM KEN WHISTLER ON THE ABOVE REPORT:

FWIW, the proximal source of the 4 characters 2446..2449 is the IBM Graphic Character Identification System. The 1988 version lists:

S0600000 Transit Symbol, MICR
S0610000 Amount Symbol, MICR
S0620000 On Us Symbol, MICR
S0630000 Dash Symbol, MICR

Those then correspond to 2446..2449, with identical glyphs.

You can see that the aliases in our name list derive from the IBM corporate names for the glyphs.

I don't know where Joe got what we used for the Unicode names for Unicode 1.0, where we ended up mixing up the names for 2448 and 2449.

This will probably require the introduction of formal aliases, because we obviously cannot fix the names, and they are actively misleading about the identity of the characters.

This report should be filed as Other Feedback and go to the UTC, so it can be used as the basis of a future decision. (I'd suggest including this background information I'm gathering here, too, so the UTC will know where these came from.)

This particular set of 4 should also get better treatment in the names list, as they are unrelated to the other OCR characters, which are a subset of OCR-A. These four are MICR symbols (Magnetic Ink Character Recognition).

http://en.wikipedia.org/wiki/Magnetic_ink_character_recognition 

And ISO 1004:1995.

What we should do is check ISO 1004:1995 (although I'm not going to pay CHF 158,00 just to take a look) and see what terms it uses. If those won't work, then I'd suggest formal aliases:

2448 MICR ON US SYMBOL
2449 MICR DASH SYMBOL

and giving a regular alias of "amount" to 2447.

---Ken
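The name/glyph swap described above is observable programmatically. A small sketch using Python's unicodedata module (the immutable names are stable across versions; whether the alias lookup succeeds depends on the UCD version bundled with the interpreter, so it is guarded here):

```python
import unicodedata

# The immutable character names preserve the transposition reported
# above: U+2448 is named OCR DASH although its glyph is the MICR
# "on us" symbol, and U+2449 is named OCR CUSTOMER ACCOUNT NUMBER
# although its glyph is the MICR dash.
assert unicodedata.name("\u2448") == "OCR DASH"
assert unicodedata.name("\u2449") == "OCR CUSTOMER ACCOUNT NUMBER"

# Formal aliases of the kind proposed here were later published in
# NameAliases.txt; recent Python versions resolve them via lookup().
try:
    print(unicodedata.lookup("MICR ON US SYMBOL") == "\u2448")
except KeyError:
    print("alias not present in this interpreter's UCD")
```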

Date/Time: Sun Oct 10 10:59:08 CDT 2010
Contact: bobek@boxpl.com
Name: Michael Bobeck
Report Type: Feedback on an Encoding Proposal
Opt Subject: adding missing GREEK CAPITAL LETTER YOT

Please include in the final Unicode 6.0 the missing GREEK CAPITAL LETTER YOT; we currently have only the small GREEK LETTER YOT:

http://www.fileformat.info/info/unicode/char/3F3/index.htm 

so the capital is still missing.

At the least, please include a mapping of the lowercase GREEK LETTER YOT to the combination GREEK CAPITAL LETTER IOTA + COMBINING DOT ABOVE RIGHT.

Michael Bobeck

Date/Time: Fri Oct 15 05:35:48 CDT 2010
Contact: barun_sahu@yahoo.com
Name: Barun Kumar Sahu
Report Type: Feedback on an Encoding Proposal
Opt Subject: Need of encoding blank consonant character in Indic scripts

In Indic scripts, there is a need for a blank consonant character. The character should be able to take matras (vowel signs), including the halant sign. The character would carry the overline (headstroke).

NEED: In word puzzles (such as crosswords), there are instances where a missing consonant, consonant cluster, or vowel is to be filled in by the player of the game. The dotted circle character (U+25CC) is unable to achieve this result; a separate blank consonant character can. This new character would take matras (vowel signs) and the halant sign as usual for consonants and consonant clusters.

There is another need for this character: Sometimes in very long words, it is necessary to club two or three letters (consonants, consonant clusters [with matra, if any] and vowels) together within the same word. For example, the word "pashchimotthana" can be written as "pashchi mot thana" to show its pronunciation. Of course, space or hyphen characters cannot be used. We can use the blank consonant character.

Date/Time: Fri Oct 15 17:29:49 CDT 2010
Contact: andy.heninger@gmail.com
Name: Andy Heninger
Report Type: Error Report
Opt Subject: UTS 18 definition for [:word:]

UTS-18 has a standard recommendation for the property [:word:] of

\p{alpha}
\p{gc=Mark}
\p{digit}
\p{gc=Connector_Punctuation}

In a discussion with Mark Davis and Markus, Markus suggested that XID_Continue might be a better definition. They're similar.
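For reference, the recommended property union can be approximated with General_Category tests. A rough sketch in Python follows; note the assumption that \p{alpha} is approximated here by gc=L* plus Nl, whereas the real Alphabetic property also includes Other_Alphabetic characters, so this is an illustration of the definition's shape, not an exact implementation:

```python
import unicodedata

def is_word_char(ch: str) -> bool:
    """Rough approximation of the UTS #18 [:word:] recommendation."""
    cat = unicodedata.category(ch)
    return (cat.startswith("L") or cat == "Nl"   # ~ \p{alpha}
            or cat.startswith("M")               # \p{gc=Mark}
            or cat == "Nd"                       # ~ \p{digit}
            or cat == "Pc")                      # \p{gc=Connector_Punctuation}

print(is_word_char("a"), is_word_char("_"), is_word_char("9"))  # True True True
print(is_word_char("-"), is_word_char(" "))                     # False False
```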

Date/Time: Fri Oct 15 18:14:12 CDT 2010
Contact: roozbeh@htpassport.com
Name: Roozbeh Pournader
Report Type: Error Report
Opt Subject: ScriptExtensions.txt to mention Mandaic for U+0640 ARABIC TATWEEL

The UTC decided not to encode a separate kashida for Mandaic, deciding instead that the Arabic Tatweel should be used with it. This means that ScriptExtensions.txt needs to be extended to mention Mandaic too. New line suggested:

0640 ; Arab Mand Syrc # Lm ARABIC TATWEEL

Date/Time: Fri Oct 15 19:42:57 CDT 2010
Contact: roozbeh@htpassport.com
Name: Roozbeh Pournader
Report Type: Error Report
Opt Subject: U+06FF needs glyph or joining group changed

U+06FF ARABIC LETTER HEH WITH INVERTED V has a joining group of KNOTTED HEH (the other member of the group is the actual Knotted Heh, U+06BE ARABIC LETTER HEH DOACHASHMEE). But its reference glyph looks like the normal Heh, U+0647 ARABIC LETTER HEH. One of these needs to be fixed.

Unfortunately, the original proposal (L2/01-427) does not show the character in medial or final forms (but only initial and isolated) to help decide which way we should go. The best hint is the Hehs seen in the document's sample texts, where final forms look like knotted Heh but medial forms look like normal Heh in some cases (page 6, third Arabic line, first word) and like knotted Heh in others (page 11, Parkari portion, line 9, third word).

Considering that the original proposal asked for a KNOTTED HEH group, that a medial shape like normal Heh is probably considered an acceptable variant of knotted Heh in South Asia, and for stability reasons, I would go for keeping the joining group but fixing the glyph to use a base like U+06BE.

Date/Time: Mon Oct 18 16:51:59 CDT 2010
Contact: asmus@unicode.org
Name: (optional)
Report Type: Error Report
Opt Subject: Discrepancy re: @missing directive in Casefolding files

There's an @missing directive in the comment section of UCD files that gives default values.

There are some inconsistencies in usage. The common form is:

# @missing: <code_range>; <value>

for files where col1 contains a code or range and col2 contains a property value. Instead of an actual value, a pseudo-value like <code point> is often used.

In CaseFolding.txt we have

# @missing 0000..10FFFF; <codepoint>

First, this is missing the ":", one of only two files to do so. Second, this file is the only one where the <codepoint> isn't spelled <code point>. These two are nuisance differences that should be removed to make sure that simple regex searches over these files don't fail.

Third, the data lines in this file actually have three columns (ignoring the trailing comment).

0041; C; 0061; # LATIN CAPITAL LETTER A

What, then, is the default value for the field in the second column? Shouldn't the default be stated as:

# @missing: 0000..10FFFF; C; <code point>

In SpecialCasing.txt we have

# @missing 0000..10FFFF; <slc>; <stc>; <suc>

Again, the missing ":"
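The nuisance differences matter in practice because @missing lines are usually harvested with simple pattern matching. A minimal sketch showing how a regex for the common form silently skips the variant lines described above:

```python
import re

# Matches the common @missing form:
#   # @missing: <code_range>; <value>
# The variant in CaseFolding.txt / SpecialCasing.txt (no ":" after
# @missing) slips through this scan, exactly as the report describes.
COMMON = re.compile(
    r"#\s*@missing:\s*([0-9A-Fa-f]+)\.\.([0-9A-Fa-f]+);\s*(.+)")

good = "# @missing: 0000..10FFFF; <code point>"
bad = "# @missing 0000..10FFFF; <codepoint>"   # CaseFolding.txt form

print(bool(COMMON.match(good)))          # True
print(bool(COMMON.match(bad)))           # False (colon missing)
print(COMMON.match(good).group(3))       # <code point>
```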

Date/Time: Fri Oct 22 11:50:10 CDT 2010
Contact: andy.heninger@gmail.com
Name: Andy Heninger
Report Type: Other Question, Problem, or Feedback
Opt Subject: Em dash line breaking opportunities in UAX #14 and Spanish

I received this via email:

from Jorge <jorge@estudiofenix.com>
to andy.heninger@gmail.com
date Thu, Oct 21, 2010 at 11:59 PM
subject Em dash line breaking opportunities in UAX #14 and Spanish

Hi!

The "Unicode Standard Annex #14: Unicode Line Breaking Algorithm" mentions:

> > The em dash is used to set off parenthetical text. Normally, it is used without spaces. However, this is language dependent. For example, in Swedish, spaces are used around the em dash. Line breaks can occur before and after an em dash. Because em dashes are sometimes used in pairs instead of a single quotation dash, the default behavior is not to break the line between even though not all fonts use connecting glyphs for the em dash. In Spanish it is the parenthetical block that is surrounded by spaces ―just like here― when it exists in the middle of the sentence ―you just do not close it when at the end.

(I know the use above is incorrect in English but I wanted to illustrate the use in Spanish)

With the above rule in mind, in Spanish you should **never** break the line between the em dash and the non-space character that sits next to it, exactly the opposite of what Unicode declares:

> > Break Opportunity Before and After

As a result, pretty much any engine that displays Spanish text on screen (including of course any browser or ebook reader) is leaving orphan em dashes at the end of lines. No single ebook or webpage survives this.

A rule for English should not need to conflict with a rule for Spanish (I cannot tell for other languages): the em dash should only provide Break Opportunity Before and After if there are no spaces on either side. If there is a space on either side (which will never happen in English), the rule should be the opposite.

If there are spaces on both sides, the rule is really of no importance, because then the space itself provides the break opportunity on either side.

The only workaround is to manually litter all em dashes with zero-width no-break spaces on both sides, which is rather gross.

Any hope this may be revised in the future (or that it is even technologically feasible for today's text engines)?

Best,

-- Jorge
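The space-sensitive rule proposed in the report above can be sketched as a small predicate. Function and constant names here are illustrative, not from UAX #14; this is a sketch of the proposal, not of the current algorithm:

```python
EM_DASH = "\u2014"  # U+2014 EM DASH

def break_allowed_at_dash(text: str, i: int) -> bool:
    """May a line break occur immediately adjacent to text[i]?

    Proposed rule: a break next to an em dash is allowed only when
    the dash has no space on either side (the English convention).
    A spaced dash (the Spanish convention) suppresses the break
    opportunity at the dash; the space itself provides it instead.
    """
    if text[i] != EM_DASH:
        return True
    space_before = i > 0 and text[i - 1] == " "
    space_after = i + 1 < len(text) and text[i + 1] == " "
    return not (space_before or space_after)

print(break_allowed_at_dash("word—word", 4))    # True  (English style)
print(break_allowed_at_dash("word —word", 5))   # False (Spanish style)
```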