Accumulated Feedback on PRI #509

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Fri Dec 06 21:31:48 CST 2024
ReportID: ID20241206213148
Name: Dennis Tan
Report Type: Public Review Issue
Opt Subject: 509


In section 3 (Link Detection Algorithm), subsection "Initiation", the document uses the following 
reference for "top-level domains": https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains. 
Perhaps it would be better to use a more authoritative source, one that is updated regularly, 
e.g., https://www.iana.org/domains/root/db. The Wikipedia page doesn't even list the IDN top-level domains.

Date/Time: Wed Dec 18 02:41:23 CST 2024
ReportID: ID20241218024123
Name: Hank Nussbacher
Report Type: Public Review Issue
Opt Subject: 509


In section 7 (https://www.unicode.org/reports/tr58/#test-data), might I suggest that you include 
test data for bidirectional content for linkification, such as Arabic or Hebrew?

Date/Time: Mon Jan 20 08:04:59 CST 2025
ReportID: ID20250120080459
Name: Arnt Gulbrandsen
Report Type: Public Review Issue
Opt Subject: 509


Hi,

I have compared the UTS58 draft with a few linkifiers.

One omission I saw is that another linkifier tolerates and ignores U+00AD (soft hyphen). The commit 
message is terser than terse, but hints that someone sends text of the form "foo example.com/foo/bar 
bar" with a soft hyphen after the full stop and/or slashes.

It's not clear to me that this is worth bothering with. Your call.

Date/Time: Mon Feb 10 11:17:13 CST 2025
ReportID: ID20250210111713
Name: Jules Bertholet
Report Type: Public Review Issue
Opt Subject: 509


In addition to being used ⸢like⸣ ⸤this⸥,
the half brackets, and possibly also the half parentheses,
can also be used ⸢like⸥ ⸤this⸣
(imitating the East Asian corner brackets).
The Link_Paired_Opener property should be array-valued to reflect this.
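
To make the request concrete, here is a minimal sketch (Python; the values reflect only the pairings described 
above and are not taken from any draft property data) of an array-valued Link_Paired_Opener lookup:

```python
# Sketch only: an array-valued Link_Paired_Opener lookup for the half
# brackets, covering both pairing conventions described above. The values
# are illustrative, not taken from any draft data file.
LINK_PAIRED_OPENER = {
    "\u2E23": ["\u2E22", "\u2E24"],  # ⸣ may close ⸢ (⸢like⸣) or ⸤ (⸤this⸣)
    "\u2E25": ["\u2E24", "\u2E22"],  # ⸥ may close ⸤ (⸤this⸥) or ⸢ (⸢like⸥)
}

def is_paired_opener(closer, opener):
    """True if `opener` is one of the acceptable openers for `closer`."""
    return opener in LINK_PAIRED_OPENER.get(closer, [])

assert is_paired_opener("\u2E25", "\u2E22")  # the corner-bracket-style pairing
```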

Date/Time: Wed Mar 26 05:59:34 CDT 2025
ReportID: ID20250326055934
Name: Ebrahim Byagowi
Report Type: Public Review Issue
Opt Subject: 509


I'd like to provide a quick drive-by comment about https://www.unicode.org/reports/tr58/tr58-2.html.

I think the lack of a standard recommendation on how URLs should actually be displayed has caused 
https://issuetracker.google.com/issues/40665886, essentially breaking Persian text in URL bars and causing 
misrendering of emoji skin tones that use ZWJ in Chrome, as described on the tracker.

The same situation exists in Safari, where URLs are mostly hidden; but if one tries to edit a URL containing 
ZWNJ, things go wrong: the already-encoded ZWNJ in the URL gets double-encoded, as in 
https://phabricator.wikimedia.org/F58924232. (Unfortunately this isn't always reproducible in Safari, but it 
is annoying enough, and it comes from the same root cause of ZWNJ being displayed by its code.)
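
For what it's worth, a minimal sketch (Python, standard library only; this only shows what the double encoding 
looks like, it is not a reproduction of the Safari behavior) of the failure mode described above:

```python
# Sketch only: re-encoding an already percent-encoded ZWNJ (U+200C)
# turns "%E2%80%8C" into "%25E2%2580%258C", i.e. the double encoding
# described above.
from urllib.parse import quote

zwnj = "\u200c"
encoded_once = quote(zwnj)           # '%E2%80%8C'
encoded_twice = quote(encoded_once)  # '%25E2%2580%258C'
print(encoded_once, encoded_twice)
```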

I understand that you may consider these browser bugs, but after seeing 
https://www.unicode.org/reports/tr58/tr58-2.html and the lengthy discussion I had in Chromium's bug tracker 
(https://issuetracker.google.com/issues/40665886), which helped the developers understand what is going on, 
I felt that if there were some official recommendation, things could go more smoothly.

Date/Time: Fri Mar 28 11:53:15 CDT 2025
ReportID: ID20250328115315
Name: cketti
Report Type: Public Review Issue
Opt Subject: 509


Step 4.7. of the termination algorithm currently reads "If LT == Open", but it should be "If LT == Close". (Step 4.6. handles "LT == Open")

Date/Time: Mon Apr 07 06:31:16 CDT 2025
ReportID: ID20250407063116
Name: Arnt Gulbrandsen
Report Type: Public Review Issue
Opt Subject: 509


Hi,

I ran across a bug today that I think points out a relevant problem in UTS58: A user expected 普遍适用测试。我爱你 to be linkified as
<a href="https://普遍适用测试.我爱你">普遍适用测试。我爱你</a> (note the changing dot).

Chrome and some other web browsers map "。" to "." in domains when you hit enter after typing/pasting into the address bar. I do feel 
that at least U+06D4 and U+3002 ought to be mapped to the ASCII dot in UTS58 since it's such a common mistake. ("。" and "." are even 
on the same key on the Chinese keyboards I've seen.)

I mention U+06D4 and U+3002 because I've seen those mistakenly used in "domain names" in the course of my work. U+FF61 and others 
might also be used mistakenly in theory, but I haven't seen that.
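
A minimal sketch (Python; the character set is only the code points mentioned in this report, and the folding 
itself is the suggestion here, not something the draft currently specifies) of mapping these dot-like characters 
to the ASCII full stop before detecting a domain:

```python
# Sketch only: fold the dot-like characters mentioned above to U+002E
# before splitting a candidate host into labels. The set of characters is
# just what this report mentions, not an exhaustive or normative list.
DOT_LIKE = {
    "\u06d4": ".",  # ARABIC FULL STOP
    "\u3002": ".",  # IDEOGRAPHIC FULL STOP
    "\uff61": ".",  # HALFWIDTH IDEOGRAPHIC FULL STOP
}

def fold_dots(candidate):
    return candidate.translate(str.maketrans(DOT_LIKE))

print(fold_dots("普遍适用测试。我爱你"))  # 普遍适用测试.我爱你
```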


Feedback above this line has already been reviewed during UTC #183 in April, 2025.

Date/Time: Mon Sep 15 14:07:47 PST 2025
ReportID: ID20250915140747
Name: Peter Constable
Report Type: Public Review Issue
Opt Subject: PRI 509: suggestions for Summary

Editorial suggestion for the intro summary of DUTS 58:

---

URLs processed in communication protocols are parsed in conformance with particular protocol specifications. When URLs appear in 
text content, however, the character sequences are not always intended to be read in exactly the same way they would be parsed in 
a protocol. Some characters that are often used as sentence-level punctuation in text can also be valid characters within a URL. 
Software that applies the protocol rules when parsing URLs in text content often produces the wrong results.

When a URL is inserted into text, percent encoding can be used to avoid the above ambiguity, though often this is not done. Also, 
when a URL that includes non-ASCII characters is inserted into text, implementations often over-use percent encoding for those 
characters, resulting in a URL that is illegible for a human reader. Thus, percent encoding is often both underused 
and overused, leading to less beneficial results.

This document specifies...

Date/Time: Mon Sep 15 17:27:18 PST 2025
ReportID: ID20250915172718
Name: Peter Constable
Report Type: Public Review Issue
Opt Subject: PRI 509: paired punctuation within path, etc.

In the Link_Termination Property section, the description for Close mentions "subparts" within a path, query or fragment:

"If the character is paired with a previous character in the same Part (path, query, fragment) and in the same subpart 
(that is, not across interior '/' in a path, or across '&' or '=' in a query, it is treated as Include."

(It might also have mentioned "/" and "?" within query and fragment but doesn't.)

Two issues:

First, it seems to make sense that counterpart open/close punctuation is unlikely to be used with an intent of being paired 
across segment boundaries within a path. So, not searching for a pair across a "/" boundary within a path seems to make sense. 
However, it seems less obvious that the same can be said for query or fragment elements of a URL. For example, it seems 
conceivable that a "(" ... ")" pair might surround a sequence of key/value pairs that comprise a logical grouping. E.g.,

...(k1=a&k2=b)...

I have no idea if this kind of pairing is done in practice, so perhaps this isn't more than a remote, hypothetical possibility. 
But I do know it is permissible in RFC 3986.

But -- the second point -- the algorithm, as written, does not include anything to recognize such "subpart" segments or to 
incorporate awareness of such subparts into the logic.

Thus, unless there is an intention to elaborate the algorithm in some way to support these subparts, it doesn't make sense 
to mention them at all.
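
For reference, honoring the "subpart" wording would require something like the following hypothetical check 
(Python; the function and the split rules are invented for illustration, since the current algorithm defines 
no such step):

```python
# Hypothetical sketch: decide whether an opener and a closer in a query
# fall within the same "subpart", i.e. are not separated by '&' or '='.
# The draft algorithm currently defines no such step.
def same_query_subpart(query, open_pos, close_pos):
    between = query[min(open_pos, close_pos) + 1:max(open_pos, close_pos)]
    return "&" not in between and "=" not in between

query = "(k1=a&k2=b)"
# '(' at 0 and ')' at 10 are separated by '&' and '=', so under a strict
# subpart rule they would not be treated as a pair:
print(same_query_subpart(query, 0, 10))  # False
```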

Date/Time: Mon Sep 15 18:15:59 PST 2025
ReportID: ID20250915181559
Name: Peter Constable
Report Type: Public Review Issue
Opt Subject: PRI 509: pairing across path/query/frag boundaries

Given potential ambiguity of closers like ")" at the end of a candidate URL sequence, it makes sense not to attempt to infer 
a pairing across the boundaries between the top-level elements of a URL: e.g., an opener in a path paired with a closer in the 
query. The algorithm, as it is intended to be implemented, ensures this by having the steps within the Link-Detection Algorithm 
section applied separately, and in turn, to the path, query, and fragment elements.

However, there is potential for implementations to overlook that detail and to apply the steps in that section to a sequence 
spanning path + query + fragment. The intent for the steps to be applied to those elements separately is not stated in the 
Link-Detection Algorithm section, so if an implementer hasn't read and paid adequate attention to earlier sections, they might 
overlook that very important detail.

Partially related is that the terms "part" and "Part" are used inconsistently throughout the document. So, for instance, use of 
"part" in section 3, "Parts of a URL", includes references to protocol and port elements.

The first mention of this crucial Part-at-a-time logic is in passing, in the second paragraph of the Termination section:

"The key is to be able to determine, given a Part (such as query)..."

The next is buried in the description of Close:

"If the character is paired with a previous character *in the same Part (path query fragment)* ..."

Of course, it is the handling of potential open/close pairs for which the Part-at-a-time logic matters the most. But the doc 
should call this out more clearly.

The clearest statements come in the Termination Algorithm section:

"This algorithm then processes each final Part [path, query, fragment] of the URL in turn. ... A Link_Termination=Close character... 
that does **not** have a matching Open character *in the same Part* of the URL."

If we assume implementers read and pay attention to these sections, that *should* be adequate. Even so, to be most explicit, it makes 
sense to call out this detail directly within the Link-Detection Algorithm section. The following is a suggestion:

## Link-Detection Algorithm

The following steps are performed, logically, over path, query and fragment elements of a URL separately. They are applied first 
over the path, if present. If no termination is detected within the path, the steps are repeated for the query, if present, with 
variables reset. If no termination is detected within the query, the steps are repeated for the fragment, if present. Crucially, 
the openStack must be cleared on the transitions from path to query and to fragment.

In the following:

* Part refers to one of {path, query, fragment}.

...
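
To make the Part-at-a-time intent concrete, here is a minimal sketch (Python; the pairing data and the per-Part 
scan are drastically simplified stand-ins for the draft's actual steps):

```python
# Sketch only: apply the per-Part steps to the path, then the query, then
# the fragment, clearing the open stack on each transition, as the suggested
# wording above describes.
OPENERS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = {v: k for k, v in OPENERS.items()}  # closer -> expected opener

def process_part(part, open_stack):
    """Return the index of an unmatched closer in this Part, else None."""
    for i, ch in enumerate(part):
        if ch in OPENERS:
            open_stack.append(ch)
        elif ch in CLOSERS:
            if open_stack and open_stack[-1] == CLOSERS[ch]:
                open_stack.pop()
            else:
                return i              # unmatched Close: terminate here
    return None

def detect_termination(path, query, fragment):
    for name, part in (("path", path), ("query", query), ("fragment", fragment)):
        if part is None:
            continue
        open_stack = []               # crucially, cleared for every Part
        end = process_part(part, open_stack)
        if end is not None:
            return name, end
    return None

# An opener left open in the path must not pair with a closer in the query:
print(detect_termination("/wiki/(disambiguation", "x=1)", None))  # ('query', 3)
```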

====

As noted earlier, usage of "Part" and "part" is not consistent. E.g.,

- one of {path, query, fragment} — the intended meaning

- other main URL elements: e.g., protocol, host, port (top of section 3); also in section 4, "Process each Part up to the Path, Query, 
and Fragment in the normal fashion..."

- individual characters: e.g., "... when a trailing period should be counted as part of a link or not."

- positions within the URL: e.g., "... the last location in the current Part that is still safely considered part of the link."

- a span of URL characters: e.g., "... vs. URLs that contain a part that is enclosed in parens, etc."


Also inconsistent is casing when referring to path, query and fragment (some instances are capitalized), and in references to the three 
elements as a set: "... Part (path, query, fragment)" vs. "... Part [path, query, fragment]".