Accumulated Feedback on PRI #208

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

 


Feedback on previous drafts of the review document is listed below

Date/Time: Tue Nov 22 18:20:36 CST 2011
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Public Review Issue
Opt Subject: UTR#36 (Unicode security) 3.7.1 (PEP 383 Approach) error


The UTR#36 document (for lossless conversion to Unicode of other
encodings) says that PEP 383 uses the code points <0xD800 + byte
value> for any unmappable byte of a source encoding to map them to low
surrogates.

However PEP 383 actually uses (for its "surrogateescape" error handler)
the code points <0xDC00 + byte value>, i.e. low surrogates. The
advantage is that they are easier to detect when converting back to the
original encoding: when the generated Unicode string uses 16-bit code
units, there is no need to look forward in the string to see whether
the code unit is followed by a low surrogate that completes a valid
non-BMP character.
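
As an illustration of the mapping discussed above, here is a minimal sketch using Python's "surrogateescape" error handler (the name PEP 383 gives it). The sample bytes are arbitrary and chosen only because they are invalid in UTF-8.

```python
# Sketch: decode bytes that are invalid in UTF-8 with the PEP 383 handler.
raw = b"abc\xff\x80"  # arbitrary sample; 0xFF and 0x80 are invalid here
text = raw.decode("utf-8", errors="surrogateescape")

# Each undecodable byte b is represented by the code point 0xDC00 + b.
escaped = [ord(ch) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# Encoding with the same handler restores the original bytes.
round_trip = text.encode("utf-8", errors="surrogateescape")

print([hex(cp) for cp in escaped])
print(round_trip == raw)
```

The escaped code points fall in 0xDC80..0xDCFF, which lies in the low (trailing) surrogate range.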

In its current implementation however, not all unmapped bytes are
converted like this: if the source encoding is not based on ASCII
(which is always convertible to Unicode), the current Python
implementation of PEP 383 raises exceptions rather than converting
bytes in 0x00..0x7F to 0xDC00..0xDC7F, although the PEP 383
approach itself does not require this restriction.
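
The behaviour described above can be observed with a small sketch. It assumes CPython's codecs, where the "surrogateescape" handler declines to escape bytes below 0x80; the helper name `try_decode` is hypothetical, and the one-byte UTF-16 input is just a convenient way to hit a decoding error on an ASCII-range byte.

```python
def try_decode(data: bytes, encoding: str) -> str:
    """Decode with surrogateescape; report an exception as a marker string."""
    try:
        return data.decode(encoding, errors="surrogateescape")
    except UnicodeDecodeError:
        return "UnicodeDecodeError"

# A byte >= 0x80 that the ASCII codec cannot decode is escaped to U+DC80.
high = try_decode(b"\x80", "ascii")

# A truncated UTF-16 code unit whose only byte is < 0x80 is not escaped;
# the original decoding error is raised instead.
low = try_decode(b"\x41", "utf-16-le")

print(repr(high), low)
```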

The PEP 383 approach is usable independently of the size of the code
units through which the code points are represented, including when the
Unicode string uses 8-bit code units (i.e. it is still a valid
Unicode string at the code point level, but it is not valid
UTF-8).
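
To make the 8-bit case concrete, the sketch below serialises one escaped code point with Python's "surrogatepass" error handler, which emits the three-byte pattern for a lone surrogate. Using "surrogatepass" here is an assumption made for illustration; it is not something PEP 383 itself prescribes.

```python
escaped = "\udc80"  # the code point representing the unmapped byte 0x80

# Strict UTF-8 refuses lone surrogates, so the result is not valid UTF-8.
try:
    escaped.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "surrogatepass" writes the surrogate using the generic 3-byte pattern.
three_bytes = escaped.encode("utf-8", errors="surrogatepass")

print(strict_ok, three_bytes.hex(), len(three_bytes))
```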

But in this case, it would generate 3 bytes in the 8-bit Unicode
string for each unmapped byte of the original encoding, and a more
efficient but similar approach could just as well map them to two bytes:

- <0xC0, 0x80 + (byte & 0x3F)> to represent the specially mapped isolated surrogate
code points 0xDC00..0xDC3F, which themselves represent the unmapped source bytes 0x00..0x3F;

- <0xC1, 0x80 + (byte & 0x3F)> to represent the specially mapped isolated surrogate
code points 0xDC40..0xDC7F, which themselves represent the unmapped source bytes 0x40..0x7F;

- <0xC2, 0x80 + (byte & 0x3F)> to represent the specially mapped isolated surrogate
code points 0xDC80..0xDCBF, which themselves represent the unmapped source bytes 0x80..0xBF;

- <0xC3, 0x80 + (byte & 0x3F)> to represent the specially mapped isolated surrogate
code points 0xDCC0..0xDCFF, which themselves represent the unmapped source bytes 0xC0..0xFF;
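
The arithmetic of the four cases above can be sketched as follows. The function names `pack_escaped_byte`, `unpack_escaped_byte` and `escaped_code_point` are hypothetical, and this is only an illustration of the proposed two-byte form, not part of any existing implementation.

```python
def escaped_code_point(byte: int) -> int:
    """Code point that stands for an unmapped source byte (0xDC00 + byte)."""
    return 0xDC00 + byte

def pack_escaped_byte(byte: int) -> bytes:
    """Two-byte form: lead 0xC0..0xC3 from the top 2 bits, trail from the low 6."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte value")
    return bytes([0xC0 | (byte >> 6), 0x80 | (byte & 0x3F)])

def unpack_escaped_byte(pair: bytes) -> int:
    """Recover the original source byte from the two-byte form."""
    lead, trail = pair
    if not (0xC0 <= lead <= 0xC3 and 0x80 <= trail <= 0xBF):
        raise ValueError("not a two-byte escaped form")
    return ((lead & 0x03) << 6) | (trail & 0x3F)

for b in (0x00, 0x41, 0x80, 0xFF):
    print(hex(b), hex(escaped_code_point(b)), pack_escaped_byte(b).hex())
```

Note that the lead bytes 0xC2 and 0xC3 followed by a trail byte are also the standard UTF-8 forms of U+0080..U+00FF, so the sketch shows only the byte arithmetic and does not try to tell the two uses apart.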

Nothing would change for PEP 383 if the generated Unicode string uses 16-bit or
32-bit code units. In all cases, the Unicode string will still enumerate the same
number and values of code units at the Python programmatic level.

(This approach is similar to the approach used in Java for
representing the NULL codepoint as <0xC0, 0x80> to allow lossless
representation of valid Unicode strings, which will be internally
represented as 16-bit code units at the programmatic Java level, but
as 8-bit code units at the legacy JNI 8-bit interface or in network
serialisations and for strings in compiled Java classes recognized by
the class loader).
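
For comparison, the NULL rule mentioned above can be sketched in a few lines. The function names are hypothetical, and the sketch covers only the <0xC0, 0x80> substitution; it does not handle the other ways Java's modified UTF-8 differs from standard UTF-8 (such as its treatment of non-BMP characters).

```python
def encode_nul_as_two_bytes(text: str) -> bytes:
    """Encode as UTF-8, then write each U+0000 as <0xC0, 0x80>."""
    # In UTF-8 the byte 0x00 only ever encodes U+0000, so a byte-level
    # replacement cannot touch any other character.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")

def decode_nul_from_two_bytes(data: bytes) -> str:
    """Reverse the substitution, then decode as ordinary UTF-8."""
    # The byte 0xC0 never occurs in valid UTF-8, so the pair is unambiguous.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")

sample = "a\x00b"
packed = encode_nul_as_two_bytes(sample)
print(packed, decode_nul_from_two_bytes(packed) == sample)
```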


Date/Time: Wed Feb 22 21:55:41 CST 2012
Contact: jamadagni@gmail.com
Name: Shriramana Sharma
Report Type: Other Question, Problem, or Feedback
Opt Subject: Telugu confusables

Note: This was already sent to the editorial committee.


I notice in the latest meeting minutes:

A.5.2 Action item review.

[130-A1] Action Item for Lisa Moore: Follow up with Andhra Pradesh
on action 125-A17.

[130-A2] Action Item for Eric Muller: Take info for Indic TR and turn
into a document for the doc register.

Where 125-A17 is:

South Asian Subcommittee — TELUGU LENGTH MARK (D.3.1)

[125-A17] Action Item for Manoj Jain: Work with Andhra Pradesh Gov't to
determine what additional clarifications and annotations may be required
for the Telugu script. L2/10-339

[125-A18] Action Item for Eric Muller, Julie Allen, Editorial Committee:
Look for cases to be added to the confusable vowel representation tables
in the Indic chapter(s) for Unicode 6.0. Look at document L2/10-339 Telugu,
and other cases where documentation could be improved.

Since I was the one who submitted the document L2/10-339 requesting
deprecation of Telugu Length Mark, let me just give the list of confusables
I had in mind:

VS-II ీ = VS-I ి + LM ౕ
VS-EE ే = VS-E ె + LM ౕ
VS-OO ో = VS-O ొ + LM ౕ
HA హ + VS-AA ా -> HAA హా = HA హ + LM ౕ

(VS = vowel sign; LM = length mark)
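
As a rough check of the pairs listed above, the following sketch compares each pair after NFC normalization using Python's `unicodedata` module. The code point assignments are my reading of the characters in the list, and the helper name `unified_by_normalization` is hypothetical. The check says nothing about rendering, which depends on the font.

```python
import unicodedata

LENGTH_MARK = "\u0c55"  # TELUGU LENGTH MARK

# Left: the usual spelling. Right: the look-alike built with the length mark.
PAIRS = {
    "\u0c40": "\u0c3f" + LENGTH_MARK,        # VS-II  vs VS-I + LM
    "\u0c47": "\u0c46" + LENGTH_MARK,        # VS-EE  vs VS-E + LM
    "\u0c4b": "\u0c4a" + LENGTH_MARK,        # VS-OO  vs VS-O + LM
    "\u0c39\u0c3e": "\u0c39" + LENGTH_MARK,  # HA + VS-AA vs HA + LM
}

def unified_by_normalization(a: str, b: str, form: str = "NFC") -> bool:
    """True if both strings normalize to the same code point sequence."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

results = {usual: unified_by_normalization(usual, alt)
           for usual, alt in PAIRS.items()}

for usual, same in results.items():
    names = [unicodedata.name(ch) for ch in usual]
    print(names, "unified by NFC:", same)
```

If normalization does not unify a pair, the two spellings remain distinct strings for comparison and searching even if they look alike.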

The people with the Action Item can incorporate this into what they write.

[Submitted via the form as per offlist suggestion of Markus Scherer to
ensure it doesn't get forgotten.]