This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.
Date/Time: Tue Nov 22 18:20:36 CST 2011
Name: Philippe Verdy
Report Type: Public Review Issue
Opt Subject: UTR#36 (Unicode security) 3.7.1 (PEP 383 Approach) error
The UTR#36 document (for lossless conversion to Unicode of other encodings) says that PEP 383 uses the code points <0xD800 + byte value> for any unmappable byte of a source encoding to map them to low surrogates. However, PEP 383 actually uses (for its "surrogateescape" error handler) the code points <0xDC00 + byte value>, i.e. low (trail) surrogates. This has the advantage that they are easier to detect when converting back to the original encoding: when the generated Unicode string uses 16-bit code units, there is no need to look forward in the string to see whether the code unit is followed by a low surrogate and so represents a valid non-BMP character.
In its current implementation, however, not all unmapped bytes are converted like this: if the source encoding is not based on ASCII (which is always convertible to Unicode), the current Python implementation of PEP 383 raises exceptions rather than converting the bytes 0x00..0x7F to 0xDC00..0xDC7F. In fact, the PEP 383 approach is not required to do this.
The PEP 383 approach is usable independently of the size of the code units through which the code points are represented, including when the Unicode string uses 8-bit code units (i.e. this is still a valid Unicode string at the code point level, but it is not valid UTF-8).
But in this case, it would generate 3 bytes in the 8-bit Unicode string for each unmapped byte of the original encoding. A more efficient but similar approach could instead map them to two bytes:
- <0xC0, 0x80 + (byte & 0x3F)> to represent the specially mapped isolated surrogate code points 0xDC00..0xDC3F, which themselves represent the unmapped source bytes 0x00..0x3F;
- <0xC1, 0x80 + (byte & 0x3F)> to represent the specially mapped isolated surrogate code points 0xDC40..0xDC7F, which themselves represent the unmapped source bytes 0x40..0x7F;
- <0xC2, 0x80 + (byte & 0x3F)> to represent the specially mapped isolated surrogate code points 0xDC80..0xDCBF, which themselves represent the unmapped source bytes 0x80..0xBF;
- <0xC3, 0x80 + (byte & 0x3F)> to represent the specially mapped isolated surrogate code points 0xDCC0..0xDCFF, which themselves represent the unmapped source bytes 0xC0..0xFF.
Nothing would change for PEP 383 if the generated Unicode string uses 16-bit or 32-bit code units. In all cases, the Unicode string will still enumerate the same number and values of code units at the Python programmatic level.
(This approach is similar to the one used in Java, which represents the NULL code point as <0xC0, 0x80> to allow lossless representation of valid Unicode strings. These are internally represented as 16-bit code units at the programmatic Java level, but as 8-bit code units at the legacy JNI 8-bit interface, in network serialisations, and for strings in compiled Java classes recognized by the class loader.)
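The mapping described in this report can be sketched in Python as follows. This is only an illustration, not part of PEP 383 or of any Python API: the helper names compact_escape and compact_unescape are hypothetical, and the sketch covers a single escaped byte rather than a full codec. The first part uses the real "surrogateescape" and "surrogatepass" error handlers to show the current behaviour and the generic 3-byte form.

```python
# Illustrative sketch. compact_escape and compact_unescape are hypothetical
# helper names; they are not part of Python or of PEP 383.

def compact_escape(byte: int) -> bytes:
    """Two-byte form proposed in the report for one unmappable source byte."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte value (0x00..0xFF)")
    # Lead byte 0xC0..0xC3 carries the top two bits, trail byte the low six.
    return bytes([0xC0 | (byte >> 6), 0x80 | (byte & 0x3F)])


def compact_unescape(pair: bytes) -> int:
    """Recover the original source byte from the proposed two-byte form."""
    if len(pair) != 2:
        raise ValueError("expected exactly two bytes")
    lead, trail = pair
    if not (0xC0 <= lead <= 0xC3 and 0x80 <= trail <= 0xBF):
        raise ValueError("not a compact escape sequence")
    return ((lead & 0x03) << 6) | (trail & 0x3F)


# Existing behaviour: the undecodable byte 0xFF becomes U+DCFF (0xDC00 + 0xFF).
escaped = b"abc\xff".decode("ascii", errors="surrogateescape")
restored = escaped.encode("ascii", errors="surrogateescape")

# Generic 3-byte form of that lone surrogate when written as 8-bit code units.
three_bytes = escaped[-1].encode("utf-8", errors="surrogatepass")

# Proposed 2-byte form for the same source byte.
two_bytes = compact_escape(0xFF)
```

One point the sketch does not address: the sequences with lead bytes 0xC2 and 0xC3 are also the well-formed UTF-8 encodings of U+0080..U+00FF, so a real decoder would need some other way to tell the two uses apart. Only the 0xC0 and 0xC1 lead bytes never occur in well-formed UTF-8.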
Date/Time: Wed Feb 22 21:55:41 CST 2012
Name: Shriramana Sharma
Report Type: Other Question, Problem, or Feedback
Opt Subject: Telugu confusables
Note: This was already sent to the editorial committee.
I notice in the latest meeting minutes:
A.5.2 Action item review.
[130-A1] Action Item for Lisa Moore: Follow up with Andhra Pradesh on action 125-A17.
[130-A2] Action Item for Eric Muller: Take info for Indic TR and turn into a document for the doc register.
Where 125-A17 is:
South Asian Subcommittee — TELUGU LENGTH MARK (D.3.1)
[125-A17] Action Item for Manoj Jain: Work with Andhra Pradesh Gov't to determine what additional clarifications and annotations may be required for the Telugu script. L2/10-339
[125-A18] Action Item for Eric Muller, Julie Allen, Editorial Committee: Look for cases to be added to the confusable vowel representation tables in the Indic chapter(s) for Unicode 6.0. Look at document L2/10-339 Telugu, and other cases where documentation could be improved.
Since I was the one who submitted the document L2/10-339 requesting deprecation of Telugu Length Mark, let me just give the list of confusables I had in mind.
VS-II ీ = VS-I ి + LM ౕ
VS-EE ే = VS-E ె + LM ౕ
VS-OO ో = VS-O ొ + LM ౕ
HA హ VS-AA ా -> HAA హా = HA హ LM ౕ
(VS = vowel sign; LM = length mark)
The people with the Action Item can incorporate this into what they write.
[Submitted via the form as per offlist suggestion of Markus Scherer to ensure it doesn't get forgotten.]
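For readers who want to inspect these sequences at the code point level, here is a minimal Python sketch. The code points are read off the glyphs in the list above, and the names PAIRS, describe and unified_by_nfc are hypothetical, chosen only for this illustration. It checks whether NFC normalization treats each pair as the same string, which gives some indication of why such pairs are of interest as confusables.

```python
import unicodedata

LENGTH_MARK = "\u0C55"  # TELUGU LENGTH MARK (LM in the list above)

# (single-sign spelling, spelling that uses the length mark)
PAIRS = [
    ("\u0C40", "\u0C3F" + LENGTH_MARK),        # VS-II vs VS-I + LM
    ("\u0C47", "\u0C46" + LENGTH_MARK),        # VS-EE vs VS-E + LM
    ("\u0C4B", "\u0C4A" + LENGTH_MARK),        # VS-OO vs VS-O + LM
    ("\u0C39\u0C3E", "\u0C39" + LENGTH_MARK),  # HA + VS-AA vs HA + LM
]


def describe(text: str) -> str:
    """List the code points and character names in a string."""
    return " + ".join(f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text)


def unified_by_nfc(a: str, b: str) -> bool:
    """True if NFC normalization maps both spellings to the same string."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


for single, with_mark in PAIRS:
    print(describe(single))
    print(describe(with_mark))
    print("same after NFC:", unified_by_nfc(single, with_mark))
```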