[idn] Re: IDN WG Last Call on two major changes to Stringprep
"Mark Davis" <mark@macchiato.com> writes:
> There are two issues. (a) rationale, (b) NFKC interaction.
>
> For the rationale, I include at the bottom an old email.
Thanks.
(If people want to look at more discussions, I found an archive at
<URL:http://www.imc.org/ietf-bidi/entire-arch.txt>.)
> It proposed a somewhat more complicated solution than eventually
> appeared in the text; I think what is in StringPrep is better (just
> for simplicity).
Somehow I wish this was discussed in the specification, if people are
supposed to understand it without digging in mailing list archives.
The fact that bidi is difficult to understand (as are normalization and
other IDN topics) isn't an excuse to be sparse with details, rationale
and examples; on the contrary.
> For the NFKC interaction, you bring up a good point. The conditions on
> the string, if the goal is to always have a consistent appearance
> (both on keyboard entry, and when displaying a name fetched from the
> server) should be in effect for the string both before and after
> StringPrep. In practice, it only has any effect in a few isolated
> cases, since there are very few characters whose BIDI class changes.
What could a solution be? Perform the bidi step before NFKC too? If
the intention is to not ignore the issue, that is.
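
To make the NFKC interaction concrete, here is a small Python sketch
using only the standard unicodedata module. The helper name
bidi_classes_changed is my own invention, not something from the draft.
It normalizes each character on its own, which is a simplification
(NFKC over a whole string can also compose across characters), and it
uses whatever Unicode version the Python build ships rather than the
version Stringprep references, so treat it as an illustration only.

```python
import unicodedata

def bidi_classes_changed(s):
    """Return (char, class_before, classes_after) for every character whose
    bidi class differs once that character is NFKC-normalized on its own.
    Hypothetical helper for illustration; not part of any draft."""
    changed = []
    for ch in s:
        before = unicodedata.bidirectional(ch)
        after = [unicodedata.bidirectional(c)
                 for c in unicodedata.normalize("NFKC", ch)]
        if after != [before]:
            changed.append((ch, before, after))
    return changed

# U+2135 ALEF SYMBOL has bidi class L, but NFKC maps it to U+05D0 (class R).
print(bidi_classes_changed("\u2135"))
```

Running such a check over the input before NFKC would at least show
which characters are affected by the ordering of the two steps.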
>
>
> Mark
>
> ----- Original Message -----
> From: "Mark Davis" <mark@macchiato.com>
> To: <ietf-bidi@imc.org>; "Paul Hoffman / IMC" <phoffman@imc.org>
> Sent: Saturday, September 15, 2001 19:03
> Subject: Re: First attempt at problem statement
>
>
> BIDI IDN
>
> I will try to recap some of the discussions in the ad hoc on BIDI IDN
> last week.
>
> The BIDI algorithm is designed to deal with normal text. Within any
> string, sequences of LTR (left to right) and RTL characters will
> always appear in the correct order. However, the order of other
> characters (such as a period) will depend on their context. For
> details, see http://www.unicode.org/reports/tr9/.
>
> URLs are not normal text, and thus may have odd display. This is
> complicated by the fact that the overall paragraph direction has an
> effect on the display. Whether the URL is displayed in a RTL or LTR
> context will change the order of the components. [In this and other
> examples we will use the convention that uppercase letters stand for
> right-to-left characters: Arabic, Hebrew, etc.] Example:
>
> Memory: http://SOME.LARGE.mixed.CORP.org
> Display (LTR): http://EGRAL.EMOS.mixed.PROC.org
> Display (RTL): org.PROC.mixed.EGRAL.EMOS//:http
>
> Notice that "SOME.LARGE" always appears in RTL order: the period adopts
> the order of the surrounding characters. Characters on boundaries
> (such as "//:") take on the overall display direction.
>
> For example:
>
> (1) characters in different fields may mix across fields:
>
> Memory: http://SOME.veryLARGE.CORP.org
> Display (LTR): http://EMOS.veryPROC.EGRAL.org
> Display (RTL): org.PROC.EGRALvery.EMOS//:http
>
>
> (2) two different sequences of characters in the same field can have
> the same order when displayed. Thus a user would not know how to type
> a URL that he sees printed.
>
> Memory1: http://123CORP.org
> Memory2: http://CORP123.org
> Display (LTR): http://123PROC.org
>
>
> The following are proposed requirements.
>
> 1. Consistent fields.
>
> (a) If you have fields such as http://XXX.YYY.ZZZ, characters from the
> same field should not be displayed in different fields and vice versa.
> (b) It should be possible to deduce the order of the backing-store
> fields from the order of the display fields.
>
> 2. Order within fields.
>
> Within a field, it should be possible to deduce the order of the
> backing-store characters from the order of the displayed characters.
>
> 3. No algorithm change
>
> This should require no change to BIDI algorithm. Using a separate
> algorithm for display of URLs would be difficult, since they are found
> within flowing text. Getting people to update the BIDI algorithm would
> also be quite difficult (and changes to make URLs work might have
> repercussions on other text).
>
> 4. Simplicity
>
> Whatever solution we have should have a simple algorithm (according to
> Paul).
>
> Ideally, some reasonable restrictions on the contents of a field would
> meet all of these requirements.
>
>
> * * *
>
> During the meeting, I thought that the most straightforward method was
> to force the periods to be LTR. However, after looking at the results
> in both a RTL and LTR context, I concluded that Mati's approach would
> be better. The results both in terms of field order and order within a
> field would still be determinate. With any complete URL (with http://,
> ftp://, etc.) it would be easy to recognize the order in context,
> since the position of those initial letters would show the order. The
> only bad case would be where the end of the string reversed because of
> its surroundings, e.g.:
>
> Memory: the url http://SOME.LARGE.mixed.CORP.org is the one
> Display (LTR): the url http://EGRAL.EMOS.mixed.PROC.org is the one
> Display (RTL): org is the one.PROC.mixed.EGRAL.EMOS//: the url http
>
> Thus in flowing text, users would be recommended to bracket any BIDI
> URL with RLM (or embed the URL). E.g.
>
> Memory: the url <RLM>http://SOME.LARGE.mixed.CORP.org<RLM> is
> the one
> Display (LTR): the url http://EGRAL.EMOS.mixed.PROC.org is the one
> Display (RTL): is the one org.PROC.mixed.EGRAL.EMOS//:http the url
>
> Software, such as email clients, that recognizes URLs could do this
> automatically.
>
>
> Here is what I have now for an explicit algorithm to be added as a
> step to NamePrep, after the regular Prohibition step.
>
> A. Characters are classified into RTL, LTR, DIG, OTH.
>
> These categories are drawn from the BIDI algorithm. The precise lists
> of characters in each category would be added to NamePrep as an
> appendix. The composition is as follows (See
> http://www.unicode.org/reports/tr9/#Bidirectional_Character_Types).
>
> LTR := L ; # including LRM
>
> RTL := R | AL ;
>
> DIG := EN | AN ;
>
> OTH := all other characters: NSM, ON, etc.
>
> Note: The characters in categories LRM, RLM, LRO, RLO, LRE, RLE, PDF,
> B, S, and some other BIDI categories are prohibited anyway.
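
The classification in step A can be sketched in Python with the
standard unicodedata module. The function name classify is an
assumption of mine, and the module reports bidi classes from the
Unicode version bundled with Python, not from a fixed appendix as the
text proposes, so this is only an approximation of the intended lists.

```python
import unicodedata

def classify(ch):
    """Map one character to the coarse categories of step A.
    Hypothetical helper; the real lists would come from a NamePrep appendix."""
    bidi = unicodedata.bidirectional(ch)
    if bidi == "L":
        return "LTR"
    if bidi in ("R", "AL"):
        return "RTL"
    if bidi in ("EN", "AN"):
        return "DIG"
    return "OTH"  # NSM, ON, ES, CS, unassigned, and so on

for ch in ["a", "\u05d0", "\u0627", "1", "\u0661", "-"]:
    print("U+%04X" % ord(ch), classify(ch))
```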
>
>
> B. In any field that contains any RTL characters:
> B0. no LTR characters can occur.
> B1. a sequence of characters of type DIG can only occur at the end.
> B2. a sequence of characters of type OTH can occur only between
> characters of type RTL.
>
>
> The following is an example of an algorithm that implements (B). EOS
> stands for "end of string".
>
> 1. Let S be B (the begin state, row 0 of the table)
> 2. Get the next character, then get its type T (EOS if none remain)
> 3. Let S be Map[S, T]
> 4. If S is one of the terminal table entries T or F, exit with OK or
>    FAIL respectively
> 5. Goto Step 2
>
> Map is defined by the following table:
>
>    T   LTR  RTL  DIG  OTH  EOS
>  S   +------------------------
>  B   |  L,   R,   L,   L,   F    // begin
>  L   |  L,   F,   L,   L,   T    // left
>  R   |  F,   R,   D,   O,   T    // right
>  O   |  F,   R,   F,   O,   F    // right + other
>  D   |  F,   F,   D,   F,   T    // right + digit
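
The table-driven check above can be written out as a short Python
sketch. The rows of MAP are copied from the table; the names MAP,
COLUMNS, char_type and field_ok are my own, and the character types
come from Python's unicodedata module rather than from a fixed list, so
this is an illustration under those assumptions and not a reference
implementation.

```python
import unicodedata

# Column order of the table: LTR, RTL, DIG, OTH, EOS.
COLUMNS = {"LTR": 0, "RTL": 1, "DIG": 2, "OTH": 3, "EOS": 4}

# Rows copied from the table; "T" means accept, "F" means reject.
MAP = {
    "B": ["L", "R", "L", "L", "F"],  # begin
    "L": ["L", "F", "L", "L", "T"],  # left
    "R": ["F", "R", "D", "O", "T"],  # right
    "O": ["F", "R", "F", "O", "F"],  # right + other
    "D": ["F", "F", "D", "F", "T"],  # right + digit
}

def char_type(ch):
    """Classify a character as LTR, RTL, DIG or OTH (hypothetical helper)."""
    bidi = unicodedata.bidirectional(ch)
    if bidi == "L":
        return "LTR"
    if bidi in ("R", "AL"):
        return "RTL"
    if bidi in ("EN", "AN"):
        return "DIG"
    return "OTH"

def field_ok(field):
    """Run the state machine over one field; True means the field passes."""
    state = "B"
    for ch in field:
        state = MAP[state][COLUMNS[char_type(ch)]]
        if state == "F":
            return False
    # End of string: only a "T" entry in the EOS column accepts.
    return MAP[state][COLUMNS["EOS"]] == "T"
```

With Hebrew letters standing in for the uppercase convention,
field_ok("\u05d0\u05d112") accepts (the CORP123 shape) while
field_ok("12\u05d0\u05d1") rejects (the 123CORP shape), which removes
the ambiguity in problem (2).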
>
>
> Mark
> __________
> http://www.macchiato.com
> ◄ “Eppur si muove” ►
>
> ----- Original Message -----
> From: "Simon Josefsson" <jas@extundo.com>
> To: <paul.hoffman@imc.org>; <Marc.Blanchet@viagenie.qc.ca>
> Cc: "IETF/IDN WG" <idn@ops.ietf.org>
> Sent: Friday, July 26, 2002 20:28
> Subject: [idn] Re: IDN WG Last Call on two major changes to Stringprep
>
>
>> Quoting the draft:
>>
>> ,----
>> | In any profile that specifies bidirectional character handling, all
>> | three of the following requirements MUST be met:
>> ...
>> | 2) If a string contains any Right-to-Left character (defined as
>> | belonging to Unicode bidirectional categories "R" and "AL"), the string
>> | MUST NOT contain any Left-to-Right character (defined as belonging to
>> | Unicode bidirectional category "L").
>> |
>> | 3) If a string contains any Right-to-Left character (as defined above),
>> | a Right-to-Left character MUST be the first character of the string, and
>> | a Right-to-Left character MUST be the last character of the string.
>> `----
>>
>> There is little rationale for the last two requirements. Without
>> knowing the rationale, it is difficult to understand how to implement
>> this, not to speak of understanding and evaluating the specification.
>>
>> It is not difficult to construct various strings that violate these
>> requirements, but seem like valid identifiers to me (e.g., U+05D0
>> U+0966, contemplate it being written by a mathematically inclined
>> writer in India). Why is U+05D0 a R/AL character but U+2135 not?
>> U+2135 is NFKC'd into U+05D0. It thus seems like the identifier is a
>> valid IDN if NFKC is not used, but if NFKC is used, it is not a valid
>> identifier. A bidi user thus seems to require NFKC not to be used in
>> order to have the bidi string accepted.
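
For anyone who wants to try requirements 2 and 3 on the example
strings, here is a rough Python sketch. It relies on the standard
stringprep module (in_table_d1 for R/AL, in_table_d2 for L) and on
unicodedata.ucd_3_2_0 for NFKC; the function name
violated_requirements is made up for this illustration, and the sketch
only covers the two quoted requirements, not the rest of a profile.

```python
import stringprep
import unicodedata

def violated_requirements(s):
    """Return which of the quoted requirements (2, 3) the string violates.
    Hypothetical helper covering only the two requirements quoted above."""
    has_randal = any(stringprep.in_table_d1(c) for c in s)
    has_l = any(stringprep.in_table_d2(c) for c in s)
    violations = []
    if has_randal and has_l:
        violations.append(2)
    if has_randal and not (stringprep.in_table_d1(s[0])
                           and stringprep.in_table_d1(s[-1])):
        violations.append(3)
    return violations

raw = "\u2135\u0966"  # ALEF SYMBOL followed by DEVANAGARI DIGIT ZERO
nfkc = unicodedata.ucd_3_2_0.normalize("NFKC", raw)

print(violated_requirements(raw))   # no R/AL character present
print(violated_requirements(nfkc))  # U+2135 has become U+05D0
```

The first call reports no violations and the second reports both, which
is the asymmetry described in the quoted message.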
>>
>>
>>