Re: [idn] case folding
At 10:42 11/06/00 , GIM Gyeongseog-KIM Kyongsok wrote:
>On Sun, 11 Jun 2000, RJ Atkinson wrote:
>
> > At 04:08 11/06/00 , GIM Gyeongseog-KIM Kyongsok wrote:
> > The above idea breaks other Romanised languages, such
> > as Vietnamese, so I think it's really not possible to adopt.
>
>i don't know much about vietnamese.
>could you please give one or two concrete examples?
I did previously on this IDN WG list. Here is another try,
this time with more background detail...
Background:
- Vietnamese only uses Romanised letters, and never uses
ideographic characters. Although spoken Vietnamese
has some cognates with Mandarin or Cantonese (due to
roughly 1000 years of Chinese imperialism, ask anyone
who is Vietnamese), it is distinctly a different language
from Chinese and never uses Chinese characters in its
written form. (e.g. the word "ma~" in Vietnamese is
pronounced the same as the Mandarin word for "horse",
but has the narrower meaning of "chess piece named horse",
whereas a different Vietnamese word is used for the
"animal horse".)
- Vietnamese uses 6 different diacritical tone marks
(roughly: accent acute ', accent grave `, horizontal
squiggle ~, vertical squiggle, dot underneath, no mark)
and at least one non-tonal diacritical vowel modifier
(circumflex). In the northern pronunciation, there are
6 distinct tones in regular spoken use, while in the south
the tones (horizontal squiggle ~, vertical squiggle) tend
to be blurred together in spoken form. The several Vietnamese
pronunciations are mutually intelligible, perhaps more so
than Yorkshire English and Southern US English are.
- Vietnamese also has some letters unique to Vietnamese
(e.g. there is a "D-" character which is different from,
yet very slightly similar to the Nordic letter Eth).
- All Vietnamese letters are recognisably Roman, not
in any way ideograms.
- The Vietnamese language has been normally written in this
Romanised (Quo^'c Ngu) form for more than 200 years and
it is this form that is used everywhere (signs, newspapers,
other places). If one goes back to perhaps the Roman year
1400, then Vietnamese were using an ideographic written form
derived from Chinese characters, but that was never widely
used outside the educated elite and has been "dead" for
more than 200 years.
One could imagine a URL:
http://www.d-o^ng.vn
and its capitalised equivalent:
http://WWW.D-O^NG.VN
It would be broken for users if those two URLs above did not
go to the same web page. If we only case-map [a-z, A-Z],
then we would not case-map "d-" with "D-" or case-map "o^"
with "O^", thus making the 2 fictional URLs above map to
different web pages.
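
To make the mismatch concrete, here is a minimal Python sketch. It is
only an illustration, not a proposed IDN algorithm: the function names
are hypothetical, and it assumes the two names are encoded in Unicode,
with "d-" as U+0111, "D-" as U+0110, "o^" as U+00F4 and "O^" as U+00D4.
It compares an ASCII-only case map against Unicode case folding.

```python
import unicodedata


def ascii_only_lower(name: str) -> str:
    """Lower-case only A-Z and leave every other character untouched."""
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in name
    )


def unicode_fold(name: str) -> str:
    """Normalise to NFC, then apply Unicode case folding (an assumed policy)."""
    return unicodedata.normalize("NFC", name).casefold()


# The two fictional names from the example above, written with escapes.
lower = "www.\u0111\u00f4ng.vn"  # www.d-o^ng.vn
upper = "WWW.\u0110\u00d4NG.VN"  # WWW.D-O^NG.VN

# ASCII-only mapping leaves D- and O^ in upper case, so the names differ.
print(ascii_only_lower(upper) == lower)            # False
# Folding every Romanised letter makes the two names compare equal.
print(unicode_fold(upper) == unicode_fold(lower))  # True
```

With the ASCII-only map, the capitalised name still contains the
upper-case "D-" and "O^", so a lookup would treat it as a different name.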
It is not reasonable to say that the content provider needs to
register all domain-names with the myriad case combinations
and manually map them to the same content, though I can see
that some would-be DNS registries would very much like the
chance to charge N times for the same domain-name.
My bottom line is that we MUST do case-mapping for at least
all of the Romanised letters. Again, I am not discussing
any CJK issues in this email.
Ran
rja@inet.org