Network Working Group M. Blanchet
Internet-Draft Viagenie
Obsoletes: 3454 (if approved) July 5, 2010
Intended status: Standards Track
Expires: January 6, 2011
Precis Framework: Handling Internationalized Strings in Protocols
draft-blanchet-precis-framework-00.txt
Abstract
Using Unicode codepoints in protocol strings requires preparation of
the string. This document describes the Precis Protocol Framework
that prepares various classes of strings used in protocol elements.
A protocol specification chooses a class of strings and then
implements the corresponding preparation steps described in this
document. This document is based on the IDNAbis approach. It
obsoletes the Stringprep algorithm.
Status of this Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at http://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on January 6, 2011.
Copyright Notice
Copyright (c) 2010 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
Blanchet Expires January 6, 2011 [Page 1]
Internet-Draft Precis Framework July 2010
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
This document may contain material from IETF Documents or IETF
Contributions published or made publicly available before November
10, 2008. The person(s) controlling the copyright in some of this
material may not have granted the IETF Trust the right to allow
modifications of such material outside the IETF Standards Process.
Without obtaining an adequate license from the person(s) controlling
the copyright in such materials, this document may not be modified
outside the IETF Standards Process, and derivative works of it may
not be created outside the IETF Standards Process, except to format
it for publication as an RFC or to translate it into languages other
than English.
Table of Contents
1. Introduction
2. String Classes
3. Domain U-Label, A-Label and Name
4. Email Addresses
5. Restricted Identifier
6. Less-Restrictive Identifier
7. Normalization Form and Case Folding
8. Codepoint Properties
9. Category definitions Used to Calculate Derived Property Value
9.1. LetterDigits (A)
9.2. Unstable (B)
9.3. IgnorableProperties (C)
9.4. IgnorableBlocks (D)
9.5. LDH (E)
9.6. Exceptions (F)
9.7. BackwardCompatible (G)
9.8. JoinControl (H)
9.9. OldHangulJamo (I)
9.10. Unassigned (J)
10. Calculation of the Derived Property
11. Codepoints
12. IANA Considerations
12.1. IDNA derived property value registry
12.2. IDNA Context Registry
12.2.1. Template for context registry
13. Security Considerations
14. Discussion home for this draft
15. Acknowledgements
Appendix A. Contextual Rules Registry
Appendix A.1. ZERO WIDTH NON-JOINER
Appendix A.2. ZERO WIDTH JOINER
Appendix A.3. MIDDLE DOT
Appendix A.4. GREEK LOWER NUMERAL SIGN (KERAIA)
Appendix A.5. HEBREW PUNCTUATION GERESH
Appendix A.6. HEBREW PUNCTUATION GERSHAYIM
Appendix A.7. KATAKANA MIDDLE DOT
Appendix A.8. ARABIC-INDIC DIGITS
Appendix A.9. EXTENDED ARABIC-INDIC DIGITS
Appendix B. Codepoints 0x0000 - 0x10FFFF
Appendix B.1. Codepoints in Unicode Character Database (UCD) format
16. Informative References
Author's Address
1. Introduction
[draft-ietf-blanchet-newprep-problem-statement] describes the
rationale behind updating Stringprep [RFC3454] to a new framework.
Current Stringprep profiles and their corresponding protocol
specifications share similar classes of strings. This framework is
based on the assumption that the use of internationalized strings in
most protocols can be grouped into a small set of string classes.
Defining a few string classes and their corresponding preparation
algorithms, instead of a specific profile for each protocol, has the
following benefits:
o protocol specifications do not need a special i18n section or
implementation, since they can reference one of the string classes
in this document and its corresponding processing.
o protocols benefit from sharing implementation code and tables.
o end-users will have a better knowledge of which codepoints are
allowed in various contexts (instead of a specific profile per
protocol, as with Stringprep profiles).
o versioning for future versions of the Unicode database is simpler.
o protocols that have similarities with others (such as username
identifiers used in various authentication schemes in protocols)
can use the same string class, and therefore obtain consistency
for end-users and implementors.
This framework draws heavily on the IDNAbis tables [IDNABISTABLES]
and could therefore help implementors by sharing common code for all
string classes, including domain labels and names.
EDITOR NOTE: This version of the document copies a lot of normative
text from draft-ietf-idnabis-tables. The editor would much prefer
referencing to copying but, at least for the purpose of discussion,
the text is copied. Moreover, the idnabis-tables draft refers to IDN
labels in many places, which may make a normative reference
problematic. To be looked at as we go.
2. String Classes
The following classes of strings are identified:
o domain U-label
o domain A-label
o domain name
o email address
o restricted identifier
o less-restrictive identifier
3. Domain U-Label, A-Label and Name
TBD:define the class.
For these string classes, implement [IDNA2008].
4. Email Addresses
TBD: define the class by instantiating and referring to the EAI, SMTP.
For this class of strings, implement [EAI]?
5. Restricted Identifier
This class of strings, named RI in this document, corresponds to an
identifier which contains language-type characters, no spacing
characters, no "@", no "punctuation", no display characters. The
normative description of this class is in the corresponding mapping
tables.
In section XX below, allowed Unicode codepoints for this string class
are identified as PVALID or RI_PVALID. Disallowed codepoints are
identified as DISALLOWED or RI_DISALLOWED.
6. Less-Restrictive Identifier
This class of strings, named LRI in this document, corresponds to an
identifier which contains language-type characters, no spacing
characters, no "@", but contains various "punctuation" and display
characters. The normative description of this class is in the
corresponding mapping tables.
In section XX below, allowed Unicode codepoints for this string class
are identified as PVALID or LRI_PVALID. Disallowed codepoints are
identified as DISALLOWED or LRI_DISALLOWED.
7. Normalization Form and Case Folding
TBD: discuss NFC vs NFKC, case folding.
8. Codepoint Properties
This document reviews and classifies the collections of code points
in the Unicode character set by examining various properties of the
code points. It then defines an algorithm for determining a derived
property value. It specifies a procedure, and not a table, of code
points so that the algorithm can be used to determine code point sets
independent of the version of Unicode that is in use.
This document is not intended to specify precisely how these property
values are to be applied in protocol strings. That information
should be defined in the protocol specification that instantiates a
string class of this document.
The value of the property is to be interpreted as follows.
o PROTOCOL VALID: Those that are allowed to be used in any string
class. Code points with this property value are permitted for
general use in any string class. The abbreviated term PVALID is
used to refer to this value in the rest of this document.
o SPECIFIC CLASS PROTOCOL VALID: Those that are allowed to be used
in specific string classes. Code points with this property value
are permitted for use in specific string classes. The abbreviated
term *_PVALID, where * = (RI, LRI) is used to refer to this value
in the rest of this document.
o CONTEXTUAL RULE REQUIRED: Some characteristics of the character,
such as it being invisible in certain contexts or problematic in
others, require that it not be used in labels unless specific
other characters or properties are present. The abbreviated term
CONTEXT is used to refer to this value in the rest of this
document. There are two subdivisions of CONTEXTUAL RULE REQUIRED,
one for Join_controls (called CONTEXTJ) and one for other
characters (called CONTEXTO).
o DISALLOWED: Those that should clearly not be included in any
string class. Code points with this property value are not
permitted in any string class.
o SPECIFIC CLASS DISALLOWED: Those that should clearly not be
included in specific string classes. Code points with this
property value are not permitted in the specific string classes
concerned. The abbreviated term *_DISALLOWED, where * = (RI, LRI),
is used to refer to this value in the rest of this document.
o UNASSIGNED: Those code points that are not designated (i.e. are
unassigned) in the Unicode Standard.
The mechanisms described here allow determination of the value of the
property for future versions of Unicode (including characters added
after Unicode 5.2). Changes in Unicode properties that do not affect
the outcome of this process do not affect this framework. For
example, a character can have its Unicode General_Category value (see
[Unicode52]) change from So to Sm, or from Lo to Ll, without
affecting the algorithm results. Moreover, even if such changes were
to result, the BackwardCompatible list (Section 9.7) can be adjusted
to ensure the stability of the results.
Some code points need to be allowed in exceptional circumstances, but
should be excluded in all other cases; these rules are also described
in other documents. The most notable of these are the Join Control
characters, U+200D ZERO WIDTH JOINER and U+200C ZERO WIDTH NON-
JOINER. Both of them have the derived property value CONTEXTJ. A
character with the derived property value CONTEXTJ or CONTEXTO
(CONTEXTUAL RULE REQUIRED) is not to be used unless an appropriate
rule has been established and the context of the character is
consistent with that rule. It is invalid to either register a string
containing these characters or even to look one up unless such a
contextual rule is found and satisfied. Please see Appendix A, The
Contextual Rules Registry, for more information.
9. Category definitions Used to Calculate Derived Property Value
The derived property obtains its value based on a two-step procedure.
First, characters are placed in one or more character categories
based on either core properties defined by the Unicode Standard or by
treating the codepoint as an exception and addressing the codepoint
by its codepoint value. These categories are not mutually exclusive.
In the second step, set operations are used with these categories to
determine the values for a string-class-specific property. Those
operations are specified in Section 10.
Unicode property names and property value names may have short
abbreviations, such as gc for the General_Category property, and Ll
for the Lowercase_Letter property value of the gc property.
In the following specification of categories, the operation which
returns the value of a particular Unicode character property for a
code point is designated by using the formal name of that property
(from PropertyAliases.txt) followed by '(cp)'. For example, the
value of the General_Category property for a code point is indicated
by General_Category(cp).
9.1. LetterDigits (A)
A: General_Category(cp) is in {Ll, Lu, Lo, Nd, Lm, Mn, Mc}
This rule identifies characters commonly used in mnemonics and
often informally described as "language characters".
For more information, see section 4.5 of [Unicode5].
The categories used in this rule are:
o Ll - Lowercase_Letter
o Lu - Uppercase_Letter
o Lo - Other_Letter
o Nd - Decimal_Number
o Lm - Modifier_Letter
o Mn - Nonspacing_Mark
o Mc - Spacing_Mark
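As an illustration only, the LetterDigits test can be sketched in
Python with the standard unicodedata module. The name
is_letter_digits is hypothetical, and the module reflects the Unicode
version bundled with the interpreter, which is not necessarily 5.2.

```python
import unicodedata

# General_Category values that make up LetterDigits (A), Section 9.1.
LETTER_DIGITS_GC = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}


def is_letter_digits(cp: int) -> bool:
    """Return True if General_Category(cp) is one of the (A) values.

    Hypothetical helper; uses the interpreter's Unicode data, which
    may be newer than Unicode 5.2.
    """
    return unicodedata.category(chr(cp)) in LETTER_DIGITS_GC
```

For example, is_letter_digits(0x0061) holds for LATIN SMALL LETTER A,
while is_letter_digits(0x0020) does not hold for SPACE.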
9.2. Unstable (B)
B: toNFKC(toCaseFold(toNFKC(cp))) != cp
This category is used to group the characters that are not stable
under NFKC normalization and casefolding. In general, these code
points are not suitable for use in any string class.
The toCaseFold() operation is defined in Section 3.13 of [Unicode5].
The toNFKC() operation returns the code point in normalization form
KC. For more information, see Section 5 of [TR15].
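The Unstable test can be sketched as follows, under the assumption
that Python's str.casefold() and unicodedata.normalize() are
acceptable stand-ins for toCaseFold() and toNFKC(). The function
names are hypothetical, and the results follow the Unicode version
bundled with the interpreter rather than Unicode 5.2.

```python
import unicodedata


def to_nfkc(s: str) -> str:
    """Stand-in for toNFKC(): normalization form KC."""
    return unicodedata.normalize("NFKC", s)


def is_unstable(cp: int) -> bool:
    """Unstable (B): toNFKC(toCaseFold(toNFKC(cp))) != cp.

    str.casefold() is used as an assumed equivalent of toCaseFold().
    """
    ch = chr(cp)
    return to_nfkc(to_nfkc(ch).casefold()) != ch
```

Under this sketch, uppercase letters, compatibility characters such
as U+FB01 LATIN SMALL LIGATURE FI, and U+00DF LATIN SMALL LETTER
SHARP S are all unstable. The last case is why U+00DF needs an entry
in the Exceptions category to remain usable.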
9.3. IgnorableProperties (C)
C: Default_Ignorable_Code_Point(cp) = True or
White_Space(cp) = True or
Noncharacter_Code_Point(cp) = True
This category is used to group code points that are not recommended
for use in identifiers. In general, these code points are not
suitable for identifiers.
The definition for Default_Ignorable_Code_Point can be found in
DerivedCoreProperties.txt [1] and is at the time of Unicode 5.2:
Other_Default_Ignorable_Code_Point + Cf (Format characters)
+ Variation_Selector - White_Space - FFF9..FFFB (Annotation
Characters) - 0600..0603, 06DD, 070F (exceptional Cf characters
that should be visible)
9.4. IgnorableBlocks (D)
D: Block(cp) is in {Combining Diacritical Marks for Symbols,
Musical Symbols, Ancient Greek Musical Notation}
This category is used to identify code points that are not useful
in mnemonics but may be useful for some string classes.
The definition of blocks can be found in Blocks.txt [2].
9.5. LDH (E)
E: cp is in {002D, 0030..0039, 0061..007A}
This category is used in the second step to preserve the traditional
"hostname" (LDH) characters ('-', 0-9 and a-z). In general, these
code points are suitable for use in identifiers.
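A minimal sketch of the LDH set as a Python set of integer code
points; the names LDH and is_ldh are chosen for illustration only.

```python
# LDH (E): {002D, 0030..0039, 0061..007A}
LDH = (
    {0x002D}
    | set(range(0x0030, 0x003A))   # '0'..'9'
    | set(range(0x0061, 0x007B))   # 'a'..'z'
)


def is_ldh(cp: int) -> bool:
    """Return True if cp is a hyphen, an ASCII digit or a-z."""
    return cp in LDH
```

Note that uppercase A-Z is not part of the set.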
9.6. Exceptions (F)
F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660,
0661, 0662, 0663, 0664, 0665, 0666, 0667, 0668,
0669, 06F0, 06F1, 06F2, 06F3, 06F4, 06F5, 06F6,
06F7, 06F8, 06F9, 06FD, 06FE, 07FA, 0F0B, 3007,
302E, 302F, 3031, 3032, 3033, 3034, 3035, 303B,
30FB}
This category explicitly lists code points for which the category
cannot be assigned using only the core property values that exist in
the Unicode standard. The values are according to the table below:
PVALID -- Would otherwise have been DISALLOWED
00DF; PVALID # LATIN SMALL LETTER SHARP S
03C2; PVALID # GREEK SMALL LETTER FINAL SIGMA
06FD; PVALID # ARABIC SIGN SINDHI AMPERSAND
06FE; PVALID # ARABIC SIGN SINDHI POSTPOSITION MEN
0F0B; PVALID # TIBETAN MARK INTERSYLLABIC TSHEG
3007; PVALID # IDEOGRAPHIC NUMBER ZERO
CONTEXTO -- Would otherwise have been DISALLOWED
00B7; CONTEXTO # MIDDLE DOT
0375; CONTEXTO # GREEK LOWER NUMERAL SIGN (KERAIA)
05F3; CONTEXTO # HEBREW PUNCTUATION GERESH
05F4; CONTEXTO # HEBREW PUNCTUATION GERSHAYIM
30FB; CONTEXTO # KATAKANA MIDDLE DOT
CONTEXTO -- Would otherwise have been PVALID
0660; CONTEXTO # ARABIC-INDIC DIGIT ZERO
0661; CONTEXTO # ARABIC-INDIC DIGIT ONE
0662; CONTEXTO # ARABIC-INDIC DIGIT TWO
0663; CONTEXTO # ARABIC-INDIC DIGIT THREE
0664; CONTEXTO # ARABIC-INDIC DIGIT FOUR
0665; CONTEXTO # ARABIC-INDIC DIGIT FIVE
0666; CONTEXTO # ARABIC-INDIC DIGIT SIX
0667; CONTEXTO # ARABIC-INDIC DIGIT SEVEN
0668; CONTEXTO # ARABIC-INDIC DIGIT EIGHT
0669; CONTEXTO # ARABIC-INDIC DIGIT NINE
06F0; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT ZERO
06F1; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT ONE
06F2; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT TWO
06F3; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT THREE
06F4; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT FOUR
06F5; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT FIVE
06F6; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT SIX
06F7; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT SEVEN
06F8; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT EIGHT
06F9; CONTEXTO # EXTENDED ARABIC-INDIC DIGIT NINE
DISALLOWED -- Would otherwise have been PVALID
0640; DISALLOWED # ARABIC TATWEEL
07FA; DISALLOWED # NKO LAJANYALAN
302E; DISALLOWED # HANGUL SINGLE DOT TONE MARK
302F; DISALLOWED # HANGUL DOUBLE DOT TONE MARK
3031; DISALLOWED # VERTICAL KANA REPEAT MARK
3032; DISALLOWED # VERTICAL KANA REPEAT WITH VOICED SOUND MARK
3033; DISALLOWED # VERTICAL KANA REPEAT MARK UPPER HALF
3034; DISALLOWED # VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HA
3035; DISALLOWED # VERTICAL KANA REPEAT MARK LOWER HALF
303B; DISALLOWED # VERTICAL IDEOGRAPHIC ITERATION MARK
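For implementors, the table above could be held as a simple mapping
from code point to derived property value. The following Python
sketch transcribes the table; the names EXCEPTIONS and
_build_exceptions are hypothetical.

```python
from typing import Dict


def _build_exceptions() -> Dict[int, str]:
    """Transcribe the Exceptions (F) table of Section 9.6."""
    table: Dict[int, str] = {}
    # PVALID -- would otherwise have been DISALLOWED
    for cp in (0x00DF, 0x03C2, 0x06FD, 0x06FE, 0x0F0B, 0x3007):
        table[cp] = "PVALID"
    # CONTEXTO -- would otherwise have been DISALLOWED
    for cp in (0x00B7, 0x0375, 0x05F3, 0x05F4, 0x30FB):
        table[cp] = "CONTEXTO"
    # CONTEXTO -- would otherwise have been PVALID
    # (Arabic-Indic and Extended Arabic-Indic digits)
    for cp in list(range(0x0660, 0x066A)) + list(range(0x06F0, 0x06FA)):
        table[cp] = "CONTEXTO"
    # DISALLOWED -- would otherwise have been PVALID
    for cp in (0x0640, 0x07FA, 0x302E, 0x302F, 0x3031,
               0x3032, 0x3033, 0x3034, 0x3035, 0x303B):
        table[cp] = "DISALLOWED"
    return table


EXCEPTIONS = _build_exceptions()
```

The mapping has 41 entries, matching the 41 code points listed in the
definition of F.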
9.7. BackwardCompatible (G)
G: cp is in {}
This category includes the code points whose property values in
versions of Unicode after 5.2 have changed in such a way that the
derived property value would no longer be PVALID or DISALLOWED. If
changes are made in future versions of Unicode so that code points
might change property value from PVALID or DISALLOWED, then this
table can be updated to keep special exception values so that the
property values for code points stay stable.
9.8. JoinControl (H)
H: Join_Control(cp) = True
This category consists of Join Control characters, which are not in
LetterDigits (Section 9.1) but are still required in strings under
some circumstances.
9.9. OldHangulJamo (I)
I: Hangul_Syllable_Type(cp) is in {L, V, T}
This category consists of all conjoining Hangul Jamo (Leading Jamo,
Vowel Jamo, and Trailing Jamo).
Elimination of conjoining Hangul Jamos from the set of PVALID
characters results in restricting the set of Korean PVALID characters
just to preformed, modern Hangul syllable characters. Old Hangul
syllables, which must be spelled with sequences of conjoining Hangul
Jamos, are not PVALID for string classes.
9.10. Unassigned (J)
J: General_Category(cp) is in {Cn} and
Noncharacter_Code_Point(cp) = False
This category consists of code points in the Unicode character set
that are not (yet) assigned. It should be noted that Unicode
distinguishes between 'unassigned code points' and 'unassigned
characters'. The unassigned code points are all but (Cn -
Noncharacters), while the unassigned *characters* are all but (Cn +
Cs).
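The Unassigned test can be sketched in Python as shown below. The
function names are hypothetical. Noncharacter_Code_Point is computed
arithmetically from the fixed set of 66 noncharacters (U+FDD0..U+FDEF
and the last two code points of each plane), and General_Category
comes from the interpreter's Unicode data, so a code point that was
unassigned in Unicode 5.2 may be reported as assigned.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """Noncharacter_Code_Point(cp): U+FDD0..U+FDEF, U+nFFFE, U+nFFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_unassigned(cp: int) -> bool:
    """Unassigned (J): gc is Cn and cp is not a noncharacter.

    Uses the interpreter's Unicode version, not necessarily 5.2.
    """
    return unicodedata.category(chr(cp)) == "Cn" and not is_noncharacter(cp)
```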
10. Calculation of the Derived Property
Possible values of the property are:
o PVALID
o RI_PVALID
o LRI_PVALID
o CONTEXTJ
o CONTEXTO
o DISALLOWED
o RI_DISALLOWED
o LRI_DISALLOWED
o UNASSIGNED
The algorithm to calculate the value of the derived property is as
follows. If the name of a rule (such as Exceptions) is used, it
denotes the set of codepoints that the rule defines, while the same
name used as a function call (such as Exceptions(cp)) denotes the
value that cp has in the Exceptions table.
If .cp. .in. Exceptions Then Exceptions(cp);
Else If .cp. .in. BackwardCompatible Then BackwardCompatible(cp);
Else If .cp. .in. Unassigned Then UNASSIGNED;
Else If .cp. .in. LDH Then PVALID;
Else If .cp. .in. JoinControl Then CONTEXTJ;
Else If .cp. .in. Unstable Then DISALLOWED;
Else If .cp. .in. IgnorableProperties Then DISALLOWED;
Else If .cp. .in. IgnorableBlocks Then LRI_PVALID;
Else If .cp. .in. OldHangulJamo Then DISALLOWED;
Else If .cp. .in. LetterDigits Then PVALID;
Else DISALLOWED;
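The ordered evaluation above can be sketched in Python as a list of
(predicate, value) pairs tried in sequence. This is an illustration
under stated assumptions, not a reference implementation: all names
are hypothetical, the Exceptions mapping is only a three-entry
excerpt, IgnorableProperties and OldHangulJamo are placeholders that
never match (they need UCD data not available in the standard
library), and the Unicode data is that of the interpreter rather than
Unicode 5.2.

```python
import unicodedata
from typing import Callable, Dict, List, Tuple

Predicate = Callable[[int], bool]


def derived_property(cp: int,
                     exceptions: Dict[int, str],
                     backward_compatible: Dict[int, str],
                     rules: List[Tuple[Predicate, str]]) -> str:
    """Apply the rules of Section 10 in order; first match wins."""
    if cp in exceptions:
        return exceptions[cp]
    if cp in backward_compatible:
        return backward_compatible[cp]
    for predicate, value in rules:
        if predicate(cp):
            return value
    return "DISALLOWED"


def _nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)


def _noncharacter(cp: int) -> bool:
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


LETTER_DIGITS_GC = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}

# Same order as the pseudo code, after Exceptions and BackwardCompatible.
RULES: List[Tuple[Predicate, str]] = [
    # Unassigned
    (lambda cp: unicodedata.category(chr(cp)) == "Cn"
     and not _noncharacter(cp), "UNASSIGNED"),
    # LDH
    (lambda cp: cp == 0x2D or 0x30 <= cp <= 0x39 or 0x61 <= cp <= 0x7A,
     "PVALID"),
    # JoinControl: U+200C and U+200D
    (lambda cp: cp in (0x200C, 0x200D), "CONTEXTJ"),
    # Unstable
    (lambda cp: _nfkc(_nfkc(chr(cp)).casefold()) != chr(cp), "DISALLOWED"),
    # IgnorableProperties: PLACEHOLDER, never matches in this sketch
    (lambda cp: False, "DISALLOWED"),
    # IgnorableBlocks: block ranges taken from Blocks.txt
    (lambda cp: 0x20D0 <= cp <= 0x20FF      # Combining Diacritical Marks for Symbols
     or 0x1D100 <= cp <= 0x1D1FF            # Musical Symbols
     or 0x1D200 <= cp <= 0x1D24F,           # Ancient Greek Musical Notation
     "LRI_PVALID"),
    # OldHangulJamo: PLACEHOLDER, never matches in this sketch
    (lambda cp: False, "DISALLOWED"),
    # LetterDigits
    (lambda cp: unicodedata.category(chr(cp)) in LETTER_DIGITS_GC, "PVALID"),
]

# Excerpt of the Exceptions table, for demonstration only.
EXCEPTIONS_EXCERPT = {0x00DF: "PVALID", 0x00B7: "CONTEXTO",
                      0x0640: "DISALLOWED"}
BACKWARD_COMPATIBLE: Dict[int, str] = {}  # G is empty in this document


def derive(cp: int) -> str:
    return derived_property(cp, EXCEPTIONS_EXCERPT,
                            BACKWARD_COMPATIBLE, RULES)
```

Because of the two placeholders, code points that are default
ignorable, white space, or conjoining Hangul Jamo and also fall in
IgnorableBlocks or LetterDigits are classified incorrectly by this
sketch. A complete implementation would supply those predicates from
the Unicode Character Database.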
11. Codepoints
The Categories and Rules defined in Section 9 and Section 10 apply to
all Unicode code points. The table in Appendix B shows, for
illustrative purposes, the consequences of the categories and
classification rules, and the resulting property values.
The list of code points that can be found in Appendix B is non-
normative. Section 9 and Section 10 are normative.
12. IANA Considerations
12.1. IDNA derived property value registry
IANA is to create a registry with the derived properties for the
versions of Unicode that are released after (and including) version
5.2. The derived property value is to be calculated in cooperation
with a designated expert [RFC5226] according to the specifications in
Section 9 and Section 10 and not by copying the non-normative table
found in Appendix B.
If, during this process (creation of the table of derived property
values, followed by a designated expert review), either changes to
the table of derived properties that are not backward compatible are
discovered, or other problems arise during the creation of the table,
this is to be flagged to the IESG. Changes to the rules (as specified
in Section 9 and Section 10), including BackwardCompatible
(Section 9.7) (a set that is empty at the release of this document),
require IETF Review, as described in [RFC5226].
12.2. IDNA Context Registry
For characters that are defined in the IDNA derived property value
registry (Section 12.1) as CONTEXTO or CONTEXTJ, and that therefore
require a contextual rule, IANA will create and maintain a list of
approved contextual rules. Additions or changes to these rules
require IETF Review, as described in [RFC5226].
A table from which that registry can be initialized, and some further
discussion, appear in Appendix A.
12.2.1. Template for context registry
The following information is to be given when a new rule is created.
Name: Unique name of the rule
Code point: Rule should be applied when this codepoint exists in a
label
Overview: Description in plain English of what the rule verifies
Lookup: Should rule be applied at time of lookup?
Rule Set: The set of rules, as described in
13. Security Considerations
TBD
14. Discussion home for this draft
This document is discussed on the precis@ietf.org mailing list. (This
section is to be removed when published as an RFC.)
15. Acknowledgements
The author of this document would like to acknowledge the comments
and contributions of the following people: ...
Since this document copies a lot of text and the algorithms from the
IDNAbis tables, all authors and contributors to the idnabis work are
gratefully acknowledged.
Appendix A. Contextual Rules Registry
As discussed in Section 12.2, a registry is to be created of rules
that define the contexts in which particular PROTOCOL-VALID
characters, that is, characters associated with a requirement for
Contextual Information, are permitted. These rules are expressed as
tests on the label in which the characters appear (all, or any part
of, the label may be tested).
The grammatical rules are expressed in pseudo code. The conventions
used for that pseudo code are explained here.
Each rule is constructed as a Boolean expression that evaluates to
either True or False. A simple "True;" or "False;" rule sets the
default result value for the rule set. Subsequent conditional rules
that evaluate to True or False may re-set the result value.
A special value "Undefined" is used to deal with any error
conditions, such as an attempt to test a character before the start
of a label or after the end of a label. If any term of a rule
evaluates to Undefined, further evaluation of the rule immediately
terminates, as the result value of the rule will itself be Undefined.
cp represents the codepoint to be tested.
FirstChar is a special term which denotes the first codepoint in a
string.
LastChar is a special term which denotes the last codepoint in a
string.
.eq. represents the equality relation.
A .eq. B evaluates to True if A equals B.
.is. represents checking position in a string.
A .is. B evaluates to True if A and B have the same position in
the same string.
.ne. represents the non-equality relation.
A .ne. B evaluates to True if A is not equal to B.
.in. represents the set inclusion relation.
A .in. B evaluates to True if A is a member of the set B.
A functional notation, Function_Name(cp), is used to express either
string positions within a string, Boolean character property tests of
a codepoint, or a regular expression match. When such function names
refer to Boolean character property tests, the function names use the
exact Unicode character property name for the property in question,
and "cp" is evaluated as the Unicode value of the codepoint to be
tested, rather than as its position in the string. When such
function names refer to string positions within a string, "cp" is
evaluated as its position in the string.
RegExpMatch(X) takes as its parameter X a schematic regular
expression consisting of a mix of Unicode character property values
and literal Unicode codepoints.
Script(cp) returns the value of the Unicode Script property, as
defined in Scripts.txt in the Unicode Character Database.
Canonical_Combining_Class(cp) returns the value of the Unicode
Canonical_Combining_Class property, as defined in UnicodeData.txt in
the Unicode Character Database.
Before(cp) returns the codepoint of the character immediately
preceding cp in logical order in the string. Before(FirstChar)
evaluates to Undefined.
After(cp) returns the codepoint of the character immediately
following cp in logical order in the string. After(LastChar)
evaluates to Undefined.
Note that "Before" and "After" do not refer to the visual display
order of the character in a string, which may be reversed or
otherwise modified by the bidirectional algorithm for strings
including characters from scripts written right-to-left. Instead,
'Before' and 'After' refer to the network order of the character in
the string.
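A minimal sketch of Before and After in Python, assuming the string
is held in logical (network) order as a Python str and that the
character under test is identified by its index. The function names
are hypothetical, and None is used here to stand for Undefined.

```python
from typing import Optional


def before(s: str, i: int) -> Optional[str]:
    """Character immediately preceding position i, in logical order.

    Returns None (standing for Undefined) for the first character
    or for an index outside the string.
    """
    return s[i - 1] if 0 < i < len(s) else None


def after(s: str, i: int) -> Optional[str]:
    """Character immediately following position i, in logical order.

    Returns None (standing for Undefined) for the last character
    or for an index outside the string.
    """
    return s[i + 1] if 0 <= i < len(s) - 1 else None
```

For the string "abc", before("abc", 1) is "a" and after("abc", 2) is
None, since the last character has no successor.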
The clauses "Then True" and "Then False" imply exit from the pseudo-
code routine with the corresponding result.
Repeated evaluation for all characters in a string makes use of the
special construct:
For All Characters:
Expression;
End For;
This construct requires repeated evaluation of "Expression" for each
codepoint in the string, starting from FirstChar and proceeding to
LastChar.
The different fields in the rules are to be interpreted as follows:
Code point:
The codepoint, or codepoints, that this rule is to be applied to.
Normally, this implies that if any of the codepoints in a string
is as defined, then the rules should be applied. If evaluated to
True, the codepoint is OK as used; if evaluated to False, it is
not OK.
Overview:
A description of the goal with the rule, in plain English.
Lookup:
True if application of this rule is recommended at lookup time;
False otherwise.
Rule Set:
The rule set itself, as described above.
Appendix A.1. ZERO WIDTH NON-JOINER
Code point:
U+200C
Overview:
This may occur in a formally cursive script (such as Arabic) in a
context where it breaks a cursive connection as required for
orthographic rules, as in the Persian language, for example. It
also may occur in Indic scripts in a consonant conjunct context
(immediately following a virama), to control required display of
such conjuncts.
Lookup:
True
Rule Set:
False;
If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C
(Joining_Type:T)*(Joining_Type:{R,D})) Then True;
Appendix A.2. ZERO WIDTH JOINER
Code point:
U+200D
Overview:
This may occur in Indic scripts in a consonant conjunct context
(immediately following a virama), to control required display of
such conjuncts.
Lookup:
True
Rule Set:
False;
If Canonical_Combining_Class(Before(cp)) .eq. Virama Then True;
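The ZERO WIDTH JOINER rule can be sketched in Python using
unicodedata.combining(), which returns the canonical combining class
as an integer; the class named Virama has the numeric value 9. The
function name is hypothetical, and treating an Undefined Before(cp)
as "not allowed" is an assumption of this sketch.

```python
import unicodedata

VIRAMA_CCC = 9  # numeric value of Canonical_Combining_Class "Virama"


def zwj_allowed(label: str, i: int) -> bool:
    """Evaluate the Appendix A.2 rule for the U+200D at index i."""
    if label[i] != "\u200D":
        raise ValueError("index does not point at ZERO WIDTH JOINER")
    if i == 0:
        # Before(cp) is Undefined; assumed here to mean "not allowed".
        return False
    return unicodedata.combining(label[i - 1]) == VIRAMA_CCC
```

For instance, the rule holds for the joiner in
"\u0915\u094D\u200D\u0937", where it follows U+094D DEVANAGARI SIGN
VIRAMA, and fails for the joiner in "a\u200Db".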
Appendix A.3. MIDDLE DOT
Code point:
U+00B7
Overview:
Between 'l' (U+006C) characters only, used to permit the Catalan
character ela geminada to be expressed.
Lookup:
False
Rule Set:
False;
If Before(cp) .eq. U+006C And
After(cp) .eq. U+006C Then True;
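As a sketch, the MIDDLE DOT rule translates to a direct comparison of
the neighbouring characters. The function name is hypothetical, and a
missing neighbour (Undefined) is assumed to make the rule fail.

```python
def middle_dot_allowed(label: str, i: int) -> bool:
    """Evaluate the Appendix A.3 rule for the U+00B7 at index i."""
    if label[i] != "\u00B7":
        raise ValueError("index does not point at MIDDLE DOT")
    if i == 0 or i == len(label) - 1:
        # Before(cp) or After(cp) is Undefined; assumed "not allowed".
        return False
    return label[i - 1] == "\u006C" and label[i + 1] == "\u006C"
```

The rule holds for the dot in "col\u00B7lega" and fails for the dot
in "a\u00B7b".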
Appendix A.4. GREEK LOWER NUMERAL SIGN (KERAIA)
Code point:
U+0375
Overview:
The script of the following character MUST be Greek.
Lookup:
False
Rule Set:
False;
If Script(After(cp)) .eq. Greek Then True;
Appendix A.5. HEBREW PUNCTUATION GERESH
Code point:
U+05F3
Overview:
The script of the preceding character MUST be Hebrew.
Lookup:
False
Rule Set:
False;
If Script(Before(cp)) .eq. Hebrew Then True;
Appendix A.6. HEBREW PUNCTUATION GERSHAYIM
Code point:
U+05F4
Overview:
The script of the preceding character MUST be Hebrew.
Lookup:
False
Rule Set:
False;
If Script(Before(cp)) .eq. Hebrew Then True;
Appendix A.7. KATAKANA MIDDLE DOT
Code point:
U+30FB
Overview:
Note that the Script of Katakana Middle Dot is not any of
"Hiragana", "Katakana" or "Han". The effect of this rule is to
require at least one character in the label to be in one of those
scripts.
Lookup:
False
Rule Set:
False;
For All Characters:
If Script(cp) .in. {Hiragana, Katakana, Han} Then True;
End For;
Appendix A.8. ARABIC-INDIC DIGITS
Code point:
0660..0669
Overview:
Cannot be mixed with Extended Arabic-Indic Digits.
Lookup:
False
Rule Set:
True;
For All Characters:
If cp .in. 06F0..06F9 Then False;
End For;
Appendix A.9. EXTENDED ARABIC-INDIC DIGITS
Code point:
06F0..06F9
Overview:
Cannot be mixed with Arabic-Indic Digits.
Lookup:
False
Rule Set:
True;
For All Characters:
If cp .in. 0660..0669 Then False;
End For;
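The two digit rules (Appendix A.8 and Appendix A.9) both use the
"For All Characters" construct with a default of True. A minimal
Python sketch, with hypothetical function names, scans the whole
label for a digit of the other kind:

```python
ARABIC_INDIC = range(0x0660, 0x066A)            # 0660..0669
EXTENDED_ARABIC_INDIC = range(0x06F0, 0x06FA)   # 06F0..06F9


def arabic_indic_digits_allowed(label: str) -> bool:
    """Appendix A.8: False if any Extended Arabic-Indic digit occurs."""
    return not any(ord(ch) in EXTENDED_ARABIC_INDIC for ch in label)


def extended_arabic_indic_digits_allowed(label: str) -> bool:
    """Appendix A.9: False if any Arabic-Indic digit occurs."""
    return not any(ord(ch) in ARABIC_INDIC for ch in label)
```

A label that mixes both kinds of digit, such as "\u0661\u06F2", fails
both rules.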
Appendix B. Codepoints 0x0000 - 0x10FFFF
If one applies the rules (Section 10) to the code points 0x0000 to
0x10FFFF of Unicode 5.2, the result is as follows.
This list is non-normative, and only included for illustrative
purposes. Specifically, what is displayed in the third column is not
the formal name of the codepoint (as defined in section 4.8 of
[Unicode52]). Differences exist, for example, for the codepoints
that have the codepoint value as part of the name (example: CJK
UNIFIED IDEOGRAPH-4E00) and in the naming of Hangul syllables. For
many codepoints, what you see is the official name.
Appendix B.1. Codepoints in Unicode Character Database (UCD) format
0000..10FFFF; TBD!
16. Informative References
[RFC3454] Hoffman, P. and M. Blanchet, "Preparation of
Internationalized Strings ("stringprep")", RFC 3454,
December 2002.
[RFC3490] Faltstrom, P., Hoffman, P., and A. Costello,
"Internationalizing Domain Names in Applications (IDNA)",
RFC 3490, March 2003.
[RFC3491] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
Profile for Internationalized Domain Names (IDN)",
RFC 3491, March 2003.
[RFC3492] Costello, A., "Punycode: A Bootstring encoding of Unicode
for Internationalized Domain Names in Applications
(IDNA)", RFC 3492, March 2003.
[RFC3722] Bakke, M., "String Profile for Internet Small Computer
Systems Interface (iSCSI) Names", RFC 3722, April 2004.
[RFC3920] Saint-Andre, P., Ed., "Extensible Messaging and Presence
Protocol (XMPP): Core", RFC 3920, October 2004.
[RFC4011] Waldbusser, S., Saperia, J., and T. Hongal, "Policy Based
Management MIB", RFC 4011, March 2005.
[RFC4013] Zeilenga, K., "SASLprep: Stringprep Profile for User Names
and Passwords", RFC 4013, February 2005.
[RFC4505] Zeilenga, K., "Anonymous Simple Authentication and
Security Layer (SASL) Mechanism", RFC 4505, June 2006.
[RFC4518] Zeilenga, K., "Lightweight Directory Access Protocol
(LDAP): Internationalized String Preparation", RFC 4518,
June 2006.
[RFC4690] Klensin, J., Faltstrom, P., Karp, C., and IAB, "Review and
Recommendations for Internationalized Domain Names
(IDNs)", RFC 4690, September 2006.
[1] <http://unicode.org/Public/UNIDATA/DerivedCoreProperties.txt>
[2] <http://unicode.org/Public/UNIDATA/Blocks.txt>
Author's Address
Marc Blanchet
Viagenie
2600 boul. Laurier, suite 625
Quebec, QC G1V 4W1
Canada
Email: Marc.Blanchet@viagenie.ca
URI: http://www.viagenie.ca