<?xml version="1.0"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY rfc2119 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">
<!ENTITY rfc3490 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3490.xml">
<!ENTITY rfc3491 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3491.xml">
<!ENTITY rfc3987 SYSTEM "http://xml.resource.org/public/rfc/bibxml/reference.RFC.3987.xml">
]>
<?rfc strict='yes'?>

<?xml-stylesheet type='text/css' href='rfc2629.css' ?>
<?xml-stylesheet type='text/xsl' href='rfc2629.xslt' ?>
<?rfc symrefs='yes'?>
<?rfc sortrefs='yes'?>
<?rfc iprnotified="no" ?>
<?rfc toc='yes'?>
<?rfc compact='yes'?>
<?rfc subcompact='no'?>
<rfc ipr="pre5378Trust200902" docName="draft-ietf-iri-bidi-guidelines-02"
  category="bcp" xml:lang="en">
  <front>
    <title abbrev="Bidi IRI Guidelines">Guidelines for Internationalized
      Resource Identifiers with Bi-directional Characters (Bidi IRIs)</title>
    <author initials="M.J." surname="Duerst" fullname="Martin Duerst">
      <!-- (Note: Please write "Duerst" with u-umlaut wherever
        possible, for example as "Dürst" in XML and HTML) -->
      <organization abbrev="Aoyama Gakuin University">Aoyama Gakuin
        University</organization>
      <address>
        <postal>
          <street>5-10-1 Fuchinobe</street>
          <city>Sagamihara</city>
          <region>Kanagawa</region>
          <code>229-8558</code>
          <country>Japan</country>
        </postal>
        <phone>+81 42 759 6329</phone>
        <facsimile>+81 42 759 6495</facsimile>
        <email>duerst@it.aoyama.ac.jp</email>
        <uri>http://www.sw.it.aoyama.ac.jp/D%C3%BCrst/<!-- (Note: This is the percent-encoded form of an IRI)--></uri>
      </address>
    </author>
    <author initials="L." surname="Masinter" fullname="Larry Masinter">
      <organization>Adobe</organization>
      <address>
        <postal>
          <street>345 Park Ave</street>
          <city>San Jose</city>
          <region>CA</region>
          <code>95110</code>
          <country>U.S.A.</country>
        </postal>
        <phone>+1-408-536-3024</phone>
        <email>masinter@adobe.com</email>
        <uri>http://larry.masinter.net</uri>
      </address>
    </author>
    <author initials="A." surname="Allawi" fullname="Adil Allawi">
      <organization>Diwan Software Limited</organization>
      <address>
        <postal>
          <street>37-39 Peckham Road</street>
          <city>London</city>
          <code>SE5 8UH</code>
          <country>United Kingdom</country>
        </postal>
        <phone>+44 7718 785850</phone>
        <facsimile>+44 20 72525444</facsimile>
        <email>adil@diwan.com</email>
        <uri>http://ironymark.diwan.com/</uri>
      </address>
    </author>
    <date year="2012" month="March" day="9"/>
    <area>Applications</area>
    <workgroup>Internationalized Resource Identifiers (iri)</workgroup>
    <keyword>IRI</keyword>
    <keyword>Internationalized Resource Identifier</keyword>
    <keyword>BIDI</keyword>
    <keyword>URI</keyword>
    <keyword>URL</keyword>
    <keyword>IDN</keyword>
    <abstract>
      <t>This specification gives guidelines for selection, use, and
        presentation of Internationalized Resource Identifiers (IRIs) that
        include characters with inherent right-to-left (rtl) writing
        direction. </t>
    </abstract>
  </front>
  <middle>
    <section title="Introduction">
      <t>Some UCS characters, such as those used in the Arabic and Hebrew
        scripts, have an inherent right-to-left (rtl) writing direction as
        opposed to characters, such as those in Latin scripts, that have an
        inherent left-to-right (ltr) direction. IRIs containing rtl characters
        (called bidirectional IRIs or Bidi IRIs) require additional attention
        because of the non-trivial relation between their logical and visual
        ordering. The logical order represents the order in which the characters
        are read and stored on computers. The visual order represents the order
        the characters are drawn on a computer display or printout in the way a
        human expects to read them.</t>
      <t>Generally, alphabetic characters in scripts like Arabic and Hebrew are
        drawn rtl while numbers are drawn ltr. Symbols, such as slash '/' and
        period '.', take their visual direction from the surrounding
        characters.</t>
      <t>Because of this complex interaction between the logical representation,
        the visual representation, and the syntax of a Bidi IRI, a balance is
        needed between various requirements. The main requirements are: <list
        style="hanging">
        <t hangText="1.">user-predictable conversion between visual and logical
          representation;</t>
        <t hangText="2.">the ability to include a wide range of characters in
          various parts of the IRI; and</t>
        <t hangText="3.">minor or no changes or restrictions for
          implementations.</t>
        </list></t>
      <section title="Notation">
        <t>In this document, "Bidi Notation" is used for the given Bidi IRI
          examples as follows: Lower case letters a-z stand for characters that
          are written with a left-to-right ordering (such as Latin characters),
          whereas upper case letters A-Z represent characters that are written
          right-to-left (such as Arabic or Hebrew characters). Numbers and
          symbols stand for themselves.</t>
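        <t>To try the examples in a real bidirectional renderer, a string in
          Bidi Notation can be turned into one that contains genuine
          right-to-left characters. The following Python sketch is illustrative
          only: the function name to_hebrew and the mapping of A-Z onto the
          Hebrew letters U+05D0 through U+05E9 are assumptions of this sketch
          and are not defined by this document.</t>
        <figure><artwork><![CDATA[
```python
import unicodedata

HEBREW_ALEF = 0x05D0  # assumed mapping: "A" is replaced by U+05D0


def to_hebrew(bidi_notation: str) -> str:
    """Replace A-Z with Hebrew letters; leave all other characters alone."""
    out = []
    for ch in bidi_notation:
        if "A" <= ch <= "Z":
            out.append(chr(HEBREW_ALEF + ord(ch) - ord("A")))
        else:
            out.append(ch)
    return "".join(out)


sample = to_hebrew("http://ab.CD.ij/")
# Show the Unicode bidi class of the two substituted characters.
print([unicodedata.bidirectional(c) for c in sample[10:12]])
```
]]></artwork></figure>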
        <t> In this document, the key words "MUST", "MUST NOT", "REQUIRED",
          "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
          and "OPTIONAL" are to be interpreted as described in <xref
            target="RFC2119"/>.</t>
      </section>
      <!-- Notation -->
    </section>
    <!-- Introduction -->
    <section title="Logical Storage and Visual Presentation" anchor="visual">
      <t>When stored or transmitted in digital representation, Bidi IRIs MUST be
        in full logical order and MUST conform to the IRI syntax rules (which
        include the rules relevant to their scheme). This ensures that
        Bidi IRIs can be processed in the same way as other IRIs.</t>
      <t>Bidi IRIs MUST be visually ordered by the Unicode Bidirectional
        Algorithm <xref target="UNIV6"/>, <xref target="UNI9"/>. Bidi IRIs MUST
        be rendered in the same way as they would be if they were in a
        left-to-right embedding. </t>
      <t>In conformance with the Unicode Bidirectional Algorithm, embedding MAY
        be done in one of two ways: <list style="hanging">
        <t hangText="1.">precede the IRI with U+202A, LEFT-TO-RIGHT EMBEDDING
          (LRE), and follow with U+202C, POP DIRECTIONAL FORMATTING (PDF);
          or</t>
        <t hangText="2.">use a higher-level protocol (e.g., the dir='ltr'
          attribute in HTML).</t>
        </list></t>
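      <t>The first way of embedding can be sketched in Python as follows. This
        is a minimal illustration, not a normative procedure; the function
        names embed_for_display and strip_embedding are hypothetical.</t>
      <figure><artwork><![CDATA[
```python
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def embed_for_display(iri: str) -> str:
    """Return the IRI wrapped in LRE ... PDF, for plain-text display."""
    return LRE + iri + PDF


def strip_embedding(displayed: str) -> str:
    """Remove a surrounding LRE ... PDF pair, if one is present."""
    if displayed.startswith(LRE) and displayed.endswith(PDF):
        return displayed[1:-1]
    return displayed
```
]]></artwork></figure>
      <t>In this sketch, the wrapped string is meant only for display; the
        stored or transmitted form is the one returned by strip_embedding.</t>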
      <t>Preceding and following the Bidi IRI with U+200E, LEFT-TO-RIGHT MARK
        (LRM), is NOT RECOMMENDED, as there are cases where this is not
        sufficient to match a full left-to-right embedding.</t>
      <t>There is no requirement to use embedding if the display is still the
        same without the embedding. For example, a Bidi IRI in a text
        with left-to-right base directionality (such as used for English or
        Cyrillic) that is preceded and followed by whitespace and strong
        left-to-right characters does not need an embedding. Also, a
        bidirectional relative IRI reference that only contains strong
        right-to-left characters and weak characters (such as symbols) and that
        starts and ends with a strong right-to-left character and appears in a
        text with right-to-left base directionality (such as used for Arabic or
        Hebrew) and is preceded and followed by whitespace and strong characters
        does not need an embedding.</t>
      <t>However, implementers are RECOMMENDED to use embedding in all cases
        where they are not completely sure that the display behavior is
        unaffected without the embedding.</t>
      <t>The Unicode Bidirectional Algorithm (<xref target="UNI9"/>, section
        4.3) permits higher-level protocols to influence bidirectional
        rendering. Such changes by higher-level protocols MUST NOT be used if
        they change the rendering of IRIs.</t>
      <t>The bidirectional formatting characters that may be used before or
        after the IRI to ensure correct display are not themselves part of the
        IRI. IRIs MUST NOT contain bidirectional formatting characters (LRM,
        RLM, LRE, RLE, LRO, RLO, and PDF). They affect the visual rendering of
        the IRI but do not appear themselves. It would therefore not be possible
        to input an IRI with such characters correctly.</t>
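      <t>A check for the prohibited characters might look like the following
        Python sketch. The code points are those of the seven formatting
        characters listed above; the function name find_bidi_formatting and
        the shape of its return value are assumptions made for
        illustration.</t>
      <figure><artwork><![CDATA[
```python
# The seven bidirectional formatting characters named in the text.
BIDI_FORMATTING = {
    "\u200E": "LRM",
    "\u200F": "RLM",
    "\u202A": "LRE",
    "\u202B": "RLE",
    "\u202C": "PDF",
    "\u202D": "LRO",
    "\u202E": "RLO",
}


def find_bidi_formatting(iri: str) -> list:
    """Return (index, name) for each formatting character in the IRI."""
    return [
        (index, BIDI_FORMATTING[ch])
        for index, ch in enumerate(iri)
        if ch in BIDI_FORMATTING
    ]
```
]]></artwork></figure>
      <t>An empty result means that none of these characters occur in the
        string; it says nothing about other aspects of IRI syntax.</t>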
    </section>
    <!-- visual -->
    <section title="Bidi IRI Structure" anchor="bidi-structure">
      <t>The Unicode Bidirectional Algorithm is designed mainly for plain text.
        To make sure that it does not affect the rendering of Bidi IRIs outside
        of the requirements of this document, some restrictions on Bidi IRIs are
        necessary. These restrictions are given in terms of delimiters
        (structural characters, mostly punctuation such as "@", ".", ":", and
        "/") and components (usually consisting mostly of letters and
        digits).</t>
      <t>The following syntax rules from the ABNF of <xref target="RFC3987bis"/>
        correspond to components for the purpose of Bidi behavior: iuserinfo,
        ireg-name, isegment, isegment-nz, isegment-nz-nc, iquery, and
        ifragment.</t>
      <t>Specifications that define the syntax of any of the above components
        MAY divide them further and define smaller parts to be components
        according to this document. As an example, the restrictions of <xref
          target="RFC3490"/> on bidirectional domain names correspond to treating
        each label of a domain name as a component for schemes with ireg-name as
        a domain name. Even where the components are not defined formally, it
        may be helpful to think about some syntax in terms of components and to
        apply the relevant restrictions. For example, for the usual name/value
        syntax in query parts, it is convenient to treat each name and each
        value as a component. As another example, the extensions in a resource
        name can be treated as separate components.</t>
      <t>For each component, the following restrictions apply:</t>
      <t> <list style="hanging">
        <t hangText="1.">A component SHOULD NOT use both right-to-left and
          left-to-right characters.</t>
        <t hangText="2.">A component using right-to-left characters SHOULD start
          and end with right-to-left characters.</t>
      </list></t>
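      <t>The two restrictions can be checked per component with the Unicode
        bidirectional character classes. The Python sketch below is one
        possible reading, under the assumption that "right-to-left characters"
        means the strong classes R and AL and "left-to-right characters" means
        the strong class L; the function name check_component is
        hypothetical.</t>
      <figure><artwork><![CDATA[
```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left bidi classes (assumed reading)
LTR_CLASSES = {"L"}        # strong left-to-right bidi class (assumed reading)


def check_component(component: str) -> list:
    """Return the numbers (1, 2) of the restrictions the component violates."""
    classes = [unicodedata.bidirectional(ch) for ch in component]
    has_rtl = any(c in RTL_CLASSES for c in classes)
    has_ltr = any(c in LTR_CLASSES for c in classes)
    violations = []
    # Restriction 1: do not mix right-to-left and left-to-right characters.
    if has_rtl and has_ltr:
        violations.append(1)
    # Restriction 2: a component with rtl characters starts and ends with one.
    if has_rtl and not (
        classes[0] in RTL_CLASSES and classes[-1] in RTL_CLASSES
    ):
        violations.append(2)
    return violations
```
]]></artwork></figure>
      <t>Under these assumptions, a component of two Hebrew letters followed
        by a digit violates only the second restriction, because digits are
        not strong left-to-right characters.</t>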
      <t>The above restrictions are given as "SHOULD"s, rather than as "MUST"s.
        For IRIs that are never presented visually, they are not relevant.
        However, for IRIs in general, they are very important to ensure
        consistent conversion between visual presentation and logical
        representation, in both directions.</t>
      <t><list style="hanging">
        <t hangText="Note:">In some components, the above restrictions may
          actually be strictly enforced. For example, <xref target="RFC3490"/>
          requires that these restrictions apply to the labels of a host name
          for those schemes where ireg-name is a host name. In some other
          components (for example, path components) following these restrictions
          may not be too difficult. For other components, such as parts of the
          query part, it may be very difficult to enforce the restrictions
          because the values of query parameters may be arbitrary character
          sequences.</t>
      </list></t>
      <t>If the above restrictions cannot be satisfied otherwise, the affected
        component can always be mapped to URI notation using the general
        percent-encoding of IRI components, as described in <xref
          target="RFC3987bis"/>. Please note that the whole component has to be
        mapped (see also Example 9 below).</t>
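      <t>As a rough illustration of mapping a whole component, the Python
        sketch below percent-encodes every character of a component that is
        not a URI unreserved character, using the UTF-8 encoding. The function
        name component_to_uri_notation is hypothetical, and passing an empty
        "safe" set is a simplification: a real implementation may need to
        leave some delimiters of the component unencoded.</t>
      <figure><artwork><![CDATA[
```python
from urllib.parse import quote


def component_to_uri_notation(component: str) -> str:
    """Percent-encode all characters outside the URI unreserved set."""
    # quote() encodes as UTF-8 by default; safe="" keeps no extra characters.
    return quote(component, safe="")


# Two Hebrew letters followed by a digit: the letters are percent-encoded,
# so the result no longer contains any right-to-left character.
encoded = component_to_uri_notation("\u05d0\u05d11")
print(encoded)
```
]]></artwork></figure>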
    </section>
    <!-- bidi-structure -->
    <section title="Input of Bidi IRIs" anchor="bidiInput">
      <t>Bidi input methods MUST generate Bidi IRIs in logical order while
        rendering them according to <xref target="visual"/>. During input,
        rendering SHOULD be updated after every new character is input to avoid
        end-user confusion.</t>
    </section>
    <!-- bidiInput -->
    <section title="Examples">
      <t>This section gives examples of Bidi IRIs in Bidi Notation. It shows
        legal IRIs with the relationship between their logical and visual
        representation and explains how certain phenomena in this relationship
        may look strange to somebody not familiar with bidirectional behavior,
        but familiar to users of Arabic and Hebrew. It also shows what happens
        if the restrictions given in <xref target="bidi-structure"/> are not
        followed. The examples below can be seen at <xref target="BidiEx"/>, in
        Arabic, Hebrew, and Bidi Notation variants.</t>
      <t>To read the bidi text in the examples, read the visual representation
        from left to right until you encounter a block of rtl text. Read the rtl
        block (including slashes and other special characters) from right to
        left, then continue at the next unread ltr character.</t>
      <t>Example 1: A single component with rtl characters is inverted:
        <vspace/>Logical representation:
        "http://ab.CDEFGH.ij/kl/mn/op.html"<vspace/>Visual representation:
        "http://ab.HGFEDC.ij/kl/mn/op.html"<vspace/>Components can be read one
        by one, and each component can be read in its natural direction.</t>
      <t>Example 2: More than one consecutive component with rtl characters is
        inverted as a whole: <vspace/>Logical representation:
        "http://ab.CDE.FGH/ij/kl/mn/op.html"<vspace/>Visual representation:
        "http://ab.HGF.EDC/ij/kl/mn/op.html"<vspace/> A sequence of rtl
        components is read rtl, in the same way as a sequence of rtl words is
        read rtl in a bidi text.</t>
      <t>Example 3: All components of an IRI (except for the scheme) are rtl.
        All rtl components are inverted overall: <vspace/>Logical
        representation: "http://AB.CD.EF/GH/IJ/KL?MN=OP;QR=ST#UV"<vspace/>Visual
        representation: "http://VU#TS=RQ;PO=NM?LK/JI/HG/FE.DC.BA"<vspace/> The
        whole IRI (except the scheme) is read rtl. Delimiters between rtl
        components stay between the respective components; delimiters between
        ltr and rtl components don't move.</t>
      <t>Example 4: Each of several sequences of rtl components is inverted on
        its own: <vspace/>Logical representation:
        "http://AB.CD.ef/gh/IJ/KL.html"<vspace/>Visual representation:
        "http://DC.BA.ef/gh/LK/JI.html"<vspace/> Each sequence of rtl components
        is read rtl, in the same way as each sequence of rtl words in an ltr
        text is read rtl.</t>
      <t>Example 5: Example 2, applied to components of different kinds:
        <vspace/>Logical representation: "http://ab.cd.EF/GH/ij/kl.html"
        <vspace/>Visual representation: "http://ab.cd.HG/FE/ij/kl.html"<vspace/>
        The inversion of the domain name label and the path component may be
        unexpected, but it is consistent with other bidi behavior. For
        reassurance that the domain component really is "ab.cd.EF", it may be
        helpful to read aloud the visual representation following the Unicode
        Bidirectional Algorithm. After "http://ab.cd." one reads the RTL block
        "E-F-slash-G-H", which corresponds to the logical representation. </t>
      <t>Example 6: Same as Example 5, with more rtl components:
        <vspace/>Logical representation:
        "http://ab.CD.EF/GH/IJ/kl.html"<vspace/>Visual representation:
        "http://ab.JI/HG/FE.DC/kl.html"<vspace/> The inversion of the domain
        name labels and the path components may be easier to identify because
        the delimiters also move.</t>
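      <t>The behavior in Examples 1 through 6 can be reproduced with a toy
        model that works directly on Bidi Notation. The Python sketch below is
        not an implementation of the Unicode Bidirectional Algorithm: it
        assumes a left-to-right embedding, knows only lower case letters,
        upper case letters, and delimiters, and refuses input with digits,
        which need the full algorithm. The function name visual_order is
        hypothetical.</t>
      <figure><artwork><![CDATA[
```python
def visual_order(logical: str) -> str:
    """Toy reordering of a digit-free Bidi Notation string (ltr embedding)."""
    if any(ch.isdigit() for ch in logical):
        raise ValueError("digits are not modelled by this sketch")
    chars = list(logical)
    n = len(chars)
    # Upper case stands for rtl, lower case for ltr, anything else is neutral.
    types = ["R" if c.isupper() else "L" if c.islower() else "N"
             for c in chars]
    # A run of neutrals becomes rtl only when it has rtl on both sides;
    # otherwise it takes the (left-to-right) embedding direction.
    i = 0
    while i < n:
        if types[i] != "N":
            i += 1
            continue
        j = i
        while j < n and types[j] == "N":
            j += 1
        before = types[i - 1] if i > 0 else "L"
        after = types[j] if j < n else "L"
        resolved = "R" if before == after == "R" else "L"
        for k in range(i, j):
            types[k] = resolved
        i = j
    # Reverse every maximal rtl run; ltr runs keep their order.
    out = []
    i = 0
    while i < n:
        j = i
        while j < n and types[j] == types[i]:
            j += 1
        run = chars[i:j]
        out.extend(reversed(run) if types[i] == "R" else run)
        i = j
    return "".join(out)


print(visual_order("http://ab.CD.EF/GH/IJ/kl.html"))
```
]]></artwork></figure>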
      <t>Example 7: A single rtl component includes digits: <vspace/>Logical
        representation: "http://ab.CDE123FGH.ij/kl/mn/op.html"<vspace/>Visual
        representation: "http://ab.HGF123EDC.ij/kl/mn/op.html"<vspace/> Numbers
        are written ltr in all cases but are treated as an additional embedding
        inside a run of rtl characters. This is completely consistent with usual
        bidirectional text.</t>
      <t>Example 8 (not allowed): Numbers are at the start or end of an rtl
        component:<vspace/>Logical representation:
        "http://ab.cd.ef/GH1/2IJ/KL.html"<vspace/>Visual representation:
        "http://ab.cd.ef/LK/JI1/2HG.html"<vspace/> The sequence "1/2" is
        interpreted by the Bidirectional Algorithm as a fraction, fragmenting the
        components and leading to confusion. There are other characters that are
        interpreted in a special way close to numbers; in particular, "+", "-",
        "#", "$", "%", ",", ".", and ":".</t>
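      <t>Example 9 (not allowed): The numbers in the previous example are
        percent-encoded: <vspace/>Logical representation:
        "http://ab.cd.ef/GH%31/%32IJ/KL.html"<vspace/>Visual representation:
        "http://ab.cd.ef/LK/JI%32/%31HG.html"<vspace/> Percent-encoding only
        the numbers does not help, because the digits of the percent-encoded
        form are still treated as numbers by the Bidirectional Algorithm. The
        whole component has to be percent-encoded instead.</t>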
      <t>Example 10 (allowed but not recommended): <vspace/>Logical
        representation: "http://ab.CDEFGH.123/kl/mn/op.html"<vspace/>Visual
        representation: "http://ab.123.HGFEDC/kl/mn/op.html"<vspace/> Components
        consisting of only numbers are allowed (it would be rather difficult to
        prohibit them), but these may interact with adjacent RTL components in
        ways that are not easy to predict.</t>
      <t>Example 11 (allowed but not recommended): <vspace/>Logical
        representation: "http://ab.CDEFGH.123ij/kl/mn/op.html"<vspace/>Visual
        representation: "http://ab.123.HGFEDCij/kl/mn/op.html"<vspace/>
        Components consisting of numbers and left-to-right characters are
        allowed, but these may interact with adjacent RTL components in ways
        that are not easy to predict.</t>
    </section>
    <!-- examples -->
    <section title="IANA Considerations" anchor="iana">
      <t>This document makes no changes to IANA registries.</t>
    </section>
    <!-- IANA -->
    <section title="Security Considerations" anchor="security">
      <t>Confusion can occur with bidirectional IRIs, if the restrictions in
        <xref target="bidi-structure"/> are not followed. The same visual
        representation may be interpreted as different logical representations,
        and vice versa. It is also very important that a correct Unicode
        bidirectional implementation be used.</t>
    </section>
    <!-- security -->
    <section title="Acknowledgements">
      <t>This document was derived from <xref target="RFC3987"/> and <xref
        target="RFC3987bis"/> and the acknowledgments of those documents
        apply.</t>
    </section>
    <!-- acknowledgements -->
  </middle>
  <back>
    <references title="Normative References">
      <reference anchor="RFC3987bis"
        target="http://tools.ietf.org/id/draft-ietf-iri-3987bis">
        <front>
          <title>Internationalized Resource Identifiers (IRIs)</title>
          <author initials="M." surname="Duerst"/>
          <author initials="L." surname="Masinter" fullname="Larry Masinter"/>
          <author initials="M." surname="Suignard"/>
          <date year="2011" month="August" day="14"/>
        </front>
      </reference>
      <reference anchor="ASCII">
        <front>
          <title>Coded Character Set -- 7-bit American Standard Code for
            Information Interchange</title>
          <author>
            <organization>American National Standards Institute</organization>
          </author>
          <date year="1986"/>
        </front>
        <seriesInfo name="ANSI" value="X3.4"/>
      </reference>
      <reference anchor="ISO10646">
        <front>
          <title>ISO/IEC 10646:2003: Information Technology - Universal
            Multiple-Octet Coded Character Set (UCS)</title>
          <author>
            <organization>International Organization for
              Standardization</organization>
          </author>
          <date month="December" year="2003"/>
        </front>
        <seriesInfo name="ISO" value="Standard 10646"/>
      </reference> &rfc2119; &rfc3490; &rfc3491; <reference anchor="UNIV6">
        <front>
          <title>The Unicode Standard, Version 6.0.0 (Mountain View, CA, The
            Unicode Consortium, 2011, ISBN 978-1-936213-01-6)</title>
          <author>
            <organization>The Unicode Consortium</organization>
          </author>
          <date year="2010" month="October"/>
        </front>
      </reference>
      <reference anchor="UNI9"
        target="http://www.unicode.org/reports/tr9/tr9-13.html">
        <front>
          <title>The Unicode Bidirectional Algorithm</title>
          <author initials="M." surname="Davis" fullname="Mark Davis">
            <organization/>
          </author>
          <date year="2004" month="March"/>
        </front>
        <seriesInfo name="Unicode Standard Annex" value="#9"/>
      </reference>
    </references>
    <references title="Informative References">
      <reference anchor="BidiEx"
        target="http://www.w3.org/International/iri-edit/BidiExamples">
        <front>
          <title>Examples of Bidi IRIs</title>
          <author>
            <organization/>
          </author>
          <date year="" month=""/>
        </front>
      </reference> &rfc3987; </references>
  </back>
</rfc>
