One document matched: draft-masinter-url-i18n-01.txt
Using UTF8 for non-ASCII Characters in Extended URIs
Status of this Memo
This document is an Internet-Draft. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas,
and its working groups. Note that other groups may also distribute
working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-Drafts
as reference material or to cite them other than as ``work in
progress.''
To learn the current status of any Internet-Draft, please check the
``1id-abstracts.txt'' listing contained in the Internet-Drafts
Shadow Directories on ftp.is.co.za (Africa), nic.nordu.net
(Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East
Coast), or ftp.isi.edu (US West Coast).
This document is not a product of any working group, but may
be discussed on the mailing list url-i18n@unicode.org.
Abstract
URIs are defined as sequences of characters chosen from a limited
subset of the repertoire of ASCII characters, both for transmission
in network protocols and representation in spoken and written human
communication.
This document defines a uniform way of representing non-ASCII
scripts in URIs and in an Extended URI, so these identifiers can be
used for the world's languages.
1. Introduction
URIs [RFC-URI-SYNTAX] are defined as sequences of characters chosen
from a limited subset of the repertoire of ASCII characters. The
characters in URIs are frequently used for representing English
words and phrases; unfortunately, this leaves out most of the
world, who do not write merely with the letters A-Z.
2. Syntax
This memo defines two ways of representing non-ASCII characters
within URIs:
1) Within traditional URIs: To be compatible with [RFC-URI-SYNTAX],
non-ASCII characters SHOULD be transcribed in URIs by first
representing the characters with the UTF-8 character encoding
[RFC 2044], and then using the hex-encoding defined in
[RFC-URI-SYNTAX] to encode any octet that does not correspond to
an allowed, non-reserved character.
2) Within a new object, an 8-bit URI (8URI): for a more compact
and natural representation, an 8URI consists of a sequence of
octets in the UTF-8 encoding; all characters are represented
directly by their UTF-8 encoding, except those disallowed in
[RFC-URI-SYNTAX] (reserved, delimiters, white space, unwise
special characters), which MUST be hex-encoded.
Any octet sequence which would likely yield ambiguous or incorrect
results when printed or displayed and then subsequently typed by a
user SHOULD be hex-encoded. (See [RFC-DUERST] for details.)
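The two forms above can be illustrated with a minimal sketch. It is
not a definitive implementation: the UNRESERVED set and the function
names to_uri and to_8uri are assumptions made for illustration, and
the exact set of allowed characters is the one given in
[RFC-URI-SYNTAX]. Each function encodes a single URI component, so
delimiters such as "/" are hex-encoded as well.

```python
# Assumed stand-in for the allowed, non-reserved characters of
# [RFC-URI-SYNTAX]; consult that document for the real set.
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-_.!~*'()"
)


def to_uri(text: str) -> str:
    """Form 1: UTF-8 encode, then hex-encode every octet not in UNRESERVED."""
    out = []
    for octet in text.encode("utf-8"):
        if octet in UNRESERVED:
            out.append(chr(octet))
        else:
            out.append("%{:02X}".format(octet))
    return "".join(out)


def to_8uri(text: str) -> bytes:
    """Form 2: keep non-ASCII UTF-8 octets; hex-encode disallowed ASCII."""
    out = bytearray()
    for octet in text.encode("utf-8"):
        if octet >= 0x80 or octet in UNRESERVED:
            out.append(octet)
        else:
            out.extend("%{:02X}".format(octet).encode("ascii"))
    return bytes(out)
```

Under these assumptions, the single character U+00E9 becomes the
six-character string "%C3%A9" in a traditional URI and the two octets
C3 A9 in an 8URI.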
3. Software Requirements
Supporting URIs for non-ASCII characters requires cooperation from
the providers of three different components of URI software:
3.1 Requirements for URI entry
One component of software that deals with URIs allows users to type
in the URIs. A human transcribes a visual representation of a URI
(as a sequence of glyphs, in some order, in some visual display)
using some entry method that will result in a URI.
If the visual representation contains only those characters that
are allowed by the [RFC-URI-SYNTAX] standard syntax of URIs, the
transcription is simple. However, for all other sequences of
characters, it is desirable that the entry results in characters,
in logical order from the ISO 10646 character repertoire, encoded
using the UTF-8 method [RFC 2044], and then subsequently encoded as
necessary using the URI hex-encoding (the set of octets that
require encoding depends on whether the result is a URI or an
8URI).
Care must be taken in the identification of the characters and
character sequence: all accented characters should be translated
into their combined form, no extraneous BIDI (bidirectional) marks
should be left in the resulting stream, and characters that
are intended to represent Western European letters should be
transcribed into their ISO-8859-1 equivalents and not, for example,
as double-wide characters. See [RFC-DUERST] for more complete
rules.
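The cleanup steps in the preceding paragraph might be sketched as
follows. This is an illustration under stated assumptions, not the
rules of [RFC-DUERST]: Unicode normalization form NFC is used as a
stand-in for "combined form", BIDI_MARKS is an assumed list of
directional formatting characters, and the names fold_fullwidth and
normalize_entry are hypothetical.

```python
import unicodedata

# Assumed list of bidirectional marks and embedding controls to drop.
BIDI_MARKS = {
    "\u200e", "\u200f",                                  # LRM, RLM
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",    # embeddings, overrides
}


def fold_fullwidth(ch: str) -> str:
    """Map a double-wide ASCII variant (U+FF01..U+FF5E) to its ASCII form."""
    cp = ord(ch)
    if 0xFF01 <= cp <= 0xFF5E:
        return chr(cp - 0xFEE0)
    return ch


def normalize_entry(text: str) -> str:
    """Drop BIDI marks, fold double-wide letters, then compose accents."""
    cleaned = "".join(fold_fullwidth(c) for c in text if c not in BIDI_MARKS)
    return unicodedata.normalize("NFC", cleaned)
```

The result of normalize_entry would then be encoded in UTF-8 and
hex-encoded as described in section 2.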
3.2 Requirements for URI generation and interpretation
Systems that are offering resources through the Internet, where
those resources have logical names, sometimes offer the ability to
generate URIs for the resources they offer. For example, some HTTP
servers offer the ability to generate a 'directory listing' for
file directories under their purview, and then to respond to the
generated URIs with the files. If the names of the files consist
solely of US-ASCII characters the transcription is simple, but
other file systems offer a wider variety of characters. For maximum
interoperability, the names in generated directory listings SHOULD
be represented in UTF-8, and the results hex-encoded as appropriate
for the URI or 8URI.
This requirement applies to HTTP servers, FTP servers, gopher
servers, and the like.
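As a hedged illustration of this requirement, the sketch below
generates traditional URIs for a directory listing. It assumes that
the file system stores names as octets in a known legacy character
set (ISO-8859-1 is only an example), and the function name
listing_uris and the host example.com are hypothetical. The standard
Python function urllib.parse.quote encodes to UTF-8 before
hex-encoding by default.

```python
from urllib.parse import quote


def listing_uris(base_uri, raw_names, fs_encoding="iso-8859-1"):
    """Build one hex-encoded UTF-8 URI per file name in a directory."""
    uris = []
    for raw in raw_names:
        # Recover the characters from the file system's own encoding.
        name = raw.decode(fs_encoding)
        # safe="" so that "/" inside a name is hex-encoded as well.
        uris.append(base_uri.rstrip("/") + "/" + quote(name, safe=""))
    return uris
```

A server following this approach would also need to map a requested
URI back to the stored name by reversing the same two steps.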
3.3 Requirements for display of URIs
Software that displays URIs to users (or any other kind of
transcription, e.g., deciding what to print in a magazine) should
follow a general principle: "Don't display a URI that the viewer
wouldn't be able to type!" The consequences of this principle
require judgement about the availability of software that
implements the character input method described in section 3.1.
a) In situations where most viewers would not have the capability
of typing non-ASCII characters, any octet not allowed in the
[RFC-URI-SYNTAX] definition of URIs SHOULD be displayed as if it
were hex-encoded.
b) In situations where the viewer is likely to have software for
non-ASCII character entry as described in section 3.1, sequences
of octets MAY be displayed directly as the non-ASCII character
sequence they represent in UTF-8. In addition, %HH-encoded
sequences that correspond to non-ASCII characters MAY be
displayed directly as those characters, or MAY be shown
unchanged in ASCII as hex-encoded UTF-8.
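Cases a) and b) could be sketched as a single display routine. The
function name display_form and its boolean parameter are assumptions
for illustration; deciding whether a viewer can type non-ASCII
characters is a matter of judgement that the code does not attempt.
Escapes for ASCII octets (such as %20) are left untouched, and input
that is not valid UTF-8 falls back to the ASCII-only form.

```python
import re

# Runs of %HH escapes whose octet value is 0x80 or above.
_HIGH_ESCAPES = re.compile(rb"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def display_form(uri: bytes, viewer_can_type_non_ascii: bool) -> str:
    """Return the string to show for a URI or 8URI given as octets."""
    if not viewer_can_type_non_ascii:
        # Case a): show every non-ASCII octet as if it were hex-encoded.
        return "".join(
            chr(b) if b < 0x80 else "%{:02X}".format(b) for b in uri
        )

    # Case b): turn escaped high octets back into raw octets ...
    def unescape(match):
        hex_digits = match.group(0).replace(b"%", b"")
        return bytes.fromhex(hex_digits.decode("ascii"))

    raw = _HIGH_ESCAPES.sub(unescape, uri)
    try:
        # ... and show the characters they represent in UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8 after all: keep the ASCII-only form.
        return display_form(uri, False)
```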
3.4 Requirements for interpretation of URIs
Software that interprets URIs as the names of local resources
SHOULD accept multiple renditions of the URIs in the case where
those resource names might have non-ASCII representations.
Just as allowing case-insensitive file names makes URIs more
robust, because the person viewing the URI might type the
case differently than it is displayed, similarly, URI-interpreting
software should be generous in allowing all of the possible
representations that might result from the recommendations in
section 3.1. In addition, it is useful if unaccented characters
are accepted, when possible, as aliases for accented characters,
and if other equivalences are made.
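One possible way to be generous in this sense is to compare names
through a folded key. The sketch below is only an illustration: the
names fold and resolve are hypothetical, and the particular
equivalences chosen (case folding and removal of combining accents)
are assumptions, since this memo does not enumerate them.

```python
import unicodedata
from urllib.parse import unquote


def fold(text: str) -> str:
    """Reduce a name to a key that ignores case and accents."""
    decomposed = unicodedata.normalize("NFD", text)
    unaccented = "".join(
        c for c in decomposed if not unicodedata.combining(c)
    )
    return unaccented.casefold()


def resolve(requested_segment: str, local_names):
    """Return the local names that the requested URI segment may denote."""
    # unquote decodes %HH sequences as UTF-8 by default.
    wanted = fold(unquote(requested_segment))
    return [name for name in local_names if fold(name) == wanted]
```

With this key, the precomposed form, the decomposed form, and the
unaccented form of a name all resolve to the same resource. When
several local names fold to the same key, the caller still has to
choose among them.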
Summary
These recommendations, when taken together, will allow for the
extension of URIs to handle scripts other than ASCII while minimizing
interoperability problems.
Acknowledgements
Many thanks to Martin Duerst and others for help with this draft.
References
[RFC 2044]
[RFC-URI-SYNTAX] draft-fielding-url-syntax
[RFC-DUERST] draft-duerst-url-???