One document matched: draft-duerst-query-i18n-00.txt
Internet Draft M. Duerst
<draft-duerst-query-i18n-00.txt> University of Zurich
Expires January 1998 July 1997
Handling Internationalized Query Components in URLs
Status of this Memo
This document is an Internet-Draft. Internet-Drafts are working doc-
uments of the Internet Engineering Task Force (IETF), its areas, and
its working groups. Note that other groups may also distribute work-
ing documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months. Internet-Drafts may be updated, replaced, or obsoleted by
other documents at any time. It is not appropriate to use Internet-
Drafts as reference material or to cite them other than as a "working
draft" or "work in progress".
To learn the current status of any Internet-Draft, please check the
1id-abstracts.txt listing contained in the Internet-Drafts Shadow
Directories on ds.internic.net (US East Coast), nic.nordu.net
(Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific
Rim).
Distribution of this document is unlimited. Please send comments to
the author at <mduerst@ifi.unizh.ch> or to the uri mailing list at
uri@bunyip.com. This document is currently a pre-draft, for
restricted discussion only. It is intended to become part of a suite
of documents related to the internationalization of URLs.
Abstract
HTTP and HTML provide the facility to query the user and return the
results. This is usually done in the query component of an URL. This
mechanisms works with full satisfaction for characters of the us-
ascii repertoire. Due to the lack of an agreed encoding for other
characters, the situation is much less satisfactory for characters
outside the us-ascii repertoire.
This document makes two contributions to the problem: (1) It
describes an application convention mostly already respected, and
sufficient in many cases. (2) It introduces an addition to HTTP to
ease the transition to a general internationalized URL architecture.
Expires End of January 1998 [Page 1]
Internet Draft Internationalized Query Components July 1997
Table of contents
1. Introduction ................................................... 2
1.1 General ......................................................2
1.2 Terms ........................................................3
2. A Simple Application Convention for Browsers ....................3
4. Upgrading of Query Component to UTF-8 ...........................4
3.1 The Query-UTF-8 Request/Response-Header Field ................4
3.2 Rationale ....................................................5
Bibliography .......................................................6
Author's Address ...................................................7
1. Introduction
1.1 General
HTTP (HyperText Transfer Protocol [HTTP1.1]) and HTML (HyperText
Markup Language [HTML4.0]) provide the facility to query the user
(with a FORM in HTML) and return the results to the server. There are
various ways to return the result (see in particular [Fileupload]),
but the one most widely used is to encode the result in the query
component of an URL [RFC1738, URLsyntax]. This mechanisms work with
full satisfaction for characters of the us-ascii repertoire. Due to
the lack of an agreed encoding for other characters, the situation is
much less satisfactory for characters outside the us-ascii reper-
toire.
Ideally, the problem would be solved by agreeing on a single charac-
ter encoding for all query parts or all URLs. The outstanding candi-
date for this is UTF-8 [RFC2044]. UTF-8 is already the preferred
encoding for new URL schemes [URLprocess], the only encoding for a
recently defined URL scheme [IMAPURL], the encoding on the wire for
beyond-ASCII FTP filenames [FTPINT] (thus making it the encoding for
the ftp: URL scheme) and the encoding suggested for the Internet in
general [RFC2130]. UTF-8 has various important properties, in par-
ticular that it is completely compatible with US-ASCII and is easily
detectable by simple heuristics.
Moving to UTF-8 for URLs is most difficult for the query component.
This is due to the fact that for the other components, in particular
for the path component, the namespace is very sparse and well known
to the server, while it is dense and not well known in the case of
the query part. To increase the reliability of transmitting query
information, this document describes an existing convention and
Expires End of January 1998 [Page 2]
Internet Draft Internationalized Query Components July 1997
proposes some new protocol element for HTTP.
1.2 Terms
This section contains definitions and explanations for some terms
that may otherwise not be clear.
- Accept-Charset attribute: An HTML attribute, proposed in [RFC2070]
and taken up in HTML 4.0 [HTML4.0]. Please note that this is not
the same as the Accept-Charset request-header field in HTTP.
Please also note that the Accept-Charset attribute is on INPUT and
TEXTAREA in RFC 2070, but on FORM in HTML 4.0. The HTML 4.0 syntax
is preferred, and assumed in this document.
- CGI Script: In the context of this document, a placeholder for any
kind of functional component used to process a response to a
query.
- Character Encoding: A mapping from an octet sequence to a sequence
of characters. Misleadingly called "character set" in some IETF
documents [RFC2045]. Denoted by the value of the "charset" pra-
mater, with values from the corresponding IANA registry [IANA].
- Transcoding Server/Proxy: A HTTP Server or Proxy which transcodes
the documents it serves, to respond to an "Accept-Charset" HTTP
request header field.
- Transcoding: The act of changing the character encoding of a docu-
ment, while not changing it otherwise (the length of the document
may be affected).
2. A Simple Application Convention for Browsers
This section spells out an application convention that is in use in
most current and older browsers, although it is not followed, or not
completely followded, by all browsers, and that can be implemented
easily.
The convention is that a user agent should send back the results of a
query in exactly the same character encoding as the character encod-
ing of the document that contained the FORM, as received by the user
agent.
Expires End of January 1998 [Page 3]
Internet Draft Internationalized Query Components July 1997
The advantage of this application convention is that it works nicely
for documents and CGI scripts that are assuming a single character
encoding. In the plain case, neither the server nor the CGI script
have to do any special processing such as trying to detect the char-
acter encoding of the query component or transcode the query compo-
nent.
This application convention fails if the document has been transcoded
by a transcoding proxy. The query compontent is sent back in the
character encoding requested by the user agent, which is the target
character encoding of the transcoding undergone at the proxy. The
query component sent back to the server, however, must not be changed
by the proxy (see [HTTP1.1]).
3. Upgrading of Query Component to UTF-8
For those parts of an URL that originate at the server, in particular
for the path component, the introduction of UTF-8 [RFC2044] as the
encoding of choice can be made on a per-server or per-resource base.
Because the name space of the path component is usually very sparsely
populated, it is even possible to accept URLs with path components in
different character encodings for the same resource.
The query component of an URL, however, is in most cases generated
independently in the user agent, and the namespace can be very
densely populated. To upgrade it to UTF-8 therefore requires addi-
tional provisions.
Here, we propose to add a single header field to HTTP. The header
field is used both as a request header field and as a response header
field.
3.1 The Query-UTF-8 Request/Response-Header Field
The syntax of the QUERY-UTF-8 request/response-header field is
defined as follows:
query-utf-8 = "Query-UTF-8" ":" ( "Yes" | "No" )
Both "Yes" and "No" above are case insensitive. I.e. "Yes" as well as
"yes" or "yES", and so on, are acceptable.
As a response-header field (sent from the server to the client), the
field indicates whether the user agent can send back the query compo-
nent encoded as UTF-8 or not. If the value is "Yes", and the scheme
component and site component of the URL of the document containing
Expires End of January 1998 [Page 4]
Internet Draft Internationalized Query Components July 1997
the FORM and the URL given for query submission are identical, the
query component SHOULD be sent back encoded as UTF-8. If the value
is "No", and the FORM does not have an Accept-charset attribute that
contains the "charset" parameter value "UTF-8", then the query compo-
nent MUST be sent back according to the application convention
described in Section 3, or in some other way by older browsers.
As a request-header field (sent from the client to the server; the
term request-header field is somewhat misleading here), the field
indicates whether the query component is encoded as UTF-8. A Query-
UTF-8 request-header field MUST be sent back when the following con-
ditions are all met:
- The URL sent back contains a query compontent
- The document containing the FORM is received with a Query-UTF-8
response-header field with value "Yes" or the Accept-Charset
attribute of the FORM contains the charset parameter value of
"UTF-8".
- The client recognizes the corresponding syntax. (The intention of
the last sentence is to be able to phase out Query-UTF-8 after a
transitory period.)
3.2 Rationale
The availability of both the Accept-charset attribute on FORM and the
Query-UTF-8 response-header field may seem unnecessary. The rationale
for this is to allow two modes of operation, called server-driven and
script-driven.
In script-driven mode, the CGI script handles character encoding
negotiation and identification. Typically, the author of a FORM docu-
ment and the corresponding CGI script will use the Accept-charset
attribute on FORM with the value "UTF-8" to tell the client to send
back data in UTF-8. It will then check for the presence and value of
the Query-UTF-8 request-header field in the response from the client,
and make conversions if necessary.
In server-driven mode, the character encoding that a CGI scripts
expects to receive is registered with the server in a similar way as
the character encodings of documents (including those generated by
CGI scripts) are registered. A server offering such a functionality
adds the Query-UTF-8 response-header field with value "Yes" to outgo-
ing documents containing FORMs, and converts from UTF-8 back to the
encoding the CGI script is expecting when a query arrives with
Expires End of January 1998 [Page 5]
Internet Draft Internationalized Query Components July 1997
"Query-UTF-8: Yes".
The distinction between script-driven and server-driven mode is not
made based on whether Query-UTF-8 or the Accept-Charset attribute are
used. Both features are provided because it is easier for a document
author to use Accept-Charset, and easier for a server to add Query-
UTF-8. Also, because a server does not know about the facilities
available on other servers, "Query-UTF-8: Yes" sent from the server
to the client is only valid if the query result is sent back to the
same server. For query results sent to other servers, the Accept-
Charset attribute must be used.
Acknowledgements
I am grateful in particular to the following persons for their help
and/or criticism: Roy Fielding, Eric van der Poel, Francois Yergeau,
Gavin Nicol, Frank Tang, Larry Masenter, and Tim Greenwood.
Bibliography
[Fileupload] E. Nebel and L. Masinter, "Form-based File Upload in
HTML", draft-ietf-html-fileupload-03.txt, August 1995.
[FTPINT] B. Curtin, "Internationalization of the File Transfer
Protocol", draft-ietf-ftpext-intl-ftp-02.txt, June
1997.
[HTTP1.1] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, and T.
Berners-Lee, "Hypertext Transfer Protocol --
HTTP/1.1", RFC 2068, January 1997.
[HTML4.0] D. Raggett, A. Le Hors, and I. Jacobs, "HTML 4.0 Spec-
ification", http://www.w3.org/TR/WD-html40/, July
1997.
[IMAPURL] Ch. Newman, "IMAP URL Scheme", draft-newman-url-
imap-10.txt, July 1997.
[RFC1738] T. Berners-Lee, L. Masinter, and M. McCahill, "Uniform
Resource Locators (URL)", CERN, Dec. 1994.
[RFC2044] F. Yergeau, "UTF-8, A Transformation Format of Unicode
and ISO 10646", Alis Technologies, October 1996.
Expires End of January 1998 [Page 6]
Internet Draft Internationalized Query Components July 1997
[RFC2045] N. Freed, N. Borenstein, "Multipurpose Internet Mail
Extensions (MIME) Part One: Format of Internet Message
Bodies", November 1996.
[RFC2070] F. Yergeau, G. Nicol, G. Adams, and M. Duerst, "Inter-
nationalization of the Hypertext Markup Language", RFC
2070, January 1997 (Note: This RFC is currently being
updated to reference Unicode 2.0 and ISO 10646 includ-
ing AM-5. The new definition of UTF-8 should be used).
[RFC2130] C. Weider C. Preston, K. Simonsen, H. Alvestrand, R.
Atkinson, M. Crispin, P. Svanberg, "The Report of the
IAB Character Set Workshop held 29 February - 1 March,
1996", April 1997.
[URLprocess] L. Masinter, D. Zigmond and H. Alvestrand, "Guidelines
and Process for new URL Schemes", draft-masinter-url-
process-01.txt, March 1997.
[URLsyntax] T. Berners-Lee, R. Fielding, L. Masinter, "Uniform
Resource Locators (URL): Generic Syntax and Seman-
tics", draft-fielding-url-syntax-05.txt, May 1997.
Author's Address
Martin J. Duerst
Multimedia-Laboratory
Department of Computer Science
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich
Switzerland
Tel: +41 1 257 43 16
Fax: +41 1 363 00 35
E-mail: mduerst@ifi.unizh.ch
NOTE -- Please write the author's name with u-Umlaut wherever
possible, e.g. in HTML as Dürst.
Expires End of January 1998 [Page 7]
| PAFTECH AB 2003-2026 | 2026-04-24 07:20:04 |