One document matched: draft-kwiatkowski-base85-for-xml-00.txt
INTERNET-DRAFT
draft-kwiatkowski-base85-for-xml-00.txt P. Kwiatkowski
Category: Experimental
Expires: March 2003 September 2002
A Base-85 Encoding Suitable for XML
Status of this Memo
This document is an Internet-Draft and is subject to all provisions
of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/1id-abstracts.html
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
Abstract
This memo proposes a base85 text encoding for arbitrary binary data
that is suitable for use in XML documents. This encoding requires
approximately 15/16 of the space of the MIME Base64 encoding that is
currently supported as a primitive datatype in the XML Schema
definition language. In a UTF-8 encoded XML entity, Base85 therefore
has 3/4 of the overhead of Base64.
Kwiatkowski Expires - March 2003 [Page 1]
A Base-85 Encoding Suitable for XML September 2002
Table of Contents
1. Introduction...................................................2
2. Basic Encoding.................................................3
2.1 Digits.....................................................3
2.2 Mapping....................................................4
3. Additional Features............................................5
3.1 Padding....................................................5
3.2 Zero-Compression...........................................6
4. Detailed Example...............................................6
4.1 Encoding...................................................6
4.2 Decoding...................................................8
5. Comparison with Base64.........................................9
Security Considerations...........................................9
Normative References..............................................9
Informative References...........................................10
Acknowledgments..................................................10
Author's Address.................................................10
1. Introduction
The XML Schema definition language includes "base64Binary" as a
primitive datatype for representing arbitrary binary information as
text. The data is encoded using MIME's Base64 Content-Transfer-
Encoding [MIME]. MIME uses a 65-character subset of US-ASCII to
fulfill the portability requirements of a mail encoding, but since
XML documents must support all Unicode characters, there is no reason
to limit the choice of characters so strictly.
Base-85 encodings are a well-understood technique to encode 4 octets
of arbitrary data in 5 printable characters, using an alphabet of
only 85 distinct characters. Examples include PostScript's
ASCII85Encode Filter and the btoa/atob utilities. However, these
indiscriminately use a contiguous range of printable characters.
Since certain characters must be escaped in XML content, a non-
contiguous set of characters must be used to represent the 85
"digits" needed for the encoding, but these can easily be mapped to
and from the numbers 0-84 in constant time. A similar approach is
described in RFC 1924 [Elz], but it must be emphasized that while
that document was an "April 1" satire, the present memo is a serious
proposal.
MIME's Base64 also requires that the number of characters in the
encoded text must always be a multiple of 4, and uses a special
padding character if necessary. There is no analogous requirement in
the present proposal, although an optional padding character is
supported.
Kwiatkowski Expires - March 2003 [Page 2]
A Base-85 Encoding Suitable for XML September 2002
2. Basic Encoding
The two main design decisions for a binary-as-text encoding are the
subset of printable characters used and the mapping between the
binary octets and characters.
2.1 Digits
The 85 characters chosen in this proposal to serve as "digits" in
the encoded string are as follows:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg
hijklmnopqrstuvwxy!#$()*+,-./:;=?@^`{|}~z_
This set, and certain aspects of the above ordering, have been
selected carefully. Of the 95 printable ASCII characters
(including the space character), all but 10 must be used. The
alphanumeric characters are obvious candidates for inclusion. The
space character is excluded so that Base85-encoded values will be
single tokens.
The XML standard places further constraints on the choice of
characters [XML]. The ampersand character (&) and the left angle
bracket (<) are excluded since these must be escaped in XML
attribute values. The right angle bracket (>) must be escaped
when it appears in the string "]]>", so it is safest to exclude
that character and therefore guarantee that Base85-encoded data
can be used in any context. The single-quote character (') must
be escaped in attribute values delimited by single-quotes, and the
double-quote character (") must be escaped in attribute values
delimited by double-quotes. To avoid constraining the choice of
quote character for an attribute value containing a Base85-encoded
string, both of these characters are excluded. Finally, since
parameter entity references can appear in general entity value
literals, and use a percent-sign (%) as a leading delimiter, this
character is also excluded.
This leaves the following 26 printable ASCII characters, 3 more of
which can be excluded:
!#$()*+,-./:;=?@[\]^_`{|}~
The backslash (\) is a sensible choice for exclusion since it is
often used as an escape character. Finally, the left square
bracket ([) and right square bracket (]) can serve as useful
delimiters for base85-encoded data in the midst of text strings,
and are therefore excluded from the set.
Kwiatkowski Expires - March 2003 [Page 3]
A Base-85 Encoding Suitable for XML September 2002
This alphabet of 85 characters is mapped to the numbers 0-84 in
the order shown above. Note that the characters for the decimal
digits and the upper-case letters from A to F are deliberately
mapped to the numbers they represent in hexadecimal.
2.2 Mapping
As with other base-85 schemes, octets in the binary stream are
divided into groups ("quanta") of four. Each quantum is then
converted to a 32-bit unsigned integer, which is in turn
repeatedly divided by 85. The remainders form the digits of the
Base85 "number", and these digits in the range 0-84 are mapped to
the alphabet of characters defined above.
Two interesting, and independent, design decisions are the order
in which the remainders are listed in the encoded string, and how
the octets are converted to a single integer. While most encoded
strings will not be human-readable, very small 32-bit integers
will result in recognizable strings if a big-endian format is used
for the remainders. However, these 32-bit integers will only have
values of interest if a) the binary data happens to represent 32-
bit integers, and b) the system architecture's endianness matches
that chosen for the octet-to-integer conversion.
A big-endian format therefore seems sensible for the remainders,
given that XML is intended to be human-readable when possible. As
for the octet-to-integer conversion, Network Byte Ordering (big-
endian) is the natural choice for an XML encoding. Since this is
only to improve readability in rare cases, there is no good reason
to parameterize an encoding with its endianness.
The following example shows 8 octets of data (in hex notation) and
the corresponding 10 characters in the Base85-encoded string:
[00, 00, 00, 01, 00, 00, 00, 0F] => "000010000F"
If the number of octets to be converted is not an integral
multiple of 4, the trailing 1, 2 or 3 bytes are converted using
the same rules to 2, 3 or 4 characters respectively:
[00, 00, 00, 01, 00, 00, 0F] => "00001000F"
[00, 00, 00, 01, 00, 0F] => "0000100F"
[00, 00, 00, 01, 0F] => "000010F"
When a Base85-encoded string is to be converted back into binary,
it is assumed that the number of characters to be converted is
known (MIME's Base64 makes this assumption, too, as the padding
character is not always present). The conversion of the last
quantum is handled as a special case.
Kwiatkowski Expires - March 2003 [Page 4]
A Base-85 Encoding Suitable for XML September 2002
3. Additional Features
Base-85 encodings leverage the fact that 85^5 > 2^32. Actually, 83 *
85^4 and 84^2 * 85^3 are also larger than 2^32, so it is possible to
reserve certain values for one or two Base85 digits and use those to
convey additional information.
3.1 Padding
Some applications might prefer Base85-encoded strings to be padded
to a given fixed length. A character outside the Base85 alphabet
could be used, but the application would then be responsible for
trimming the extra characters before passing the string to a
standard Base85 converter.
It is possible to instead use one of the 85 characters as a
padding character by disallowing it as the least significant digit
in an encoded quantum. So, the last digit is effectively in "base
84" and the encoding/decoding algorithms must divide/multiply by
84 once instead of 85. The padding character is therefore the
character that maps to the value 84; the underscore (_) is chosen
for this purpose.
The following examples illustrate the range of possible encoded
strings:
"00000" => [00, 00, 00, 00]
"zL@33" => [FF, FF, FF, FF]
"0000_" => [00, 00, 00]
"Rs$$_" => [FF, FF, FF]
"000__" => [00, 00]
"9FF__" => [FF, FF]
"00___" => [00]
"33___" => [FF]
Any number of underscores can be appended to an encoded string
without altering its value. To remove the padding, the decoder
simply has to strip trailing underscores from the string.
Optional padding implies that a round-trip conversion from text to
binary and back again might not yield the original text. However,
as long a client that cares about round-trip preservation passes
consistent desired lengths to the encoder, the encoded text will
always be the same. An encoding with no padding can be considered
the canonical form of a Base85-encoded string.
Kwiatkowski Expires - March 2003 [Page 5]
A Base-85 Encoding Suitable for XML September 2002
3.2 Zero-Compression
The public-domain btoa/atob utilities use the 'z' character to
represent the special case of a quantum of 4 zero bytes. Base85
can use the same trick, without having to reserve an 86th
character, by disallowing 'z' as the most significant digit in an
encoded quantum. This only affects the encoding/decoding of 5-
character quanta, in which the most significant digit can have a
maximum value of 83. The 'z' character is deliberately mapped to
the value 83 in the Base85 alphabet. While the underscore
normally represents the value 84, if it is used as the most
significant digit of a 5-character quantum, it is mapped to 83
instead.
If a decoder encounters a 'z' character at the beginning of a
quantum, that character is interpreted as an entire quantum of 4
zero bytes. To ensure that round-trip conversions yield the same
result, the quantum "00000" is considered an encoding violation.
The following examples incorporate this modification:
"00000" => encoding violation
"z" => [00, 00, 00, 00]
"zL@33" => [00, 00, 00, 00, CA, C1, 73]
"_L@33" => [FF, FF, FF, FF]
"0000_" => [00, 00, 00]
"000__" => [00, 00]
"00___" => [00]
"zz00_" => [00, 00, 00, 00, 00, 00, 00, 00, 00]
"_00zz" => [FF, 35, 5A, 1B]
4. Detailed Example
This section walks through the encoding and decoding of a sample
octet sequence in more detail.
4.1 Encoding
Beginning with the octet sequence:
[FF, 3E, 79, 5F, 00, 00, 00, 00, 3C, C3]
we isolate the first quantum of four octets and convert them to a
single unsigned integer:
FF3E795F hex = 4282284383 decimal
Kwiatkowski Expires - March 2003 [Page 6]
A Base-85 Encoding Suitable for XML September 2002
To obtain the least significant encoded digit, we divide by 84 and
note the remainder:
4282284383 / 84 = 50979575 rem 83
To obtain the remaining digits, from least to most significant, we
repeatedly divide by 85 and note the remainders:
50979575 / 85 = 599759 rem 60
599759 / 85 = 7055 rem 84
7055 / 85 = 83 rem 0
83 / 85 = 0 rem 83
(The last division is clearly unnecessary in practice.)
This yields the following digits (shown in decimal):
(83, 0, 84, 60, 83)
These values are then used as indices into the Base85 alphabet,
observing the exception that a leading 83 is represented with an
underscore, rather than the 'z' character. We obtain the string:
"_0_yz"
Proceeding to the next quantum, we find four zero bytes, which we
represent with a single 'z' character. Our encoded string is now:
"_0_yzz"
For the remaining quantum of 2 octets, we use the same process as
for the first quantum, but generate 3 encoded digits instead of 5:
3CC3 hex = 15555 decimal
15555 / 84 = 185 rem 15
185 / 85 = 2 rem 15
2 / 85 = 0 rem 2
These digits map straightforwardly to the string "2FF", yielding
the final encoding string:
"_0_yzz2FF"
This can be padded with an arbitrary number of underscores. In
this example, we'll assume the client wishes to pad it to a total
of 16 characters:
"_0_yzz2FF_______"
Kwiatkowski Expires - March 2003 [Page 7]
A Base-85 Encoding Suitable for XML September 2002
4.2 Decoding
To decode the above string, we begin by stripping off all trailing
underscore characters:
"_0_yzz2FF"
We then inspect the first character to see if it is a 'z'. Since
it is not, we isolate the first quantum of 5 characters and map
them to decimal values based on their position in the Base85
alphabet. Since this is a quantum of size 5, we observe the
exception that a leading underscore maps to 83 instead of the
usual 84 (note that leading underscores will only arise with
quanta of size 5):
(83, 0, 84, 60, 83)
We now convert this to a single integer by following a process
that corresponds to the repeated division above:
((((83 * 85) + 0) * 85 + 84) * 85 + 60) * 84 + 83
= 4282284383 decimal
= FF3E795F hex
Performing a big-endian conversion to bytes, we have the initial
octets:
[FF, 3E, 79, 5F]
Examining the remainder of the string:
"z2FF"
we note that the first character is now a 'z'. This is
immediately consumed as a quantum and 4 zero bytes are appended to
the decoded result:
[FF, 3E, 79, 5F, 00, 00, 00, 00]
The remainder of the string is a single quantum comprising 3
characters, since it does not begin with a 'z':
"2FF"
We map these characters to the decimal digits (2, 15, 15) and once
again multiply and add repeatedly to obtain a single integer:
Kwiatkowski Expires - March 2003 [Page 8]
A Base-85 Encoding Suitable for XML September 2002
((2 * 85) + 15) * 84 + 15
= 15555 decimal
= 3CC3 hex
Applying a big-endian conversion and appending the resulting
octets to the decoded data, we get the final result:
[FF, 3E, 79, 5F, 00, 00, 00, 00, 3C, C3]
5. Comparison with Base64
In a UTF-8 encoded XML entity, Base85-encoded data will be about 25
percent larger than the unencoded data. In other character
encodings, such as UTF-16, Base85 may be far from optimal, but it
will still compare favorably with Base64.
Asymptotically, Base85 encoded data requires 15/16 (93.75%) of the
storage needed by Base64. For long runs of zero bytes, this figure
shrinks to 3/16 (18.75%). The comparison is slightly more favorable
for shorter blocks of data. Ignoring zero-compression, data blocks
of 1 to 32 bytes in length will require, on average, 89.84% of the
space needed by Base64 (this is largely due to the padding
requirement in Base64). In particular, 128-bit numbers can be
represented with just 20 characters in Base85, compared to 24 in
Base64, so the ratio is 83.33% for that common case.
Put another way, Base85 asymptotically has only 3/4 of the overhead
of Base64 (in a UTF-8 encoded entity). For a 128-bit number, it has
1/2 the overhead.
Security Considerations
Since binary data is typically not human-readable, Base85 encoding
will not enhance the security of a system in any significant way. It
will not have any detrimental effect, either.
Normative References
[XML] Extensible Markup Language (XML) 1.0 (Second Edition),
T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler (Eds.)
http://www.w3.org/TR/2000/REC-xml-20001006, 6 October 2000.
Kwiatkowski Expires - March 2003 [Page 9]
A Base-85 Encoding Suitable for XML September 2002
Informative References
[MIME] Multipurpose Internet Mail Extensions (MIME) Part One:
Format of Internet Message Bodies, N. Freed, N. Borenstein,
RFC 2045, November 1996.
[Elz] A Compact Representation of IPv6 Addresses, R. Elz,
RFC 1924, 1 April 1996
(Once again, the present memo is a serious proposal,
despite the fact that it references an "April 1" RFC.)
Acknowledgments
I wish to thank Robert Elz for promptly answering my questions
regarding his memo.
Author's Address
Paul Kwiatkowski
PMB 785
15600 NE 8th Street, Suite B1
Bellevue, WA 98008
USA
Email: paulkw@paulkw.com
Kwiatkowski Expires - March 2003 [Page 10]
| PAFTECH AB 2003-2026 | 2026-04-24 19:54:56 |