One document matched: draft-kwiatkowski-base85-for-xml-02.txt

Differences from draft-kwiatkowski-base85-for-xml-01.txt



INTERNET-DRAFT
draft-kwiatkowski-base85-for-xml-02.txt                   P. Kwiatkowski
Category: Experimental
Expires:  November 2004                                         May 2004


                   A Base-85 Encoding Suitable for XML


Status of this Memo

   This document is an Internet-Draft and is subject to all provisions
   of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html


Abstract

   This memo proposes a base-85 text encoding for arbitrary binary data
   that is suitable for use in XML documents.  This encoding requires
   approximately 15/16 of the space of the MIME Base64 encoding that is
   currently supported as a primitive datatype in the XML Schema
   definition language.  In a UTF-8 encoded XML entity, Base85 therefore
   has 3/4 of the overhead of Base64.













Kwiatkowski               Expires - March 2004                  [Page 1]

                  A Base-85 Encoding Suitable for XML      November 2004


Table of Contents

   1. Introduction...................................................2
   2. Basic Encoding.................................................3
      2.1 Digits.....................................................3
      2.2 Mapping....................................................4
   3. Additional Features............................................5
      3.1 Padding....................................................5
      3.2 Zero-Compression...........................................6
   4. Detailed Example...............................................7
      4.1 Encoding...................................................7
      4.2 Decoding...................................................8
   5. Implementation Specifics......................................10
   6. Comparison with Base64........................................11
   Security Considerations..........................................11
   Normative References.............................................12
   Informative References...........................................12
   Acknowledgments..................................................12
   Author's Address.................................................12
   Intellectual Property Rights.....................................12
   Full Copyright Notice............................................13


1. Introduction

   The XML Schema definition language includes "base64Binary" as a
   primitive datatype for representing arbitrary binary information as
   text.  The data is encoded using MIME's Base64 Content-Transfer-
   Encoding [MIME].  MIME uses a 65-character subset of US-ASCII to
   fulfill the portability requirements of a mail encoding, but since
   XML documents must support all Unicode characters, there is no reason
   to limit the choice of characters so strictly in that context.

   Base-85 encodings are a well-understood technique to encode 4 octets
   of arbitrary data in 5 printable characters, using an alphabet of
   only 85 distinct characters.  Examples include PostScript's
   ASCII85Encode Filter and the btoa/atob utilities.  However, these
   both indiscriminately use a contiguous range of printable characters.

   Since certain characters must be escaped in XML content, a non-
   contiguous set of characters must be used to represent the 85
   "digits" needed for the encoding, but these can easily be mapped to
   and from the numbers 0-84 in constant time.  A similar approach is
   described in RFC 1924 [RFC1924], but it must be emphasized that while
   that document was an "April 1" satire, the present memo is a serious
   proposal.





Kwiatkowski               Expires - March 2004                  [Page 2]

                  A Base-85 Encoding Suitable for XML      November 2004


   MIME's Base64 also requires that the number of characters in the
   encoded text must always be a multiple of 4, and uses a special
   padding character if necessary.  There is no analogous requirement in
   the present proposal, although an optional padding character is
   supported.

   The key words "MUST" and "MUST NOT" in this document are to be
   interpreted as described in RFC 2119 [RFC2119].


2. Basic Encoding

   The two main design decisions for a binary-as-text encoding are the
   subset of printable characters used and the mapping between the
   binary octets and characters.

   2.1 Digits

      The 85 characters chosen in this proposal to serve as "digits" in
      the encoded string are as follows:

         0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefg
         hijklmnopqrstuvwxy!#$()*+,-./:;=?@^`{|}~z_

      This set, and certain aspects of the above ordering, have been
      selected carefully.  Of the 95 printable ASCII characters
      (including the space character), all but 10 must be used.  The
      alphanumeric characters are obvious candidates for inclusion, so
      10 of the remaining 33 printable characters must be excluded.

      The XML standard places strict constraints on the choice of
      characters [XML].  The ampersand character (&) and the left angle
      bracket (<) are excluded since these must be escaped in XML
      attribute values.  The right angle bracket (>) must be escaped
      when it appears in the string "]]>", so it is safest to exclude
      that character and therefore guarantee that Base85-encoded data
      can be used in any context.  The single-quote character (') must
      be escaped in attribute values delimited by single-quotes, and the
      double-quote character (") must be escaped in attribute values
      delimited by double-quotes.  To avoid constraining the choice of
      quote character for an attribute value containing a Base85-encoded
      string, both of these characters are excluded.  The space
      character is excluded, since sequences of space characters are
      replaced by a single space character during attribute-value
      normalization.  Finally, since parameter entity references can
      appear in general entity value literals, and use a percent-sign
      (%) as a leading delimiter, this character is also excluded.




Kwiatkowski               Expires - March 2004                  [Page 3]

                  A Base-85 Encoding Suitable for XML      November 2004


      Wrapping Base85-encoded data in a CDATA section would allow all of
      the above characters (and forbid use of the right square bracket).
      However requiring a CDATA wrapper around all Base85-encoded data
      is both undesirable and unnecessary.

      This leaves the following 26 printable ASCII characters, 3 more of
      which can be excluded:

         !#$()*+,-./:;=?@[\]^_`{|}~

      The backslash (\) is a sensible choice for exclusion since it is
      often used as an escape character.  Finally, the left square
      bracket ([) and right square bracket (]) can serve as useful
      delimiters for Base85-encoded data in the midst of text strings,
      and are therefore excluded from the set.

      This alphabet of 85 characters is mapped to the numbers 0-84 in
      the order shown above.  Note that the characters for the decimal
      digits and the upper-case letters from A to F are deliberately
      mapped to the numbers they represent in hexadecimal.

   2.2 Mapping

      As with other base-85 schemes, octets in the binary stream are
      divided into groups ("quanta") of four.  Each quantum is then
      converted to a 32-bit unsigned integer, which is in turn
      repeatedly divided by 85.  The remainders form the digits of the
      Base85 "number", and these digits in the range 0-84 are mapped to
      the alphabet of characters defined above.

      Two interesting, and independent, design decisions are the order
      in which the remainders are listed in the encoded string, and how
      the octets are converted to a single integer.  While most encoded
      strings will not be human-readable, 32-bit integers with values
      less than 16 will result in recognizable strings if a big-endian
      format is used for the remainders.  However, these 32-bit integers
      will only have values of interest if a) the binary data happens to
      represent 32-bit integers, and b) the system architecture's
      endianness matches that chosen for the octet-to-integer
      conversion.

      A big-endian format therefore seems sensible for the remainders,
      given that XML is intended to be human-readable when possible.  As
      for the octet-to-integer conversion, Network Byte Ordering (big-
      endian) is the natural choice for an XML encoding.  Since this is
      only to improve readability in rare cases, there is no good reason
      to parameterize an encoding with its endianness.




Kwiatkowski               Expires - March 2004                  [Page 4]

                  A Base-85 Encoding Suitable for XML      November 2004


      The following example shows 8 octets of data (in hex notation) and
      the corresponding 10 characters in the Base85-encoded string:

         [00, 00, 00, 01, 00, 00, 00, 0F] => "000010000F"

      If the number of octets to be encoded is not an integral multiple
      of 4, the trailing 3, 2 or 1 bytes are encoded using the same
      rules to 4, 3 or 2 characters respectively:

         [00, 00, 00, 01, 00, 00, 0F]     => "00001000F"
         [00, 00, 00, 01, 00, 0F]         => "0000100F"
         [00, 00, 00, 01, 0F]             => "000010F"

      When a Base85-encoded string is to be decoded back into binary
      octets, it is assumed that the number of characters to be decoded
      is known (MIME's Base64 makes this assumption, too, as the padding
      character is not always present).  The conversion of the last
      quantum is handled as a special case: if it is 5, 4, 3 or 2
      characters in length, it is known to represent 4, 3, 2 or 1
      octets, respectively.

      This encoding scheme does not support bit strings of arbitrary
      length.  The bits to be encoded must comprise an integral number
      of octets.


3. Additional Features

   Base-85 encodings leverage the fact that 85^5 > 2^32.  Actually, 83 *
   85^4 and 84^2 * 85^3 are also larger than 2^32, so it is possible to
   reserve certain values for one or two Base85 digits and use those to
   convey additional information.

   3.1 Padding

      Some applications might prefer Base85-encoded strings to be padded
      to a given fixed length.  A character outside the Base85 alphabet
      could be used, but the application would then be responsible for
      trimming the extra characters before passing the string to a
      standard Base85 decoder.

      It is possible to instead use one of the 85 characters as a
      padding character by disallowing it as the least significant digit
      in an encoded quantum.  So, the last digit is effectively in
      "base-84" and the encoding/decoding algorithms must
      divide/multiply by 84 once instead of 85.  The padding character
      is therefore the character that maps to the value 84; the
      underscore (_) is chosen for this purpose.



Kwiatkowski               Expires - March 2004                  [Page 5]

                  A Base-85 Encoding Suitable for XML      November 2004


      (The choice of the underscore character deliberately contrasts
      with the use of the equals (=) character as padding in Base32 and
      Base64, to emphasize the different role of padding characters in
      Base85.)

      The following examples illustrate the range of possible encoded
      strings:

         "00000" => [00, 00, 00, 00] (see Section 3.2)
         "zL@33" => [FF, FF, FF, FF] (see Section 3.2)
         "0000_" => [00, 00, 00]
         "Rs$$_" => [FF, FF, FF]
         "000__" => [00, 00]
         "9FF__" => [FF, FF]
         "00___" => [00]
         "33___" => [FF]

      Any number of underscores can be appended to an encoded string
      without altering its value.  To remove the padding, the decoder
      simply has to strip trailing underscores from the string.

      Optional padding implies that a round-trip conversion from text to
      binary and back again might not yield the original text.  However,
      as long as a client that cares about round-trip preservation
      passes consistent desired lengths to the encoder, the encoded text
      will always be the same.  An encoding with no padding can be
      considered the canonical form of a Base85-encoded string.

   3.2 Zero-Compression

      The public-domain btoa/atob utilities use the 'z' character to
      represent the special case of a quantum of 4 zero bytes.  Base85
      can use the same trick, without having to reserve an 86th
      character, by disallowing 'z' as the most significant digit in an
      encoded quantum.  This only affects the encoding/decoding of 5-
      character quanta, in which the most significant digit can have a
      maximum value of 83.  The 'z' character is deliberately mapped to
      the value 83 in the Base85 alphabet.  While the underscore
      normally represents the value 84, if it is used as the most
      significant digit of a 5-character quantum, it is mapped to 83
      instead.

      If a decoder encounters a 'z' character at the beginning of a
      quantum, that character is interpreted as an entire quantum of 4
      zero bytes.  To ensure that round-trip conversions yield the same
      result, the quantum "00000" is considered an encoding violation.





Kwiatkowski               Expires - March 2004                  [Page 6]

                  A Base-85 Encoding Suitable for XML      November 2004


      The following examples incorporate this modification:

         "00000" => encoding violation
         "z"     => [00, 00, 00, 00]
         "zL@33" => [00, 00, 00, 00, CA, C1, 73]
         "00001" => [00, 00, 00, 01]
         "_L@33" => [FF, FF, FF, FF]
         "0000_" => [00, 00, 00]
         "000__" => [00, 00]
         "00___" => [00]
         "zz00_" => [00, 00, 00, 00, 00, 00, 00, 00, 00]
         "_00zz" => [FF, 35, 5A, 1B]


4. Detailed Example

   This section walks through the encoding and decoding of a sample
   octet sequence in more detail.

   4.1 Encoding

      Beginning with the octet sequence:

         [FF, 3E, 79, 5F, 00, 00, 00, 00, 3C, C3]

      we isolate the first quantum of four octets and convert them to a
      single unsigned integer:

            FF3E795F hex = 4282284383 decimal

      To obtain the least significant encoded digit, we divide by 84 and
      note the remainder:

         4282284383 / 84 = 50979575 rem 83

      To obtain the remaining digits, from least to most significant, we
      repeatedly divide by 85 and note the remainders:

           50979575 / 85 = 599759   rem 60
             599759 / 85 = 7055     rem 84
               7055 / 85 = 83       rem 0
                 83 / 85 = 0        rem 83

      (The last division is clearly unnecessary in practice.)

      This yields the following digits (shown in decimal):

         (83, 0, 84, 60, 83)



Kwiatkowski               Expires - March 2004                  [Page 7]

                  A Base-85 Encoding Suitable for XML      November 2004


      These values are then used as indices into the Base85 alphabet,
      observing the exception that a leading 83 is represented with an
      underscore, rather than the 'z' character.  We obtain the string:

         "_0_yz"

      Proceeding to the next quantum, we find four zero bytes, which we
      represent with a single 'z' character.  Our encoded string is now:

         "_0_yzz"

      For the remaining quantum of 2 octets, we use the same process as
      for the first quantum, but generate 3 encoded digits instead of 5:

           3CC3 hex = 15555 decimal

         15555 / 84 = 185 rem 15
           185 / 85 = 2   rem 15
             2 / 85 = 0   rem 2

      These digits map straightforwardly to the string "2FF", yielding
      the final encoding string:

         "_0_yzz2FF"

      This can be padded with an arbitrary number of underscores.  In
      this example, we'll assume the client wishes to pad it to a total
      of 16 characters:

         "_0_yzz2FF_______"

   4.2 Decoding

      To decode the above string, we begin by stripping off all trailing
      underscore characters:

         "_0_yzz2FF"

      We then inspect the first character to see if it is a 'z'.  Since
      it is not, we isolate the first quantum of 5 characters and map
      them to decimal values based on their position in the Base85
      alphabet.  Since this is a quantum of size 5, we observe the
      exception that a leading underscore maps to 83 instead of the
      usual 84 (note that leading underscores will only arise with
      quanta of size 5):

         (83, 0, 84, 60, 83)




Kwiatkowski               Expires - March 2004                  [Page 8]

                  A Base-85 Encoding Suitable for XML      November 2004


      We now convert this to a single integer by following a process
      that corresponds to the repeated division above:

           ((((83 * 85) + 0) * 85 + 84) * 85 + 60) * 84 + 83
         = 4282284383 decimal
         = FF3E795F hex

      Performing a big-endian conversion to bytes, we have the initial
      octets:

         [FF, 3E, 79, 5F]

      Examining the remainder of the string:

         "z2FF"

      we note that the first character is now a 'z'.  This is
      immediately consumed as a quantum and 4 zero bytes are appended to
      the decoded result:

         [FF, 3E, 79, 5F, 00, 00, 00, 00]

      The remainder of the string is a single quantum comprising 3
      characters, since it does not begin with a 'z':

         "2FF"

      We map these characters to the decimal digits (2, 15, 15) and once
      again multiply and add repeatedly to obtain a single integer:

           ((2 * 85) + 15) * 84 + 15
         = 15555 decimal
         = 3CC3 hex

      Applying a big-endian conversion and appending the resulting
      octets to the decoded data, we get the final result:

         [FF, 3E, 79, 5F, 00, 00, 00, 00, 3C, C3]













Kwiatkowski               Expires - March 2004                  [Page 9]

                  A Base-85 Encoding Suitable for XML      November 2004


5. Implementation Specifics

   Section 2 of RFC 3548 provides several specific implementation
   recommendations for base encoded data [RFC3548].  In particular, it
   mandates that:

      Implementations MUST NOT add line feeds to base encoded data
      unless the specification referring to this document explicitly
      directs base encoders to add line feeds after a specific number of
      characters.

   and:

      Implementations MUST reject the encoding if it contains characters
      outside the base alphabet when interpreting base encoded data,
      unless the specification referring to this document explicitly
      states otherwise.  Such specifications may, as MIME does, instead
      state that characters outside the base encoding alphabet should
      simply be ignored when interpreting data ("be liberal in what you
      accept").  Note that this means that any CRLF constitute "non
      alphabet characters" and are ignored.

   In general, implementations MUST follow the above recommendations for
   Base85, too, unless the specification referring to this document
   explicitly states otherwise.  However, in the specific case of an XML
   document, it is reasonable to allow, and ignore, whitespace
   characters (space, CR, LF or TAB) in the midst of a Base85-encoded
   string.

   In Base85, padding is optional, but implementations should know when
   it might be present, given that it can affect round-trip behavior.
   Hence, this recommendation for Base85:

      Implementations MUST NOT include pad characters at the end of
      encoded data unless the specification referring to this document
      explicitly states otherwise.

   A C++ reference implementation of a Base85 encoder/decoder can be
   downloaded at:

      https://sourceforge.net/projects/base85-for-xml

   This source code has been released to the public domain.








Kwiatkowski               Expires - March 2004                 [Page 10]

                  A Base-85 Encoding Suitable for XML      November 2004


6. Comparison with Base64

   In a UTF-8 encoded XML entity, Base85-encoded data will be about 25
   percent larger than the unencoded data.  In other character
   encodings, such as UTF-16, Base85 may be far from optimal, but it
   will still compare favorably with Base64.

   Asymptotically, Base85 encoded data requires 15/16 (93.75%) of the
   storage needed by Base64.  For long runs of zero bytes, the asymptote
   shrinks to 3/16 (18.75%).  The comparison is slightly more favorable
   for shorter blocks of data.  Ignoring zero-compression, data blocks
   of 1 to 32 bytes in length will require, on average, 89.84% of the
   space needed by Base64 (this is largely due to the padding
   requirement in Base64).  In particular, 128-bit numbers can be
   represented with just 20 characters in Base85, compared to 24 in
   Base64, so the ratio is 83.33% for that common case.

   Put another way, Base85 asymptotically has only 3/4 of the overhead
   of Base64 (in a UTF-8 encoded entity).  For a 128-bit number, it has
   1/2 the overhead.


Security Considerations

   When implementing base encoding and decoding, care should be taken
   not to introduce vulnerabilities to buffer overflow attacks, or other
   attacks on the implementation.  A decoder should not break on invalid
   input including, for example, embedded NUL characters (ASCII 0).

   If invalid digits are ignored, instead of causing rejection of the
   entire encoding (as recommended), a covert channel that can be used
   to "leak" information is made possible.  The implications of this
   should be understood in applications that do not follow the
   recommended practice.  Whitespace in the midst of Base85-encoded data
   in XML documents can of course form such a covert channel.  However,
   this is true for all whitespace in an XML document, so applications
   that read and write XML documents must be aware of this anyway.

   Base encoding visually hides otherwise easily recognized information,
   such as passwords, but does not provide any computational
   confidentiality.  This has been known to cause security incidents
   when, for example, a user reports details of a network protocol
   exchange (perhaps to illustrate some other problem) and accidentally
   reveals the password because he or she is unaware that the base
   encoding does not protect the password.






Kwiatkowski               Expires - March 2004                 [Page 11]

                  A Base-85 Encoding Suitable for XML      November 2004


Normative References

   [XML]     Extensible Markup Language (XML) 1.0 (Second Edition),
             T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler (Eds.)
             http://www.w3.org/TR/2000/REC-xml-20001006, 6 October 2000.

   [RFC2119] Key words for use in RFCs to Indicate Requirement Levels,
             S. Bradner, BCP 14, RFC 2119, March 1997.


Informative References

   [MIME]    Multipurpose Internet Mail Extensions (MIME) Part One:
             Format of Internet Message Bodies, N. Freed, N. Borenstein,
             RFC 2045, November 1996.

   [RFC3548] The Base16, Base32, and Base64 Data Encodings,
             S. Josefsson, RFC 3548, July 2003.

   [RFC1924] A Compact Representation of IPv6 Addresses,
             R. Elz, RFC 1924, 1 April 1996.
             (Once again, the present memo is a serious proposal,
              despite the fact that it references an "April 1" RFC.)


Acknowledgments

   I wish to thank Robert Elz for promptly answering my questions
   regarding his memo, and for commenting on this document.  I also wish
   to thank Simon Josefsson for letting me re-use the text of RFC 3548's
   security section, and for his detailed responses to my questions
   about that document.


Author's Address

   Paul Kwiatkowski
   PMB 785
   15600 NE 8th Street, Suite B1
   Bellevue, WA  98008
   USA

   Email: paulkw@paulkw.com


Intellectual Property Rights

   The IETF takes no position regarding the validity or scope of any
   intellectual property or other rights that might be claimed to


Kwiatkowski               Expires - March 2004                 [Page 12]

                  A Base-85 Encoding Suitable for XML      November 2004


   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; neither does it represent that it
   has made any effort to identify any such rights.  Information on the
   IETF's procedures with respect to rights in standards-track and
   standards-related documentation can be found in BCP-11.  Copies of
   claims of rights made available for publication and any assurances of
   licenses to be made available, or the result of an attempt made to
   obtain a general license or permission for the use of such
   proprietary rights by implementors or users of this specification can
   be obtained from the IETF Secretariat.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights which may cover technology that may be required to practice
   this standard.  Please address the information to the IETF Executive
   Director.


Full Copyright Notice

   Copyright (C) The Internet Society (2003). All rights reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.





Kwiatkowski               Expires - March 2004                 [Page 13]


PAFTECH AB 2003-20262026-04-24 19:52:38