<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
<!ENTITY draftIpfixFile PUBLIC "" "http://xml.resource.org/public/rfc/bibxml3/reference.I-D.ietf-ipfix-file.xml">
<!ENTITY draftIpfixAs PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml3/reference.I-D.ietf-ipfix-as.xml'>
<!ENTITY draftIpfixArchitecture PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml3/reference.I-D.ietf-ipfix-architecture.xml'>
<!ENTITY draftIpfixMedframe PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml3/reference.I-D.ietf-ipfix-mediators-framework.xml'>
<!ENTITY rfc3917 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.3917.xml'>
<!ENTITY rfc5101 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5101.xml">
<!ENTITY rfc5102 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.5102.xml">
<!ENTITY rfc2119 PUBLIC "" "http://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml">

] >
<rfc ipr="trust200902" category="exp" docName="draft-boschi-ipfix-anon-03.txt">
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<?rfc toc="yes"?>
<?rfc symrefs="yes"?>

<front>
  <title abbrev="IP Flow Anonymisation Support">
    IP Flow Anonymisation Support 
  </title>
  <author initials="E." surname="Boschi" fullname="Elisa Boschi">
    <organization abbrev="Hitachi Europe">
      Hitachi Europe 
    </organization>
    <address>
      <postal>
        <street>c/o ETH Zurich</street>
        <street>Gloriastrasse 35</street>
        <city>8092 Zurich</city>
        <country>Switzerland</country>
      </postal>
      <phone>+41 44 632 70 57</phone>
      <email>elisa.boschi@hitachi-eu.com</email>
    </address>
  </author>
  <author initials="B." surname="Trammell" fullname="Brian Trammell">
    <organization abbrev="Hitachi Europe">
      Hitachi Europe 
    </organization>
    <address>
      <postal>
        <street>c/o ETH Zurich</street>
        <street>Gloriastrasse 35</street>
        <city>8092 Zurich</city>
        <country>Switzerland</country>
      </postal>
      <phone>+41 44 632 70 13</phone>
      <email>brian.trammell@hitachi-eu.com</email>
    </address>
  </author>
  <date month="March" day="30" year="2009"></date>
  <area>Operations</area>
  <workgroup>IPFIX Working Group</workgroup>
  <abstract> 

    <t>This document describes anonymisation techniques for IP flow data and
    the export of anonymised data using the IPFIX protocol. It provides a
    categorization of common anonymisation schemes and defines the parameters
    needed to describe them. It provides guidelines for the implementation of
    anonymised data export and storage over IPFIX, and describes an
    Options-based method for anonymisation metadata export within the
    IPFIX protocol. This method provides the basis for the definition of
    information models for configuring anonymisation techniques within an
    IPFIX Metering or Exporting Process, and for reporting the technique in
    use to an IPFIX Collecting Process.</t>

  </abstract>
</front>

<middle>

  <section title="Open Issues">

    <t>There is not yet a mechanism for exporting information about
    defined-time anonymisation stability.</t>

    <t>The terminology section is incomplete; we should decide which of the
    terms introduced in this document are to be treated as terminology.</t>

    <t>Between "classes" of techniques and "parameters", there may be
    "properties" as well; for example, binning and timestamp anonymisation may
    be "ordered" or not (x>y in real --> x>y in anonymized). We should verify
    that we're splitting these up correctly.</t>

    <t>In parallel with this, the anonymisationTechnique values might be
    useful as a bitfield, with properties and classes being represented by
    some set of the bits in the field. We'll have to make sure that the
    properties and classes are exhaustive, if we do this.</t>

    <t>Both anonymisationStability and anonymisationTechnique might benefit
    from the creation of IANA registries; HOWEVER, in this case, it would be
    very important to ensure that such a registry contains only classes and
    properties of anonymised data, not information about specific
    algorithms.</t>

    <t>Certain technique/IE combinations (e.g. structure-preserving counters)
    don't make any sense; these should be noted in "IPFIX-Specific
    Anonymisation Guidelines".</t>

    <t>Guidelines should be provided for the evaluation of _new_ IEs added to
    the IANA registry after the publication of this draft for their
    anonymisation potential.</t>

    <t>This document does not cover the anonymisation of sub-IP level
    information, specifically MAC addresses. It should.</t>

    <!-- Do we want to add information elements and templates for dissemination and publication policies? -->

  </section>

  <section title="Introduction">

    <t>The standardisation of an IP flow information export protocol <xref target="RFC5101"></xref> and associated representations removes a
    technical barrier to the sharing of IP flow data across organizational
    boundaries and with network operations, security, and research communities
    for a wide variety of purposes. However, with wider dissemination comes
    greater risks to the privacy of the users of networks under measurement,
    and to the security of those networks. While it is not a complete solution
    to the issues posed by distribution of IP flow information, anonymisation
    is an important tool for the protection of privacy within network
    measurement infrastructures.</t>

    <t>This document presents a mechanism for representing anonymised data
    within IPFIX and guidelines for using it. It begins with a categorization
    of anonymisation techniques. It then describes applicability of each
    technique to commonly anonymisable fields of IP flow data, organized by
    information element data type and semantics as in <xref target="RFC5102"></xref>; enumerates the parameters required by each of
    the applicable anonymisation techniques; and provides guidelines for the
    use of each of these techniques in accordance with best practices in data
    protection. Finally, it specifies a mechanism for exporting anonymised
    data and binding anonymisation metadata to templates using IPFIX
    Options.</t>

    <section title="IPFIX Protocol Overview">

      <t>In the IPFIX protocol, { type, length, value } tuples are expressed
      in templates containing { type, length } pairs, specifying which { value
      } fields are present in data records conforming to the Template, giving
      great flexibility as to what data is transmitted. Since Templates are
      sent very infrequently compared with Data Records, this results in
      significant bandwidth savings. Various different data formats may be
      transmitted simply by sending new Templates specifying the { type,
      length } pairs for the new data format. See <xref target="RFC5101"></xref> for more information.</t>

      <t>The <xref target="RFC5102">IPFIX information model</xref> defines a
      large number of standard Information Elements which provide the
      necessary { type } information for Templates. The use of standard
      elements enables interoperability among different vendors'
      implementations. Additionally, non-standard enterprise-specific elements
      may be defined for private use.</t>

    </section>

    <section title="IPFIX Documents Overview" anchor="intro-docs">

      <t><xref target="RFC5101">"Specification of the IPFIX
      Protocol for the Exchange of IP Traffic Flow Information"</xref>
      and its associated documents
      define the IPFIX Protocol, which provides network engineers and
      administrators with access to IP traffic flow information.</t>

      <t><xref target="I-D.ietf-ipfix-architecture">"Architecture for IP Flow
      Information Export"</xref> defines
      the architecture for the export of measured IP flow information out of
      an IPFIX Exporting Process to an IPFIX Collecting Process, and the
      basic terminology used to describe the elements of this architecture,
      per the requirements defined in <xref target="RFC3917">"Requirements
      for IP Flow Information Export"</xref>. The IPFIX Protocol document
      <xref target="RFC5101"></xref> then covers the details of the method for
      transporting IPFIX Data Records and Templates via a congestion-aware
      transport protocol from an IPFIX Exporting Process to an IPFIX
      Collecting Process.</t>

      <t><xref target="RFC5102">"Information Model for IP Flow Information
      Export"</xref> describes the Information Elements used by IPFIX,
      including details on Information Element naming, numbering, and data
      type encoding. Finally, <xref target="I-D.ietf-ipfix-as">"IPFIX
      Applicability"</xref> describes the various applications of the IPFIX
      protocol and their use of information exported via IPFIX, and relates
      the IPFIX architecture to other measurement architectures and
      frameworks.</t>

      <t>Additionally, the <xref target="I-D.ietf-ipfix-file">"Specification
      of the IPFIX File Format"</xref> describes a file format based upon the
      IPFIX Protocol for the storage of flow data.</t>

      <t>This document references the Protocol and Architecture documents for
      terminology, and extends the IPFIX Information Model to provide new
      Information Elements for anonymisation metadata. The anonymisation
      techniques described herein are equally applicable to the IPFIX Protocol
      and data stored in IPFIX Files.</t>

    </section>

  </section>

  <section title="Terminology">

    <t>Terms used in this document that are defined in the Terminology section
    of the <xref target="RFC5101">IPFIX Protocol</xref> document are to be
    interpreted as defined there.</t>

    <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
    "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
    document are to be interpreted as described in <xref target="RFC2119">RFC
    2119</xref>.</t>

  </section>

  <section title="Categorisation of Anonymisation Techniques">

    <t>Anonymisation modifies a data set in order to
    protect the identity of the people or entities described by the data set
    from disclosure. With respect to network traffic data, anonymisation
    generally attempts to preserve some set of properties of the network
    traffic useful for a given application or applications, while ensuring the
    data cannot be traced back to the specific networks, hosts, or users
    generating the traffic.</t>

    <t>Anonymisation may be broadly classified according to two properties:
    recoverability and countability. All anonymisation techniques map the real
    space of identifiers or values into a separate, anonymised space,
    according to some function. A technique is said to be recoverable when the
    function used is invertible or can otherwise be reversed and a real
    identifier can be recovered from a given replacement identifier.</t>

    <t>Countability compares the dimension of the anonymised space (N) to the
    dimension of the real space (M), and denotes how the count of unique
    values is preserved by the anonymisation function. If the anonymised space
    is smaller than the real space, then the function is said to generalise
    the input, mapping more than one input point to each anonymous value
    (e.g., as with aggregation). By definition, generalisation is not
    recoverable.</t>

    <t>If the dimensions of the anonymised and real spaces are the
    same, such that the count of unique values is preserved, then the function
    is said to be a direct substitution function. If the dimension of the
    anonymised space is larger, such that each real value maps to a set of
    anonymised values, then the function is said to be a set substitution
    function. Note that with set substitution functions, the sets of
    anonymised values are not necessarily disjoint. Either direct or set
    substitution functions are said to be one-way if there exists no method
    for recovering the real data point from an anonymised one.</t>

    <t>This classification is summarised in the table below.</t>

      <texttable> 
        <ttcol align="left">Recoverability / Countability</ttcol> 
        <ttcol align="left">Recoverable</ttcol> 
        <ttcol align="left">Non-recoverable</ttcol>
        <c>N &lt; M </c><c>N.A.</c><c>Generalisation</c>
        <c>N = M </c><c>Direct Substitution</c><c>One-way Direct Substitution</c>
        <c>N &gt; M </c><c>Set Substitution</c><c>One-way Set Substitution</c> 
      </texttable>
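
      <t>As an illustration only, the classification in the table can be
      restated as a small decision function. The sketch below is in Python;
      the function name and its arguments are hypothetical and are not
      defined by IPFIX or by this document.</t>

      <figure><artwork><![CDATA[
```python
def classify(n_anon, m_real, recoverable):
    """Name the category for an anonymised space of size n_anon (N)
    and a real space of size m_real (M)."""
    if n_anon < m_real:
        # Several real values share one anonymised value, so the
        # mapping cannot be inverted.
        if recoverable:
            raise ValueError("generalisation is not recoverable")
        return "Generalisation"
    if n_anon == m_real:
        kind = "Direct Substitution"
    else:
        kind = "Set Substitution"
    return kind if recoverable else "One-way " + kind
```
]]></artwork></figure>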

      </section>

  <section title="Anonymisation of IP Flow Data">

    <t>Due to the restricted semantics of IP flow data, there is a relatively
    limited set of specific anonymisation techniques applicable to flow data,
    though each falls into the broad categories above. Each type of field that
    may commonly appear in a flow record may have its own applicable specific
    techniques.</t>

    <t>While anonymisation is generally applied at the resolution of single
    fields within a flow record, attacks against anonymisation use entire
    flows and relationships between hosts and flows within a given data set.
    Therefore, fields which may not necessarily be identifying by themselves
    may be anonymised in order to increase the anonymity of the data set as a
    whole.</t>

    <t>Of all the fields in an IP flow record, only IP addresses directly
    identify entities in the real world. Each IP address is associated with an
    interface on a network host, and can potentially be identified with a
    single user. Additionally, IP addresses are structured identifiers; that
    is, partial IP address prefixes may be used to identify networks just as
    full IP addresses identify hosts. This makes anonymisation of IP addresses
    particularly important.</t>

    <t>Port numbers identify abstract entities (applications) as opposed to
    real-world entities, but they can be used to classify hosts and user
    behavior. Passive port fingerprinting, both of well-known and ephemeral
    ports, can be used to determine the operating system running on a host.
    Relative data volumes by port can also be used to determine the host's
    function (workstation, web server, etc.); this information can be used to
    identify hosts and users.</t>

    <t>While not identifiers in and of themselves, timestamps and counters
    can reveal the behavior of the hosts and users on a network. Any given
    network activity is recognizable by a pattern of relative time differences
    and data volumes in the associated sequence of flows, even without host
    address information. They can therefore be used to identify hosts and
    users. Timestamps and counters are also vulnerable to traffic injection
    attacks, where traffic with a known pattern is injected into a network
    under measurement, and this pattern is later identified in the anonymised
    data set. </t>

    <t>The simplest and most extreme form of anonymisation, which can be
    applied to any field of a flow record, is black-marker anonymisation, or
    complete deletion of a given field. Note that black-marker anonymisation
    is equivalent to simply not exporting the field(s) in question.</t>

    <t> While black-marker anonymisation completely protects the data in
    the deleted fields from the risk of disclosure, it also reduces the
    utility of the anonymised data set as a whole. Techniques that retain some
    information while reducing (though not eliminating) the disclosure risk
    will be extensively discussed in the following sections; note that the
    techniques specifically applicable to IP addresses, timestamps, ports, and
    counters will be discussed in separate sections.</t>

    <section title="IP Address Anonymisation">

      <t>Since IP addresses are the most common identifiers within flow data
      that can be used to directly identify a person, organization, or host,
      most of the work on flow and trace data anonymisation has gone into IP
      address anonymisation techniques. Indeed, the aim of most attacks
      against anonymisation is to recover the map from anonymised IP addresses
      to original IP addresses, thereby identifying the hosts behind them. There
      is therefore a wide range of IP address anonymisation schemes that fit
      into the following categories.</t>

      <texttable> 
        <ttcol align="left">Scheme</ttcol> 
        <ttcol align="left">Action</ttcol> 
        <c>Truncation</c><c>Generalisation</c>
        <c>Random Permutation</c><c>Direct Substitution</c>
        <c>Prefix-preserving Pseudonymisation</c><c>Direct Substitution</c>
      </texttable>

      <section title="Truncation">

        <t>Truncation removes "n" of the least significant bits from an IP
        address, replacing them with zeroes. In effect, it replaces a host
        address with a network address for some fixed netblock; for IPv4
        addresses, 8-bit truncation corresponds to replacement with a /24
        network address. Truncation is a non-reversible generalisation scheme.
        Note that while truncation is effective for making hosts
        non-identifiable, it preserves information which can be used to
        identify an organization, a geographic region, a country, or a
        continent (or RIR region of responsibility).</t>

        <t>Truncation to an address length of 0 is equivalent to black-marker
        anonymisation. Removal of IP address information is only recommended
        for analysis tasks which have no need to separate flow data by host or
        network; e.g. as a first stage to per-application (port) or
        time-series total volume analyses.</t>
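
        <t>A minimal sketch of truncation follows, assuming addresses are
        handled as strings and using Python's standard ipaddress module. The
        function name truncate_address and the parameter n_bits are
        illustrative and not part of any specification.</t>

        <figure><artwork><![CDATA[
```python
import ipaddress

def truncate_address(addr, n_bits):
    """Zero the n_bits least significant bits of an IPv4 or IPv6 address."""
    ip = ipaddress.ip_address(addr)
    if not 0 <= n_bits <= ip.max_prefixlen:
        raise ValueError("n_bits out of range for this address family")
    low_mask = (1 << n_bits) - 1       # the bits to be removed
    truncated = int(ip) & ~low_mask    # keep only the network part
    return str(type(ip)(truncated))    # same address family as the input
```
]]></artwork></figure>

        <t>With this sketch, removing 8 bits from an IPv4 address yields its
        /24 network address, and removing all 32 bits yields the all-zero
        address, which carries no more information than an omitted
        field.</t>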

      </section>

      <section title="Random Permutation">

        <t>Random permutation is a direct substitution technique, replacing
        each IP address with an address randomly selected from the set of
        possible IP addresses, guaranteeing that each anonymised address
        represents a unique original address. The random permutation does not
        preserve any structural information about a network, but it does
        preserve the unique count of IP addresses. Any application that
        requires more structure than host-uniqueness will not be able to use
        randomly permuted IP addresses.</t>
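
        <t>One possible way to realise a random permutation without
        materialising a table for the whole address space is to assign
        replacement values lazily, as each original value is first seen. The
        sketch below assumes addresses are handled as integers; the class
        name LazyPermutation is hypothetical. The resulting table is exactly
        the kind of mapping that has to be kept secret.</t>

        <figure><artwork><![CDATA[
```python
import secrets

class LazyPermutation:
    """Injective random mapping over [0, 2**space_bits), built on demand."""

    def __init__(self, space_bits=32):
        self.space = 1 << space_bits
        self.forward = {}    # original value -> anonymised value
        self.used = set()    # anonymised values already handed out

    def map(self, value):
        if not 0 <= value < self.space:
            raise ValueError("value outside the permuted space")
        if value not in self.forward:
            # Draw until an unused replacement is found, so that no two
            # originals ever share an anonymised value.
            while True:
                candidate = secrets.randbelow(self.space)
                if candidate not in self.used:
                    break
            self.used.add(candidate)
            self.forward[value] = candidate
        return self.forward[value]
```
]]></artwork></figure>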

      </section>

      <section title="Prefix-preserving Pseudonymisation">

        <t>Prefix-preserving pseudonymisation is a direct substitution
        technique, further restricted such that the structure of subnets is
        preserved at each level while anonymising IP addresses. If two real IP
        addresses match on a prefix of "n" bits, the two anonymised IP
        addresses will match on a prefix of "n" bits as well. This is useful
        when relationships among networks must be preserved for a given
        analysis task, but introduces structure into the anonymised data which
        can be exploited in attacks against the anonymisation technique.</t>
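
        <t>The prefix-preserving property can be sketched as follows, under
        the assumption that each output bit is the input bit optionally
        flipped by a keyed pseudorandom function of the preceding input
        bits. The use of HMAC-SHA-256 here is an illustrative choice, not a
        scheme named by this document, and the function name
        prefix_preserving is hypothetical.</t>

        <figure><artwork><![CDATA[
```python
import hashlib
import hmac

def prefix_preserving(addr_int, key, bits=32):
    """Anonymise an integer address so that shared prefixes stay shared."""
    out = 0
    for i in range(bits):
        prefix = addr_int >> (bits - i)    # the i most significant bits
        msg = i.to_bytes(1, "big") + prefix.to_bytes(16, "big")
        # The flip decision depends only on the bits before position i,
        # so two addresses sharing a prefix get the same decisions for it.
        flip = hmac.new(key, msg, hashlib.sha256).digest()[0] & 1
        bit = (addr_int >> (bits - 1 - i)) & 1
        out = (out << 1) | (bit ^ flip)
    return out
```
]]></artwork></figure>

        <t>Because the first differing bit of two addresses is flipped by the
        same decision in both, the anonymised addresses in this sketch share
        exactly as many leading bits as the originals.</t>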

      </section>
      
    </section>

    <section title="Timestamp Anonymisation">

      <t>The particular time at which a flow began or ended is not
      particularly identifiable information, but it can be used as part of
      attacks against other anonymisation techniques or for user profiling.
      Precise timestamps can be used in injected-traffic fingerprinting
      attacks [CITE] as well as to identify certain activity by response delay
      and size fingerprinting [CITE]. Therefore, timestamp information may be
      anonymised in order to ensure the protection of the entire dataset.</t>

      <texttable> 
        <ttcol align="left">Scheme</ttcol> 
        <ttcol align="left">Action</ttcol> 
        <c>Precision Degradation</c><c>Generalisation</c>
        <c>Enumeration</c><c>Direct or Set Substitution</c>
        <c>Random Shifts</c><c>Direct Substitution</c>
      </texttable>

      <section title="Precision Degradation">

        <t>Precision Degradation is a generalisation technique that removes
        the most precise components of a timestamp, accounting all events
        occurring in each given interval (e.g. one millisecond for millisecond
        level degradation) as simultaneous. This has the effect of potentially
        collapsing many timestamps into one. With this technique, time
        precision is reduced and sequencing may be lost, but the approximate
        time at which each event occurred is preserved. The anonymised data may
        not be generally useful for applications which require strict
        sequencing of flows.</t>

        <t>Note that flow meters with low time precision (e.g. second
        precision, or millisecond precision on high-capacity networks) perform
        the equivalent of precision degradation anonymisation by their
        design.</t>

        <t>Note also that degradation to a very low precision (e.g. on the
        order of minutes, hours, or days) is commonly used in analyses
        operating on time-series aggregated data, and is referred to as binning;
        though the time scales are longer and applicability more restricted,
        this is in principle the same operation.</t>

        <t>Precision degradation to infinitely low precision is equivalent to
        black-marker anonymisation. Removal of timestamp information is only
        recommended for analysis tasks which have no need to separate flows in
        time, for example for counting total volumes or unique occurrences of
        other flow keys in an entire dataset.</t>
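
        <t>As a rough sketch, precision degradation amounts to rounding each
        timestamp down to the start of its interval. The example assumes
        timestamps expressed as integer milliseconds; the function and
        parameter names are illustrative only.</t>

        <figure><artwork><![CDATA[
```python
def degrade_timestamp(ts_ms, interval_ms):
    """Round a millisecond timestamp down to the start of its interval."""
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    # Every event inside the same interval collapses to a single value.
    return ts_ms - (ts_ms % interval_ms)
```
]]></artwork></figure>

        <t>Choosing an interval of one second gives second-level degradation;
        an interval of one hour or one day gives the binning described
        above.</t>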

      </section>
      
      <section title="Enumeration">

        <t>Enumeration is a substitution function that retains the
        chronological order in which events occurred while eliminating time
        information. Timestamps are substituted by equidistant timestamps (or
        numbers) starting from a randomly chosen start value. The resulting
        data is useful for applications requiring strict sequencing, but not
        for those requiring good timing information (e.g. delay- or jitter-
        measurement for QoS applications or SLA validation).</t>
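
        <t>A minimal sketch of enumeration follows. It assumes that equal
        timestamps receive equal replacements (the direct substitution
        variant) and that the whole dataset is available at once. The
        function name and the step parameter are hypothetical.</t>

        <figure><artwork><![CDATA[
```python
import secrets

def enumerate_timestamps(timestamps, step=1):
    """Replace timestamps by equidistant values, keeping their order."""
    start = secrets.randbelow(1 << 31)    # randomly chosen start value
    ordered = sorted(set(timestamps))     # chronological order, no repeats
    label = {ts: start + i * step for i, ts in enumerate(ordered)}
    return [label[ts] for ts in timestamps]
```
]]></artwork></figure>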

      </section>
      
      <section title="Random Time Shifts">

        <t>Random time shifts add a random offset to every timestamp within a
        dataset. This reversible substitution technique therefore retains
        duration and inter-event interval information as well as chronological
        order of flows. It is primarily intended to defeat traffic injection
        fingerprinting attacks.</t>
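
        <t>The following sketch applies a single random offset to a whole
        dataset, assuming integer millisecond timestamps. The bound
        max_shift_ms and the function name are assumptions made for the
        example. The offset is returned only so that the anonymising party
        can retain it privately.</t>

        <figure><artwork><![CDATA[
```python
import secrets

def shift_timestamps(timestamps, max_shift_ms=86_400_000):
    """Add one random offset in [-max_shift_ms, max_shift_ms] to all values."""
    offset = secrets.randbelow(2 * max_shift_ms + 1) - max_shift_ms
    # The same offset is applied everywhere, so durations and
    # inter-event intervals are unchanged.
    return [ts + offset for ts in timestamps], offset
```
]]></artwork></figure>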

      </section>
      
    </section>

    <section title="Counter Anonymisation">

      <t>Counters (such as packet and octet volumes per flow) are subject to
      fingerprinting and injection attacks against anonymisation, and can be
      used for user profiling, just as timestamps are. Counter anonymisation
      can help defeat these attacks, but the resulting data is only usable
      for analysis tasks for which relative or imprecise magnitudes of
      activity are useful. </t>

      <texttable> 
        <ttcol align="left">Scheme</ttcol> 
        <ttcol align="left">Action</ttcol> 
        <c>Precision Degradation</c><c>Generalisation</c>
        <c>Binning</c><c>Generalisation</c>
        <c>Random noise addition</c><c>Direct or Set Substitution</c>
      </texttable>

      <section title="Precision Degradation">

        <t>As with precision degradation in timestamps, precision degradation
        of counters removes lower-order bits of the counters, treating all the
        counters in a given range as having the same value. Depending on the
        precision reduction, this loses information about the relationships
        between sizes of similarly-sized flows, but keeps relative magnitude
        information.</t>

      </section>

      <section title="Binning">

        <t>Binning can be seen as a more general form of precision
        degradation; the operation is identical, except that in precision
        degradation the counter ranges are uniform, and in binning they need
        not be. For
        example, a common counter binning scheme for packet counters could be
        to bin values 1-2 together, and 3-infinity together, thereby
        separating potentially completely-opened TCP connections from unopened
        ones. Binning schemes are generally chosen to keep precisely the
        amount of information required in a counter for a given analysis task.
        Note that, also unlike precision degradation, the bin label need not
        be within the bin's range.</t>

        <t>Binning counters to a single bin 0-infinity, or alternately
        precision degradation to infinitely low precision, is equivalent to
        black-marker anonymisation. Removal of counter information is only
        recommended for analysis tasks which have no need to evaluate the
        removed counter, for example for counting only unique occurrences of
        other flow keys.</t> 
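
        <t>Counter binning with non-uniform ranges might be sketched as a
        lookup over sorted bin edges, here using Python's standard bisect
        module. The function name, the edge list, and the string labels are
        illustrative assumptions. The labels are deliberately not values
        from within the bins.</t>

        <figure><artwork><![CDATA[
```python
import bisect

def bin_counter(value, edges, labels):
    """Map a counter to the label of the bin it falls into.

    edges holds the sorted lower bounds of every bin except the first,
    so labels must contain exactly one more entry than edges.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one label per bin")
    return labels[bisect.bisect_right(edges, value)]
```
]]></artwork></figure>

        <t>The packet-count scheme described above would then use the edge
        list [3] with two labels, one for values 1-2 and one for values of 3
        or more. An empty edge list with a single label corresponds to the
        single-bin case.</t>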

      </section>

      <section title="Random Noise Addition">

        <t>Random noise addition adds a random amount to a counter in each
        flow; this is used to keep relative magnitude information and minimize
        the disruption to size relationship information while avoiding
        fingerprinting attacks against anonymisation. Note that there is no
        guarantee that random noise addition will maintain ranking order by a
        counter among members of a set. Random noise addition is particularly
        useful when the derived analysis data will not be presented in such a
        way as to require the lower-order bits of the counters.</t>
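
        <t>A sketch of random noise addition is given below. It assumes
        non-negative, uniformly distributed integer noise bounded by a
        hypothetical max_noise parameter; other noise distributions are
        equally possible and nothing in this document prescribes one.</t>

        <figure><artwork><![CDATA[
```python
import random

def add_noise(counter, max_noise, rng=None):
    """Add a random amount in [0, max_noise] to a single counter value."""
    if rng is None:
        rng = random.SystemRandom()    # OS-provided randomness by default
    # Only the low-order part of the value is disturbed, so coarse
    # magnitude is kept, but ranking between close counters may change.
    return counter + rng.randint(0, max_noise)
```
]]></artwork></figure>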

      </section>

    </section>

    <section title="Anonymisation of Other Flow Fields">

      <t>Other fields, particularly port numbers and protocol numbers, can
      be used to partially identify the applications that generated the
      traffic in a given flow trace. This information can be used in
      fingerprinting attacks, and may be of interest on its own (e.g., to
      reveal that a certain application with suspected vulnerabilities is
      running on a given network). These fields are generally
      anonymised using one of two techniques.</t>

      <texttable> 
        <ttcol align="left">Scheme</ttcol> 
        <ttcol align="left">Action</ttcol> 
        <c>Binning</c><c>Generalisation</c>
        <c>Random Permutation</c><c>Direct Substitution</c>
      </texttable>
      
      <section title="Binning">

        <t>Binning is a generalisation technique mapping a set of potentially
        non-uniform ranges into a set of arbitrarily labeled bins. Common bin
        arrangements depend on the field type and the analysis application.
        For example, an IP protocol bin arrangement may preserve 1, 6, and 17
        for ICMP, TCP, and UDP traffic, and bin all other protocols into a
        single bin, to mitigate the use of uncommon protocols in
        fingerprinting attacks. Another example arrangement may bin source and
        destination ports into low (0-1023) and high (1024-65535) bins in
        order to tell service from ephemeral ports without identifying
        individual applications.</t>

        <t>Binning other flow key fields to a single bin is equivalent to
        black-marker anonymisation. Removal of other flow key information is
        only recommended for analysis tasks which have no need to
        differentiate flows on the removed keys, for example for total traffic
        counts or unique counts of other flow keys.</t>
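
        <t>The two example arrangements above could be sketched as follows.
        The function names and the bin labels ("other", "low", "high") are
        arbitrary choices made for this illustration.</t>

        <figure><artwork><![CDATA[
```python
def bin_protocol(proto):
    """Keep ICMP (1), TCP (6) and UDP (17); merge every other protocol."""
    if not 0 <= proto <= 255:
        raise ValueError("not an IP protocol number")
    return proto if proto in (1, 6, 17) else "other"

def bin_port(port):
    """Separate the low port range (0-1023) from the high range."""
    if not 0 <= port <= 65535:
        raise ValueError("not a port number")
    return "low" if port <= 1023 else "high"
```
]]></artwork></figure>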

      </section>      

      <section title="Random Permutation">

        <t>Random permutation is a direct substitution technique, replacing
        each key value with a value randomly selected from the set of
        possible values, guaranteeing that each anonymised value represents a
        unique original value. This is used to preserve the count of unique
        flow key values without preserving information about the keys
        themselves.</t>

      </section>

    </section>

  </section>

  <section title="Parameters for the Description of Anonymisation Techniques"> 

    <t>This section details the abstract parameters used to describe the
    anonymisation techniques examined in the previous section, on a
    per-parameter basis. These parameters and their export safety inform the
    design of the IPFIX anonymisation metadata export specified in the
    following section.</t>

    <section title="Stability">

      <t>Any given anonymisation technique may be applied with a varying range
      of stability. Stability is important for assessing the comparability of
      anonymised information in different data sets, or in the same data set
      over different time periods. In general, stability ranges from
      completely stable to completely unstable; however, note that the
      completely unstable case is indistinguishable from black-marker
      anonymisation. A completely stable anonymisation will always map a given
      value in the real space to the same value in the anonymised space. In
      practice, an anonymisation may also be stable for every data set
      published by a particular producer to a particular consumer, stable
      for a stated time period within a dataset or across datasets, or stable
      only for a single data set.</t>

      <t>If no information about stability is available, users of anonymised
      data may assume that the techniques used are stable across the entire
      dataset, but unstable across datasets. Note that stability presents a
      risk-utility tradeoff, as completely stable anonymisation can be used
      for longer-term trend analysis tasks but also presents more risk of
      attack given the stable mapping.</t>

      <!--<t>[EDITOR'S NOTE: are there any other universally applicable
      parameters?]</t>-->

    </section>

    <section title="Truncation Length">

      <t>Truncation and precision degradation are described by the truncation
      length, or the amount of data still remaining in the anonymised field
      after anonymisation.</t>

      <t>Truncation length can be inferred from a given data set, and need not
      be specially exported or protected.</t>

    </section>
    
    <section title="Bin Map">

      <t>Binning is described by the specification of a bin mapping function.
      This function can be generally expressed in terms of an associative
      array that maps each point in the original space to a bin, although from
      an implementation standpoint most bin functions are much simpler and
      more efficient.</t>

      <t>Since knowledge of the bin mapping function can be used to partially
      deanonymise binned data, depending on the degree of generalisation, no
      information about the bin mapping function should be exported.</t>
      
    </section>
      
    <section title="Permutation">

      <t>Like binning, permutation is described by the specification of a
      permutation function. In the general case, this can be expressed in
      terms of an associative array that maps each point in the original space
      to a point in the anonymised space. Unlike binning, each point in the
      anonymised space must correspond to a single, unique point in the
      original space.</t>

      <t>Since knowledge of the permutation function can be used to completely
      deanonymise permuted data, no information about the permutation function
      or its parameters should be exported.</t>

    </section>

    <section title="Shift Amount">

      <t>Shifting requires an amount to shift each value by. Since the shift
      amount can be used to deanonymise data protected by shifting, no
      information about the shift amount should be exported.</t>

    </section>

  </section> 

  <section title="Anonymisation Export Support in IPFIX">

    <t>Anonymised data exported via IPFIX SHOULD be annotated with
    anonymisation metadata, which identifies the anonymised fields and the
    Templates that describe them, and provides appropriate information on
    the anonymisation techniques used. This metadata SHOULD be exported in
    Data Records described by the Options Templates recommended in this
    section; these Options Templates use the additional Information Elements
    defined in <xref target="ie-section"/>.</t>

    <t>Note that fields anonymised using the black-marker (removal) technique
    do not require any special metadata support. Black-marker anonymised
    fields SHOULD NOT be exported at all; the absence of a field from a given
    Data Set is implicitly declared by omitting the corresponding
    Information Element from the Template describing that Data Set.
    Exporting "empty" data elements is inefficient and, in the general case,
    impossible, as many non-counter Information Elements do not have
    semantically distinct null values.</t>
    

    <section title="Anonymisation Options Template" anchor="opt-section">

      <t>The Anonymisation Options Template describes anonymisation records,
      which allow anonymisation metadata to be exported inline over IPFIX or
      stored in an IPFIX File, by binding information about anonymisation
      techniques to Information Elements within defined Templates. IPFIX
      Exporting Processes SHOULD export anonymisation records for any Template
      describing exported anonymised Data Records; IPFIX Collecting Processes
      and processes downstream from them MAY use anonymisation records to
      treat anonymised data differently depending on the applied
      technique.</t>

      <t>An Exporting Process SHOULD export anonymisation records after the
      Templates they describe have been exported, and SHOULD export
      anonymisation records reliably.</t>

      <t>Anonymisation records, like Templates, MUST be handled by Collecting
      Processes as scoped to the Transport Session in which they are sent.
      While the anonymisationStability IE can be used to declare that a given
      anonymisation technique's mapping will remain stable across multiple
      Sessions, an Exporting Process MUST re-export the anonymisation records
      along with the Templates in each Session.</t>

       <t>[EDITOR'S NOTE: Multiple anon. techniques applied on an IE at the
       same time is indicated with multiple elements of the same type (in
       application order as in PSAMP). Need to verify this is actually useful
       given the defined techniques.]</t>

      <texttable>
        <ttcol align="left">IE</ttcol>
        <ttcol align="left">Description</ttcol>
        <c>templateId [scope]</c>
        <c>

          The Template ID of the Template containing the Information Element
          described by this anonymisation record. This Information Element
          MUST be defined as a Scope Field.

        </c>
        <c>informationElementId [scope]</c>
        <c>

          The Information Element identifier of the Information Element
          described by this anonymisation record. This Information Element
          MUST be defined as a Scope Field.

        </c>
        <c>informationElementIndex [scope] [optional]</c>
        <c>

          The index, within the Template, of the instance of the Information
          Element identified by informationElementId that is described by
          this anonymisation record. Optional; need only be present when
          describing Templates that have multiple instances of the same
          Information Element. This Information Element MUST be defined as a
          Scope Field if present. This Information Element is defined in
          <xref target="ie-section"/>, below.

        </c>
        <c>anonymisationStability</c>
        <c>

          The stability class of the anonymised data. MUST be present. This
          Information Element is defined in <xref target="ie-section"/>,
          below.

        </c>
        <c>anonymisationTechnique</c>
        <c>

          The technique used to anonymise the data. MUST be present. This
          Information Element is defined in <xref target="ie-section"/>,
          below.

        </c>
      </texttable>
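
      <t>To make the record layout concrete, the following Python fragment
      sketches the encoding of a single anonymisation record without the
      optional informationElementIndex. The field order and the helper
      names are assumptions made for illustration; on the wire, the layout
      is whatever the exported Options Template Record declares. The
      templateId and informationElementId fields are taken to be unsigned16
      as in <xref target="RFC5102"/>, and the Set header is not shown.</t>

      <figure>
        <artwork><![CDATA[
```python
import struct

# Values taken from the anonymisationStability and
# anonymisationTechnique tables in this document.
STABILITY_SESSION = 1
TECHNIQUE_PREFIXED_PERMUTATION = 6

# Assumed layout: templateId (unsigned16), informationElementId
# (unsigned16), anonymisationStability (unsigned8),
# anonymisationTechnique (unsigned8), in network byte order.
ANON_RECORD = struct.Struct("!HHBB")


def encode_anon_record(template_id, ie_id, stability, technique):
    """Encode one anonymisation record (Set header not included)."""
    return ANON_RECORD.pack(template_id, ie_id, stability, technique)


def decode_anon_record(data):
    """Decode a record produced by encode_anon_record."""
    return ANON_RECORD.unpack(data)
```
        ]]></artwork>
      </figure>

      <t>For instance, encode_anon_record(256, 8, STABILITY_SESSION,
      TECHNIQUE_PREFIXED_PERMUTATION) would describe a sourceIPv4Address
      field (Information Element 8) in Template 256 as anonymised by
      prefix-preserving permutation with Session level stability.</t>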
    </section>
    
    <section title="Recommended Information Elements for Anonymisation Metadata" anchor="ie-section">

      <section title="anonymisationStability">
        <list style="hanging">
          <t hangText="Description: ">

            A description of the stability class of the anonymisation
            technique applied to a referenced Information Element within a
            referenced Template. Stability classes refer to the stability of
            the parameters of the anonymisation technique, and therefore the
            comparability of the mapping between the real and anonymised
            values over time. This determines which anonymised datasets may be
            compared with each other.

            <texttable>
            <ttcol align="left">Value</ttcol>
            <ttcol align="left">Description</ttcol>
       	    <c>0</c><c>Undefined: the Exporting Process makes no representation as to how stable the mapping is, or over what time period values of this field will remain comparable. While the Collecting Process MAY assume Session level stability, Session level stability is not guaranteed. This is equivalent to Session level stability (value 1), except that it advises the Collecting Process that no special effort has been made to ensure stability. Collecting Processes SHOULD assume this is the case in the absence of stability class information; this is the default stability class.</c>
       	    <c>1</c><c>Session: the Exporting Process will ensure that the parameters of the anonymisation technique are stable during the Transport Session. All the values of the described Information Element for each Record described by the referenced Template within the Transport Session are comparable. The Exporting Process SHOULD endeavour to ensure at least this stability class.</c>
       	    <c>2</c><c>Exporter-Collector Pair: the Exporting Process will ensure that the parameters of the anonymisation technique are stable across Transport Sessions over time with the given Collecting Process, but may use different parameters for different Collecting Processes. Data exported to different Collecting Processes is not comparable.</c>
       	    <c>3</c><c>Stable: the Exporting Process will ensure that the parameters of the anonymisation technique are stable across Transport Sessions over time, regardless of the Collecting Process to which it is sent.</c>
       	  </texttable>

         </t>
       	<t hangText="Abstract Data Type: ">unsigned8</t>
       	<t hangText="ElementId: ">TBD1</t>
       	<t hangText="Status: ">Proposed</t>
       </list>
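
        <t>One possible reading of how the stability classes above determine
        comparability is sketched below for two data sets produced by the
        same Exporting Process. The function name and its boolean
        parameters are hypothetical, and the Undefined class is treated
        conservatively as offering no guarantee.</t>

        <figure>
          <artwork><![CDATA[
```python
UNDEFINED, SESSION, EXPORTER_COLLECTOR_PAIR, STABLE = range(4)


def guaranteed_comparable(stability, same_session, same_collector):
    """Return True if two data sets from one Exporting Process are
    guaranteed comparable under the given stability class."""
    if stability == STABLE:
        return True
    if stability == EXPORTER_COLLECTOR_PAIR:
        # A shared Transport Session implies a shared Collector.
        return same_collector or same_session
    if stability == SESSION:
        return same_session
    return False  # Undefined: nothing is guaranteed
```
          ]]></artwork>
        </figure>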
      </section>

      <section title="anonymisationTechnique">
      	<list style="hanging">
      	  <t hangText="Description: ">

            A description of the anonymisation technique applied to a
            referenced Information Element within a referenced Template.

            <texttable>
            <ttcol align="left">Value</ttcol>
            <ttcol align="left">Description</ttcol>
       	    <c>0</c><c>Undefined: the Exporting Process makes no representation as to whether the defined field is anonymised or not. While the Collecting Process MAY assume that the field is not anonymised, it is not guaranteed not to be. This is the default anonymisation technique.</c>
       	    <c>1</c><c>None: the values exported are real.</c>
       	    <c>2</c><c>Precision Degradation/Truncation: the values exported are anonymised using simple precision degradation or truncation. The new precision is implicit in the exported data, and can be deduced by the Collecting Process.</c>
       	    <c>3</c><c>Binning: the values exported are anonymised into bins.</c>
       	    <c>4</c><c>Enumeration: the values exported are anonymised by enumeration.</c>
       	    <c>5</c><c>Permutation: the values exported are anonymised by random permutation.</c>
       	    <c>6</c><c>Prefixed Permutation: the values exported are anonymised by random permutation, preserving bit-level structure; this represents prefix-preserving IP address anonymisation.</c>
       	  </texttable>

         </t>
       	<t hangText="Abstract Data Type: ">unsigned8</t>
       	<t hangText="ElementId: ">TBD2</t>
       	<t hangText="Status: ">Proposed</t>
       </list>
      </section>      
      
      <section title="informationElementIndex">
       <list style="hanging">
         <t hangText="Description: ">
           A zero-based index of an Information Element referenced by informationElementId within a Template referenced by templateId; used to disambiguate scope for templates containing multiple identical Information Elements.</t>
         <t hangText="Abstract Data Type: ">unsigned16</t>
         <t hangText="ElementId: ">TBD3</t>
         <t hangText="Status: ">Proposed</t>
       </list>
      </section>      
    </section>

  </section>

  <section title="Applying Anonymisation Techniques to IPFIX Export and Storage">

    <t>When exporting or storing anonymised flow data using IPFIX, certain
    interactions between the IPFIX Protocol and the anonymisation techniques
    in use must be considered; these are treated in the subsections below.</t>

    <section title="Arrangement of Processes in IPFIX Anonymisation">

      <t>Anonymisation may be applied to IPFIX data at three stages within
      the collection infrastructure: on initial export, at a mediator, or
      after collection, as shown in <xref target="loc-fig"></xref>. Each of these
      locations has specific considerations and applicability.</t>

        <figure title="Potential Anonymisation Locations" anchor="loc-fig">
          <artwork><![CDATA[
            
                    +--------------------+
                    | IPFIX File Storage |
                    +--------------------+
                      ^
                      | (Anonymised after collection)
                      |
            +=======================================+
            | Collecting Process                    |
            +=======================================+
              ^                                   ^
              | (Anonymised at mediator)          |
              |                                   |
            +=============================+       |
            | Mediator                    |       |
            +=============================+       |
              ^                                   |
              |    (Anonymised on initial export) |
              |                                   |
            +=======================================+
            | Exporting Process                     |
            +=======================================+
          ]]></artwork>
        </figure>

      <t>Anonymisation is generally performed before the wider dissemination
      or repurposing of a flow data set, e.g., adapting operational
      measurement data for research. Therefore, direct anonymisation of flow
      data on initial export is only applicable in certain restricted
      circumstances: when the Exporting Process is "publishing" data to a
      Collecting Process directly, and the Exporting Process and Collecting
      Process are operated by different entities. Note that certain guidelines
      in <xref target="header-anon"/> with respect to timestamp anonymisation
      may not apply in this case, as the Collecting Process may be able to
      deduce certain timing information from the time at which each Message is
      received.</t>

      <t>A much more flexible arrangement is to anonymise data within a <xref
      target="I-D.ietf-ipfix-mediators-framework">Mediator</xref>. Here,
      original data is sent to a Mediator, which performs the anonymisation
      function and re-exports the anonymised data. Such a Mediator could be
      located at the administrative domain boundary of the initial Exporting
      Process operator, exporting anonymised data to other consumers outside
      the organisation. In this case, the original Exporter SHOULD use TLS as
      specified in <xref target="RFC5101"/> to secure the channel to the
      Mediator, and the Mediator should follow the guidelines in <xref
      target="guidelines"></xref>, to mitigate the risk of original data
      disclosure.</t>

      <t>When data is to be published as an anonymised data set in an <xref
      target="I-D.ietf-ipfix-file">IPFIX File</xref>, the anonymisation may be
      done at the final Collecting Process before storage and dissemination,
      as well. In this case, the Collector should follow the guidelines in
      <xref target="guidelines"/>, especially as regards File-specific
      Options in <xref target="opt-anon"/>.</t>

      <t>Note that anonymisation may occur at more than one location within a
      given collection infrastructure, to provide varying levels of
      anonymisation reversal risk and utility for specific purposes.</t>

    </section>
    
    <section title="IPFIX-Specific Anonymisation Guidelines" anchor="guidelines">

      <t>In implementing and deploying the anonymisation techniques described
      in this document, care must be taken that data structures supporting the
      operation of the protocol itself do not leak data that could be used to
      reverse the anonymisation applied to the flow data. Such data structures
      may appear in the header, or within the data stream itself, especially
      as options data. Each of these and their impact on specific
      anonymisation techniques is noted in a separate subsection below.</t>

      <section title="Appropriate Use of Information Elements for Anonymised Data" anchor="iespec-anon">
        <t>[TODO: reiterate black-marker guidelines here]</t>
        
        <t>[TODO: note that precision degradation SHOULD use appropriately-sized fields]</t>

      </section>

      <section title="Anonymisation of Header Data" anchor="header-anon">

        <t>Each IPFIX Message contains a Message Header. Two fields within
        this Message Header may be used to break certain anonymisation
        techniques: the Export Time and the Observation Domain ID.</t>

        <t>An Exporting Process exporting IPFIX Messages containing
        anonymised timestamp data, where the original Export Time Message
        Header field has some relationship to the anonymised timestamps,
        SHOULD anonymise the Export Time header field using an equivalent
        technique, if possible. Otherwise, relationships between export and
        flow time could be used to partially or totally reverse timestamp
        anonymisation.</t>

        <t>The similarity in size between an Observation Domain ID and an IPv4
        address (32 bits) may lead to a temptation to use an IPv4 interface
        address on the Metering or Exporting Process as the Observation Domain
        ID. If this address bears some relation to the IP addresses in the
        flow data (e.g., shares a network prefix with internal addresses) and
        the IP addresses in the flow data are anonymised in a
        structure-preserving way, then the Observation Domain ID may be used
        to break the IP address anonymisation. Use of an IPv4 interface
        address on the Metering or Exporting Process as the Observation Domain
        ID is NOT RECOMMENDED in this case.</t>

        <!--<t>[EDITOR'S NOTE: We might want to see if anyone is actually doing
        this with IPFIX. The example comes from other network measurement
        tools (e.g. Argus) which default to using an IPv4 address as a sensor
        ID.]</t>-->

      </section>

      <section title="Anonymisation of Options Data" anchor="opt-anon">

        <t>IPFIX uses the Options mechanism to export, among other things,
        metadata about exported flows and the flow collection infrastructure.
        As with the IPFIX Message Header, certain Options recommended in <xref
        target="RFC5101"/> and <xref target="I-D.ietf-ipfix-file">the IPFIX
        File Format</xref> containing flow timestamps and network addresses of
        Exporting and Collecting Processes may be used to break certain
        anonymisation techniques; care should be taken while using them with
        anonymised data export and storage.</t>

        <t>The Exporting Process Reliability Statistics Options Template,
        recommended in <xref target="RFC5101"/>, contains an Exporting Process
        ID field, which may be an exportingProcessIPv4Address Information
        Element or an exportingProcessIPv6Address Information Element. If the
        Exporting Process address bears some relation to the IP addresses in
        the flow data (e.g., shares a network prefix with internal addresses)
        and the IP addresses in the flow data are anonymised in a
        structure-preserving way, then the Exporting Process address may be
        used to break the IP address anonymisation. Exporting Processes
        exporting anonymised data in this situation SHOULD mitigate the risk
        of attack either by omitting Options described by the Exporting
        Process Reliability Statistics Options Template, or by anonymising the
        Exporting Process address using a similar technique to that used to
        anonymise the IP addresses in the exported data.</t>

        <t>Similarly, the Export Session Details Options Template and Message
        Details Options Template specified for the <xref
        target="I-D.ietf-ipfix-file">IPFIX File Format</xref> may contain the
        exportingProcessIPv4Address Information Element or the
        exportingProcessIPv6Address Information Element to identify an
        Exporting Process from which a flow record was received, and the
        collectingProcessIPv4Address Information Element or the
        collectingProcessIPv6Address Information Element to identify the
        Collecting Process which received it. If the Exporting Process or
        Collecting Process address bears some relation to the IP addresses in
        the flow data (e.g., shares a network prefix with internal addresses)
        and the IP addresses in the flow data are anonymised in a
        structure-preserving way, then the Exporting Process or Collecting
        Process address may be used to break the IP address anonymisation.
        Since these Options Templates are primarily intended for storing IPFIX
        Transport Session data for auditing, replay, and testing purposes, it
        is NOT RECOMMENDED that storage of anonymised data include these
        Options Templates in order to mitigate the risk of attack.</t>

        <t>The Message Details Options Template specified for the <xref
        target="I-D.ietf-ipfix-file">IPFIX File Format</xref> also contains
        the collectionTimeMilliseconds Information Element. As with the Export
        Time Message Header field, if the exported flow data contains
        anonymised timestamp information, and the collectionTimeMilliseconds
        Information Element in a given Message has some relationship to the
        anonymised timestamp information, then this relationship can be
        exploited to reverse the timestamp anonymisation. Since this Options
        Template is primarily intended for storing IPFIX Transport Session
        data for auditing, replay, and testing purposes, it is NOT RECOMMENDED
        that storage of anonymised data include this Options Template in order
        to mitigate the risk of attack.</t>

        <t>Since the Time Window Options Template specified for the <xref
        target="I-D.ietf-ipfix-file">IPFIX File Format</xref> refers to the
        timestamps within the flow data to provide partial table of contents
        information for an IPFIX File, care must be taken to ensure that
        Options described by this template are written using the anonymised
        timestamps instead of the original ones.</t>

        <!--<t>[EDITOR'S NOTE: what about other non-standard templates
        containing the same or similar IEs?]</t>-->

      </section>

    </section>
  </section>

  <section title="Examples">

    <t>[TODO: write this section.]</t>

   </section>

  <section title="Security Considerations">

    <t>[TODO: write this section.]</t>

   </section>
  
  <section title="IANA Considerations">
    <t>This document contains no actions for IANA.</t>
    
    <t>[EDITOR'S NOTE: creation of anonymisationStability and anonymisationTechnique registries may change this.]</t>
    
  </section>

  <section title="Acknowledgments">

    <t>We thank Paul Aitken for his comments and insight, and the PRISM
    project for its support of this work.</t>

  </section>


</middle>   

<back>

  <references title="Normative References">
    &rfc5101;
    &rfc5102;
  </references>

  <references title="Informative References">
    &draftIpfixAs;
    &draftIpfixArchitecture;
    &draftIpfixFile;
    &draftIpfixMedframe;     
    &rfc3917;
    &rfc2119; 
<!--    
    <reference anchor='cryptopan'>
      <front>
        <title>Prefix-Preserving IP Address Anonymization</title>
        <author initials='J' surname='Fan' fullname='Jinliang Fan'>
          <organization />
        </author>
        <author initials='J' surname='Xu' fullname='Jun Xu'>
          <organization />
        </author>
        <author initials='M' surname='Ammar' fullname='Mostafa H. Ammar'>
          <organization />
        </author>
        <author initials='S' surname='Moon' fullname='Sue B. Moon'>
          <organization />
        </author>
        <date month='October' day='7' year='2004' />
        <abstract/>
      </front>

      <seriesInfo name='' value='Computer Networks, Volume 46, Issue 2, Pages 253-272, Elsevier'/>
    </reference>
-->
  </references>

</back>
</rfc>
