[RRG] FLOWv6: IPv6 Flow Label to control DFZ forwarding
- To: Routing Research Group <rrg@psg.com>
- Subject: [RRG] FLOWv6: IPv6 Flow Label to control DFZ forwarding
- From: Robin Whittle <rw@firstpr.com.au>
- Date: Thu, 31 Jul 2008 22:39:17 +1000
- Organization: First Principles
- User-agent: Thunderbird 2.0.0.16 (Windows/20080708)
I woke up this morning thinking:
The IPv6 Flow Label isn't being used.
How many bits is it??
20 . . . . !!!!
This is an attempt to achieve what we are trying to do with
map-encap schemes and with Six/One Router - but without using
tunneling or translation.
Instead, an IFR (Ingress Flow Router - like an ITR) uses mapping
info to set the packet's Flow Label, which is then used by somewhat
modified BGP routers in the DFZ to forward the packet across the DFZ
to the EFR (Egress Flow Router - like an ETR).
This proposal is about 14 hours old, and I am keen to work with
other people on it. Please let me know any concerns and suggestions
for improvements. The text below is a rough first draft and is
written as from "we", since I hope to have some collaborators on
this project.
The advantages seem to include:
1 - The packets do not get any longer.
2 - Therefore, no extra length problems adding to PMTUD
(Path MTU Discovery) difficulties.
3 - Therefore, no extra packet-length overheads, which are
particularly pernicious for IPv6 map-encap of VoIP packets.
See examples at the start of:
http://psg.com/lists/rrg/2008/msg02034.html
4 - The packet and its addresses are not changed at all.
5 - This means that Traceroute and ordinary PMTUD should work
without any special measures, all the way through the
"Flow Label forwarded" part of the path.
6 - There is no problem with header checksums, crypto protocols
etc. since AFAIK the Flow Label is not used in any header
checksums.
7 - The system relies on significant, but probably not overly
complex, modifications to the DFZ routers' FIB and RIB.
The process of using the 20 bit Flow Label to directly look
up the FEC (Forwarding Equivalence Class) is much less
expensive than using Tree-Bitmap to chomp through up to
48 bits of address.
8 - The system should support the TTR approach to mobility just
like Ivip, with TTR to Mobile Node tunneling remaining, and
the ITR to TTR tunnels replaced by Flow Label forwarding.
9 - No host changes or changes to the BGP protocol.
10 - IFR and EFR functionality is much simpler than that of Ivip ITRs
and ETRs - which is simpler than that of other map-encap
schemes.
The main disadvantages seem to be:
1 - Can't work for IPv4.
2 - Requires significant, but probably not too complex, changes
to the FIB and RIB of most DFZ routers. Hopefully this can
be done with state-of-the-art routers simply with firmware
changes, since the FIBs of recent Cisco and Juniper routers
are based on CPUs.
3 - Requires a change to the formal semantics of what is currently
called the "Flow Label" - which should perhaps be renamed,
since it is being used for purposes other than "Flow".
I suppose this is also a "Locator / Identifier Separation"
solution, but I never liked that term - and it most certainly does
not involve any new locator or identifier namespace, at least for IP
addresses. The 20 bit Flow Label does involve a new
namespace.
FWIW, I still think IPv6 addresses are about 64 bits too long. This
bloats all the headers and forces even 64 bit CPUs to shovel two
long words around - just for an IP address!
However, if an approach such as FLOWv6 is successfully implemented,
this would overcome my biggest objection to IPv6: that every DFZ
router needs to chew its way laboriously through up to 48 bits of
destination address for every packet it handles.
- Robin
FLOWv6
======
A scalable routing and addressing proposal for IPv6, using
the 20 bit Flow label and modified BGP routers to forward
packets across the DFZ to the EFR (ETR) - rather than
using encapsulation or translation.
Robin Whittle and <Your Name too??>
Introduction
------------
Most potentially practical solutions to the routing scaling problem
involve map-encap (LISP, APT, Ivip and TRRP), which adds to the
length of packets as they are tunnelled from an ITR to an ETR. This
presents efficiency problems and requires some complex handling of
problems arising from disruption of Path MTU Discovery (PMTUD)
mechanisms.
An alternative to encapsulation is translation: rewriting the source
and destination addresses of packets as they enter and leave the
end-user networks which use the new kind of scalable, multihomable,
portable, address space. Six/One Router is the only current
proposal which uses translation. Translation avoids the
inefficiencies of the extra headers used by map-encap. Translation
does not extend the packet lengths - which should result in fewer
PMTUD problems than with map-encap.
However, rewriting the source and destination addresses can
cause problems for encrypted protocols, and seems likely to disrupt
conventional PMTUD mechanisms. For the latest version of Six/One
Router, and a tentative critical review, see:
http://psg.com/lists/rrg/2008/msg02034.html
FLOWv6 is neither a translation nor a map-encap scheme. It does not
directly resemble MPLS, and it does not concern the conventional
concept of a "Flow".
FLOWv6 aims to achieve the goals of a map-encap scheme such as Ivip,
but without encapsulation, and ideally without disrupting PMTUD. In
the following description, FLOWv6 replicates some characteristics of
Ivip which distinguish it from the other map-encap schemes,
including fast push of mapping information for real-time control by
end-users. This makes the functions of multihoming monitoring and
decision making for service restoration completely separate from
FLOWv6 - as they are from Ivip. These functions are monolithically
integrated into LISP, APT, TRRP and Six/One Router.
If this proposal goes ahead, it might be best to rename the 20 bits
in the IPv6 header to something other than "Flow Label" - so this
proposal and some of its terms would need to be renamed too.
Encapsulation is an even more undesirable operation for IPv6 than
for IPv4, since it adds at least 40 bytes to each packet. This is
for simple IP-in-IP encapsulation, as used by Ivip - while for IPv4
the overhead is 20 bytes. For IPv6 map-encap with LISP (IP, UDP and
LISP headers) the overhead is 56 bytes. With traffic volumes
multiplying rapidly, and potentially hundreds of millions of people
sending 50 VoIP packets a second, each with typically 20 bytes of
payload, it is highly desirable to avoid encapsulation in the
scalable routing solution for IPv6.
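To make the overhead concrete, here is a rough Python sketch. The 40 and 56 byte encapsulation overheads are from the figures above; the 80 byte VoIP packet (40 IPv6 + 8 UDP + 12 RTP + 20 payload) is our assumption:

```python
# Illustrative overhead arithmetic: map-encap inflates every packet,
# FLOWv6 (which only rewrites the Flow Label) adds nothing.
# Assumed VoIP packet: 40 (IPv6) + 8 (UDP) + 12 (RTP) + 20 (payload).
VOIP_PACKET = 40 + 8 + 12 + 20

def overhead_fraction(extra_bytes, packet_bytes=VOIP_PACKET):
    """Fraction by which encapsulation inflates the packet."""
    return extra_bytes / packet_bytes

ip_in_ip = overhead_fraction(40)   # IPv6-in-IPv6, as used by Ivip
lisp     = overhead_fraction(56)   # IPv6 + UDP + LISP headers
flowv6   = overhead_fraction(0)    # FLOWv6: no extra bytes at all

print(f"IP-in-IP: +{ip_in_ip:.0%}, LISP: +{lisp:.0%}, FLOWv6: +{flowv6:.0%}")
```

For these assumed VoIP packets, plain IP-in-IP costs 50% extra bytes and LISP-style map-encap 70% - which is why avoiding encapsulation matters so much here.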
The IPv6 header has a 20 bit "Flow Label" which is currently not
used for any substantial purpose. Rather than encapsulate a packet
in order to tunnel it to an ETR address, FLOWv6 involves BGP routers
in the core forwarding each packet according to a route associated
with the value of the packet's Flow Label. As far as we know, the
Flow Label is not used in any checksums or cryptographic protocols,
so there should be no unwanted effects when its value is changed.
If we are prepared to contemplate the following, then FLOWv6 or some
similar approach should be considered as a solution for the IPv6
routing scaling problem:
1 - We would replace the current semantics of the Flow Label:
http://tools.ietf.org/html/rfc2460#appendix-A
with some new specification, applicable always in the
"core" - the BGP routers of IPv6's inter-domain routing
system - and probably generally applicable inside
conventional and new networks as well.
2 - The functions of the control plane and forwarding plane of
core routers would need to be modified - in relatively
minor ways. The control plane changes would surely be
feasible via firmware upgrades.
The forwarding plane changes could be quite minimal.
If we assume that all (or a sufficient proportion of)
core routers at the time of FLOWv6 introduction will use
CPU-based FIBs which can be enhanced with firmware
updates, then these changes are probably feasible.
The changes are to the way the router's RIB controls
the FIB, and how the FIB works. There is no change to
the BGP protocol, so the upgraded routers will work
fine alongside non-upgraded routers. It should be
possible to run FLOWv6 globally even when most, but not
necessarily all, IPv6 BGP routers are upgraded.
3 - We would need to be happy with an approximately 1 million
limit on the number of separate, BGP-managed, prefixes
in which the new kind of end-user networks could be
connected. This is not as restrictive as having no more
than a million ETRs in a map-encap system. It is a limit
on the number of separate provider network sites where one or
more EFRs (like ETRs) could be located.
Since the installed base of IPv6 core routers is relatively low and
since the Flow Label is not widely used, the above conditions seem
reasonable to us.
Changing BGP implementations and FIB functions is a non-trivial
step; however, the result will be all core routers using a highly
regularised 20 bit field for forwarding, which is generally much
more attractive than the current situation, in which an algorithm
such as Tree-Bitmap is laboriously executed some number of bits at a
time on the destination address of each packet. This typically
involves parallel CPUs and expensive DRAM lookups in order to step
through up to 48 bits of address, before the FEC (Forwarding
Equivalence Class) of the packet can be determined.
There may well be hash and cache approaches to avoid this effort on
every packet, but the use of the Flow Label will be simpler than
those too.
Because of this, we believe that FLOWv6 has the potential to greatly
ease the IPv6 FIB workload in the DFZ.
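The contrast can be sketched in Python. This is a toy illustration only - real FIB hardware is nothing like this, and the table contents and interface names are invented:

```python
# Toy contrast: longest-prefix match on a 128 bit destination address
# versus a single direct array read indexed by the 20 bit Flow Label.

# Conventional FIB model: entries keyed by (prefix bits, prefix length).
LPM_TABLE = {
    (0x2001 << 112, 16): "if0",        # an ordinary /16 route
    (0xE0000001 << 96, 32): "if3",     # an EBP-style /32 route
}

def lpm_lookup(addr128):
    """Linear-scan longest-prefix match (real routers use tries etc.)."""
    best_len, best_fec = -1, None
    for (bits, length), fec in LPM_TABLE.items():
        mask = ((1 << length) - 1) << (128 - length)
        if addr128 & mask == bits and length > best_len:
            best_len, best_fec = length, fec
    return best_fec

# FLOWv6 FIB model: the Flow Label indexes FINDEX[] directly.
FINDEX = ["if0"] * (1 << 20)
FINDEX[1] = "if3"          # EBP-0001 currently best reached via interface 3

def flow_lookup(flow_label):
    return FINDEX[flow_label]   # one array read, no per-bit trie walk
```

Both lookups return the same FEC for a packet headed to EBP-0001, but the Flow Label path is a single indexed read.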
As a routing scaling solution, like map-encap schemes or Six/One
Router, the aim is to keep the BGP system handling primarily
prefixes for providers - with end-users having address space which
provides their needs in a manner which does not add to this number
of prefixes, or otherwise burden the control plane of the BGP
system. These needs include portability between ISPs, multihoming
and traffic engineering.
Detailed Description
====================
The following is a preliminary attempt at describing a proposal we
have just started to develop. We expect that there will be many
suggestions which will improve it. Please let us know your concerns
and suggestions. There may well be some gotchas or showstoppers we
haven't considered.
Terminology
-----------
Conventional networks
Provider and end-user networks whose address prefixes
are managed exactly as they are today - by advertising
each one in the global BGP system.
Conventional networks which have no IFRs (Ingress
Flow Routers) are known as "non-upgraded" networks.
SEN - Scalable End-user Network
A network for an end-user using the new SPI (defined
below) form of address space. In order to solve the
routing scaling problem, we need to have most or all
new end-user networks, and many, most or all existing
(conventional) end-user networks adopt SPI space.
Therefore we need to make this new form of address space
and type of network highly attractive to the great
majority of end-users, of all sizes - including for
instance corporations, universities, schools and large
hosting companies.
All SENs either have their own IFRs (defined below)
or are connected to the Net via one or more
conventional networks which provide IFRs to
handle their outgoing packets. So the term "non-
upgraded network" only applies to a conventional
network without IFRs - never to a SEN.
SPI - Scalable PI - address space
A new form of address space, intended solely for end-
user networks (all networks other than those of
Internet Service Providers) which is Provider
Independent, but in a manner which supports scalable
routing.
Conventional PI prefixes are each globally advertised
in the BGP system. The large number of these
prefixes, and their rate of change, is the cause of
the routing scaling problem.
SPI address space remains stable for each end-user
network, no matter which one or more ISPs they use
to connect to the Net. SPI space is therefore
entirely portable and can be used for multihoming.
Ideally SPI space can also be used for inbound
Traffic Engineering too. In the current Ivip-like
description of FLOWv6, inbound TE is achieved
indirectly within certain limits, rather than with
the explicit load balancing arrangements of the
other map-encap schemes. Nonetheless, both Ivip and
FLOWv6 enable very fine-grained control of mapping
with real-time user control - and this may result in
inbound TE which is superior to that possible with
the other non-real-time control map-encap schemes.
For IPv4, translation schemes are not suitable and
there is no 20 bit flow label in the IPv4 header -
so SPI IPv4 space is likely to be provided with a
map-encap scheme.
In the context of what follows, "SPI space" refers to
the new kind of address space provided by FLOWv6.
Micronet
A contiguous sequence of IP addresses - of the new
SPI type of address space - which are mapped to a
single "locator" address. In most map-encap schemes,
the micronet concept is implemented as an EID
(Endpoint IDentifier) prefix.
In Ivip and FLOWv6, a micronet is not necessarily a
binary-boundary prefix. In FLOWv6 it is an integer
number of contiguous /64 prefixes. The granularity
of the mapping system in FLOWv6 is one /64.
In the mapping system, a micronet is specified by
a starting address (64 bits) and a length, in
/64 steps. In principle, the length may be up
to 64 bits, however we may limit it to 32 bits.
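As a sketch, a micronet record of this kind might look like the following (the class and field names are our invention; the /64 granularity and 128 bit EFR address are as specified above):

```python
from dataclasses import dataclass

@dataclass
class Micronet:
    """A micronet: an integer number of contiguous /64s, mapped to one EFR."""
    start: int    # starting /64, i.e. the top 64 bits of the base address
    length: int   # number of contiguous /64s
    efr: int      # 128 bit address of the EFR this micronet is mapped to

    def covers(self, addr128: int) -> bool:
        """True if a 128 bit address falls within this micronet."""
        slot = addr128 >> 64          # which /64 the address sits in
        return self.start <= slot < self.start + self.length
```

For instance, Net-Y's two-/64 micronet from the example below would be `Micronet(start=0x4000005099996666, length=2, efr=...)`.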
UAB - User Address Block
This is a contiguous range of addresses which are
controlled by one end-user. An end-user may be
as large as a corporate or university network, or
simply an individual who has a mobile device,
such as a cellphone.
UABs are integer numbers of /64s, just like
micronets. They are specified by a 64 bit
starting address and a 64 bit length. IFRs,
EFRs (described below) and the mapping
distribution system do not use UABs. A UAB
is an administrative construct.
End users can divide their UAB into as many
micronets as they like, and each micronet
can be mapped to any 128 bit IP address -
the address of an EFR (Egress Flow Router),
which forwards the packet to the destination
SEN network. So a single UAB could be used to
create multiple micronets, and each micronet
could be mapped to a different EFR, in any ISP
in any country.
MAB - Mapped Address Block
A BGP advertised prefix in which the enclosed address
space is managed by the scalable routing system: for
instance map-encap or in this case FLOWv6.
While technically a single MAB could provide space
for just one SEN, this would help little - or not at
all - with the routing scaling problem.
Generally, each MAB should be relatively large compared
to the size of micronets. (That said, some SENs may
need only a single micronet of /64, and others may require
many more, much larger, micronets - so there isn't a
typical size of micronet.)
Generally, each MAB should include a large number of
micronets, such as hundreds or millions of them. This
will enable the micronets serving the needs of very large
numbers of SEN end-user networks to be handled from an
area of the address space which requires a single BGP
advertisement.
IFR - Ingress Flow Router.
Similar in function to a map-encap ITR (Ingress
Tunnel Router).
The most obvious location for the IFR function to be
implemented is at the border routers - BGP routers
at the borders of all conventional networks which
SENs use to connect to the Net.
The IFR function processes all packets whose
destination address falls within a micronet, to
set the Flow Label bits to a value which uniquely
identifies the BGP advertised prefix towards which
this packet should be forwarded by all BGP routers
in the inter-domain core. (Full explanation below.)
IFRs can be servers or dedicated routers. They
can be located inside conventional networks - they
need not be a BGP router at the border of a
conventional network.
IFRs can also be located inside SEN networks and
are likely to be found at the border of an SEN
network and the one or more conventional provider
networks which the SEN network uses to connect to the Net.
The third place an IFR function can be found is,
in effect, in the DFZ - where it is known as an OIFRD.
IFRC - IFR with Cache
IFRs are typically caching IFRs: IFRCs. They cache
the mapping information they currently require,
and do not attempt to store a copy of the entire
mapping database.
IFRH - IFR function in sending Host
A caching IFR function can also be built into a
sending host. This could be a zero cost approach
reducing or eliminating the need for separate IFRs.
All caching IFRs need an address which can be reached
from anywhere, so they can receive mapping updates
from query servers. Like IFRCs, IFRHs can be on
conventional addresses or SPI addresses - but not
behind NAT.
OIFRD - Open Ingress Flow Router in the DFZ.
These are the FLOWv6 equivalent of Ivip's OITRDs
- which do much the same job as LISP's PTRs (Proxy
Tunnel Routers).
OIFRDs are distributed around the Net, conceptually
"in the DFZ" to attract and process packets sent to
micronet addresses by hosts in "non-upgraded"
conventional networks: those which have no IFRs of
their own.
In fact, OIFRDs are within or at the border of some
conventional AS network. Likely locations are
Internet exchanges, peering points etc.
They are ideally close to non-upgraded networks, so the
total path travelled by the packet from its source,
through the OIFRD and to the EFR, is not much longer
than the most direct path from the sending host to the
EFR which serves the destination host.
Ideally, in the future, all conventional IPv6
networks will have their own IFRs and OIFRDs will not
be needed.
An OIFRD advertises one or more MABs and so attracts
packets sent from nearby non-upgraded networks.
It then does what all IFRs do: use mapping
information to set the Flow Label of the packet so
that (upgraded) BGP routers will forward the packet
towards the EFR to which this micronet is currently mapped.
The business case for OIFRDs is identical to that for
Ivip OITRDs:
http://psg.com/lists/rrg/2008/msg02021.html
EFR - Egress Flow Router
This is analogous to the Egress Tunnel Router (ETR)
in a map-encap scheme.
FLOWv6 uses the Flow Label so the core BGP routers -
and perhaps internal routers between the BGP border
router and the EFR - will forward the packet to this
EFR. This is somewhat more elaborate and flexible
than in the map-encap system and is described more
fully below.
There is no encapsulating header to remove (map-
encap), and no addresses to rewrite (Six/One Router).
The EFR will probably zero the Flow Label. The most
important function it performs is that it recognises
from the destination address which SEN network the
packet should be forwarded to. (The destination
address has been ignored by the BGP routers, since they
used the Flow Label to decide forwarding.)
When the IFR uses the packet's destination address
to look up the mapping information for the micronet
which covers that address, the return value is the
exact 128 bit address of the EFR. Below we explain how
this enables the IFR to set the Flow Label so that all
core BGP routers will forward the packet towards the
one or more BGP border routers which advertise the
prefix in which the EFR is located.
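Given the EBP numbering used in the example below (/32 EBPs inside E00::/12), the IFR could derive the 20 bit Flow Label from the mapped EFR address with simple bit arithmetic. A Python sketch under that assumed layout:

```python
def flow_label_for_efr(efr_addr: int) -> int:
    """Return the 20 bit EBP index for a 128 bit EFR address inside E00::/12
    (assuming the example layout: /32 EBPs, 20 index bits below the /12)."""
    if (efr_addr >> 116) != 0xE00:
        raise ValueError("EFR address is not inside the EBP block E00::/12")
    # The EBP /32 prefix is the top 32 bits; its low 20 bits are the index,
    # which is exactly the value the IFR writes into the Flow Label.
    return (efr_addr >> 96) & 0xFFFFF
```

So a micronet mapped to an EFR at E000:0003::xyz gets its packets labelled 3, and every upgraded core router forwards them towards whoever advertises EBP-0003.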
FPER - Flow Path Exit Router
This is fully described below. It is a BGP router
at the border of a network which houses one or more
EFRs. So this FPER router advertises one or more EBP
prefixes (next item).
This router performs a mapping lookup on the
destination address and ensures the packet is
forwarded to the correct EFR. The Flow Label caused
the packet to be forwarded to this router, across the
core, but the Flow Label alone cannot determine which
of two or more EFRs in this network the packet
should be forwarded to.
EBP - EFR Block Prefix
As described more thoroughly below, the IPv6 address
space is administered to create a regular series of
prefixes, each of which can be advertised in BGP.
There are 2^20 such prefixes: 1,048,576. Each has
the same length, say /32. /48 would probably be fine
too.
A conventional provider network which has one or more
EFRs and a single "site" (such as a network in a city,
or a data centre) needs one EBP. If it has multiple
such sites and does not want to ferry EFR-addressed
traffic between them, then it needs a separate
EBP for each such site.
All EFRs are located on addresses within one of these
EBPs. Below we discuss administrative arrangements
for this limited resource of about a million EBPs.
It may not be necessary to use the full number at any
time in the future. Perhaps a few tens of thousands
will be all that is required for the foreseeable
future, assuming IPv6 is widely adopted.
Mapping system
This is a system by which end-users can issue
commands which change their micronets' starting
points and lengths, and by which they can change
the 128 bit EFR address to which each micronet is
mapped.
The primary task of the mapping system is to enable
the IFRs to rapidly and reliably find out what EFR
address any incoming packet should be forwarded to.
An "incoming packet" in this context means a packet
the IFR has received and identified as having a
destination address which may be within a micronet -
by virtue of the fact that the address is within
one of the MABs. (In the example below, this is
easy to determine, since all the MABs - and no
other types of BGP advertised prefix - are in
4::/3.)
Various mapping systems could be used, such as the
pure pull systems of LISP-ALT and TRRP, the pure
push system of LISP-NERD, or the hybrid push-pull
systems of APT (slow) or Ivip (fast).
The following discussion assumes the use of Ivip's
mapping system, as described for that proposal.
This is a fast-push global system of Replicators
which conveys end-user mapping commands to full-
database Query Servers located in conventional
networks which have ITRs (IFRs for FLOWv6).
These local query servers quickly and reliably
provide responses to mapping queries from IFRs
in that network, or from nearby networks, including
from SEN networks which use this conventional network
for access to the Net. This avoids the delay and
reliability problems of the global query server
approaches (LISP-ALT and TRRP), while not requiring
every ITR/IFR to carry a copy of the full mapping
database (LISP-NERD).
The full database query server issues map reply
messages securely to the querying IFR, with a
caching time, such as 10 minutes. During that
time, the IFR is assumed to be Flow Labelling
(encapsulating and tunnelling, for Ivip) packets
which are addressed to this micronet.
If the query server is told by the mapping system
of changed mapping for this micronet (or that the
micronet has been deleted) then it needs to send a
Cache Update command (AKA "Notify" command) to that
IFR.
This hybrid push-pull system ensures all IFRs which
need the mapping information get it within a few
seconds of the end-user issuing the mapping change
command.
Please refer to the overall Ivip Summary and Analysis
documentation and the Ivip Fast Push Internet Draft
for further details:
http://www.firstpr.com.au/ip/ivip/
QSD - Query Server with full Database
QSDs get the full continual feed of mapping updates
from the fast push mapping system.
They handle queries from nearby IFRs - IFRCs and
IFRHs. (An IFRD is really an IFRC with an
integrated QSD, or an IFRC using a QSD in the same
rack, connected directly by Ethernet.)
QSC - Query Server with Cache
These can optionally be deployed, so there may be one
or more layers of QSCs between IFRCs/IFRHs and the
nearest one or several QSDs.
When a QSC has no cached information which answers
a query, it passes the query upwards to (or towards,
via one or more QSCs) the nearest local QSD.
When the QSC receives the response, it caches it and
sends the response downwards to (or towards, via one
or more QSCs) the IFRC/IFRH which made the request.
Likewise, when a QSC gets a Cache Update message from
a QSD above it (perhaps via one or more QSCs), it
passes it downwards to whatever IFRCs, IFRHs or QSCs
below it which, in the last 10 minutes (for instance)
queried the mapping for this micronet.
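The QSC behaviour just described might be modelled like this. A toy Python sketch: the 10 minute (600 second) caching time is from the text, while the class shape and callback structure are our assumptions:

```python
import time

class QSC:
    """Toy caching query server: a miss goes up to the parent, the answer is
    cached for ttl seconds, and Cache Updates are pushed down to whoever
    queried this micronet recently."""
    def __init__(self, parent_query, ttl=600):     # 600 s = 10 minutes
        self.parent_query, self.ttl = parent_query, ttl
        self.cache = {}          # micronet -> (efr_addr, expiry_time)
        self.interested = {}     # micronet -> {callback, ...} for push-down

    def query(self, micronet, notify_cb):
        """Answer a query from below; remember who asked."""
        self.interested.setdefault(micronet, set()).add(notify_cb)
        hit = self.cache.get(micronet)
        if hit and hit[1] > time.time():
            return hit[0]                          # answered from cache
        efr = self.parent_query(micronet)          # miss: ask upwards
        self.cache[micronet] = (efr, time.time() + self.ttl)
        return efr

    def cache_update(self, micronet, new_efr):
        """Called from above when the mapping changes; propagate downwards."""
        self.cache[micronet] = (new_efr, time.time() + self.ttl)
        for cb in self.interested.get(micronet, ()):
            cb(micronet, new_efr)
```

A real QSC would also expire the "interested" sets and authenticate updates; the sketch only shows the query-up / push-down flow.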
Tutorial by way of example
==========================
For simplicity, we assume that all core IPv6 routers have been
upgraded for FLOWv6. In a section below we discuss transition
arrangements while not all routers have FLOWv6 upgrades.
We will also ignore OIFRDs in this explanation - the IFRs which
collect and Flow Label packets sent from non-upgraded networks and
which are addressed to micronet addresses.
In this example, EBPs are /32s and a prefix E00::/12 has been
reserved for them. Consequently, the first few EBPs are:
EBP-0 E000:0000::/32
EBP-1 E000:0001::/32
EBP-2 E000:0002::/32
EBP-3 E000:0003::/32
and the highest is:
EBP-1048575 E00F:FFFF::/32
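Converting between an EBP index and its /32 prefix under this numbering is simple bit arithmetic. A Python sketch of the example layout (function names are ours):

```python
EBP_BASE = 0xE000 << 112       # E000::, the bottom of the E00::/12 reservation

def ebp_prefix(index: int) -> int:
    """EBP-<index> as a 128 bit address whose top 32 bits are the /32 prefix."""
    assert 0 <= index < (1 << 20)
    return EBP_BASE | (index << 96)

def ebp_index(prefix128: int) -> int:
    """Recover the 20 bit index from an EBP /32 prefix."""
    return (prefix128 >> 96) & 0xFFFFF

def fmt(addr128: int) -> str:
    """Render the top 32 bits in the E000:0001::/32 style used above."""
    top = addr128 >> 96
    return f"{top >> 16:04X}:{top & 0xFFFF:04X}::/32"
```

This reproduces the series above: EBP-1 is E000:0001::/32 and EBP-1048575 is E00F:FFFF::/32.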
So far, 8191 have been allocated, in principle only to operators of
provider networks. EBPs are only needed by a network which hosts
EFRs, and any such network is inherently providing Internet access.
Technically, the EBPs could be allocated to operators of
conventional end-user networks, but then those would not truly be
end-user networks any more.
EBP-0 is reserved. (The final design may reserve more low numbered
or high-numbered EBPs for other purposes.) In our example, the
allocated EBPs include:
EBP-0001 E000:0001::/32 ISP-A (has only one "site")
EBP-0002 E000:0002::/32 } ISP-B (has 30 "sites")
EBP-0003 E000:0003::/32 }
... }
EBP-0031 E000:001F::/32 }
EBP-0032 E000:0020::/32 } ISP-C (has two "sites")
EBP-0033 E000:0021::/32 }
It is not desirable to have a million EBPs, since each is advertised
in BGP and so places a burden on the entire core routing system.
EBPs are only allocated to organisations which need them, and pay
for them. (At a later date we will develop plans for administering
these EBPs and for the commercial and regulatory aspects of FLOWv6.)
The ISPs generally have other "conventional" prefixes, outside this
special EBP set - as they do today. The ISPs use these
"conventional" prefixes for their own internal purposes, and for
some of their customers. Those customers use the space in today's
"PA" manner. Whether they get a single IP address or a prefix, and
whether they get it for a short dial-up or mobile session, or for
some period of years, the space they get is only available as long
as they use this ISP. It is "PA" - Provider Assigned - space and
therefore not portable to other ISPs.
These conventional prefixes and their PA usage have nothing to do
with the SPI space provided by FLOWv6.
We will consider two end-user networks with SPI space: Net-X and
Net-Y. For simplicity of explanation, their micronets are from the
same MAB.
In our example, the prefix 4::/3 has been reserved for MAB prefixes.
It is not absolutely necessary for all MABs to be in any reserved
prefix such as this, but it would simplify the functionality of IFRs
and EFRs.
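With all MABs inside 4::/3, an IFR's test for "could this destination be SPI space?" reduces to comparing the top three address bits. A Python sketch of this example layout:

```python
def in_mab_space(addr128: int) -> bool:
    """True if a 128 bit address falls inside 4::/3, the example's reserved
    MAB block (covering 4000:: through 5FFF:FFFF:FFFF:FFFF:FFFF:FFFF:FFFF:FFFF).
    4::/3 means the top 3 bits are 010."""
    return (addr128 >> 125) == 0b010
```

Only addresses passing this test need a mapping lookup at all; everything else is forwarded conventionally.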
In IPv4, for a map-encap system, there is no chance of making all
the MABs appear in some clearly defined subset of the whole address
space - since, over the next five to ten years, there needs to be
progressive conversion of a great deal of the whole address space
into MABs.
In IPv6, by administrative fiat, it would be easy for the IANA to
carve out two special prefixes which would make the FLOWv6 system
simpler to implement. In addition to the above-mentioned E00::/12
reservation for 2^20 EBP prefixes, in our example, the IANA reserves
1/8 of the entire IPv6 address space for MABs: 4::/3 .
(See http://www.iana.org/assignments/ipv6-address-space
for current assignments.)
Some company D - probably, but not necessarily an ISP or an RIR -
has been assigned the MAB:
4000:0050::/24
There could be 2 million MABs of this size in the 4::/3 reservation.
MABs don't necessarily need to be of the same size, or to be
contiguous. Still, it probably makes sense to standardise all MABs
to the same size - a simplifying convenience which can't be
achieved in the crowded IPv4 space.
We don't want tens of millions of MABs. Ideally, we probably want a
few dozen or at most a few hundred. Each MAB will have its own
stream of mapping updates. Each OIFRD will advertise one or more -
or potentially all - active MABs.
D rents some of this MAB's space to Net-X and Net-Y. This rental is
effectively permanent. Unless D goes broke (in which case the space
would be taken over by another company such as D and probably
administered to preserve the previous assignments), X and Y can have
their space for as long as they like.
Both Net-X and Net-Y pay D for their space, such as a certain fee
per year for each /64. They also pay D for the mapping changes they
make. This would probably be a charge per update, or some flat fee
for a certain number of updates per month.
In this fast-push mapping distribution system, it is important that
end-users pay for the updates they send on the system. The fee may
be as low as a few cents per update. These fees help pay for most
of the fast-push system, especially the Launch servers and
Replicators. This occurs through company D and others like it, who
directly or indirectly pay for the operation of the fast-push system.
The fee per update also discourages "excessive" use - such as
changing the mapping every few seconds for months on end - to
implement fancy TE, or just to create annoyance. Each mapping
change involves a small amount of computation, storage and
communications bandwidth in the entire fast-push system and in all
recipient QSDs.
The cost per update will be very low - low enough that end-users
with busy networks will find it attractive to use frequent mapping
changes to fine-tune the inbound TE of their multiple links.
The space of a network would be split into separate micronets, each
with some recipient hosts. By dynamically changing the EFR each
micronet is mapped to, the incoming traffic volume can be managed in
real-time and directed as desired to each of the two or more EFRs
and so via each of the two or more links from the two or more ISPs.
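The kind of inbound TE described here amounts to periodically re-mapping micronets across the available EFRs. As a toy Python sketch - the greedy least-loaded policy is purely our invention; real end-users would apply their own policies:

```python
def balance(micronet_rates, efrs):
    """Toy inbound-TE policy: assign each micronet (with a measured inbound
    traffic rate, e.g. Mbit/s) to the currently least-loaded EFR, heaviest
    micronets first. Returns {micronet: efr} - the mapping changes the
    end-user would then push into the mapping system."""
    load = {efr: 0.0 for efr in efrs}
    mapping = {}
    for mnet, rate in sorted(micronet_rates.items(), key=lambda kv: -kv[1]):
        efr = min(load, key=load.get)     # least-loaded EFR so far
        mapping[mnet] = efr
        load[efr] += rate
    return mapping
```

Each run of such a policy yields a batch of mapping-change commands, which is where the per-update fees discussed above come in.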
Net-X and Net-Y also pay D for D's operation of a global network of
OIFRDs which handle packets addressed to the above-mentioned MAB,
sent by hosts in non-upgraded networks. This means that Net-X and
Net-Y will probably pay according to traffic flowing through the
OIFRDs which was addressed to each end-user's micronets.
This is because one SPI end-user network might have only a small
amount of space, perhaps just a single micronet of /64, but could
run a very popular web site on it, and so generate far more OIFRD
traffic than another end-user network, which has much more space.
D would have a sampling system to estimate OIFRD traffic; it would
not make sense to count every byte.
In the following examples, ordinary IPv6 prefix notation will be
used to show the base address and length of each micronet, but in
practice the micronets can start and end at any /64 boundary.
Net-X has the micronet:
4000:0050:7000::/48
This is 65,536 contiguous /64s:
4000:0050:7000::
to 4000:0050:7000:FFFF:FFFF:FFFF:FFFF:FFFF.
This sounds like quite a large micronet, but it is technically valid
and perhaps there will be call for such micronets.
Net-Y's micronet has just two /64s:
4000:0050:9999:6666::/63
Micronets and UABs can range from a single /64, in principle, to as
many /64s as fit in the MAB. In this case, the /24 MAB covers 2^40 -
about 1.1 trillion - /64s.
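The /64 counts used in these examples are simple powers of two (a Python sketch):

```python
def slash64_count(prefix_len: int) -> int:
    """Number of /64s inside a prefix of the given length (length <= 64)."""
    assert 0 <= prefix_len <= 64
    return 1 << (64 - prefix_len)

print(slash64_count(48))   # Net-X's micronet: 65,536 /64s
print(slash64_count(63))   # Net-Y's micronet: 2 /64s
print(slash64_count(24))   # the whole /24 MAB: 2**40, about 1.1 trillion /64s
```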
Before depicting the passage of a packet through the FLOWv6 system,
we will describe the function of the EBP prefixes.
While an ISP could use space within an EBP prefix for any purpose,
here we assume that all ISPs use these prefixes solely for EFRs.
Our example involves two EBP prefixes:
EBP-0001 E000:0001::/32 ISP-A
EBP-0003 E000:0003::/32 ISP-B's "Site-2".
ISP-A advertises its EBP-0001 from a single border router.
ISP-B advertises its EBP-0003 from two border routers at its second
site.
The BGP system treats these EBP prefixes exactly the same as
ordinary BGP prefixes. All BGP routers therefore develop and
maintain best paths for both these prefixes, and likewise for all
the other EBP prefixes.
The enhanced BGP RIB functionality specifically recognises this set
of EBP prefixes (8191, or however many are defined), because they
fall within the IANA-defined prefix E000::/12.
The new RIB function is programmed to detect each such /32 EBP
prefix, and to copy its FEC value (the internal value by which the
router's FIB knows which interface to forward the packet from) to a
special array in the FIB. This is the FINDEX[] array.
FINDEX[] is indexed 0 to (2^20 - 1).
Each element in FINDEX[] stores a FEC value, copied straight from
the FEC of the corresponding EBP in the RIB.
So in a given core router, if the BGP RIB has decided that the best
path towards ISP-A's EBP-0001 is "Interface 3", then the FEC value
which represents "Interface 3" is copied to location 1 in FINDEX[].
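As a rough sketch of this RIB-to-FIB copying step (in Python, with
illustrative data structures - the FEC here is just an interface name,
and the prefix parsing is deliberately simplified):

```python
FLOW_LABEL_BITS = 20

# FINDEX[] maps a Flow Label value directly to a FEC.  Index 0 is
# effectively unused, since Flow Label 0 means "not set by any IFR".
FINDEX = [None] * (2 ** FLOW_LABEL_BITS)

def install_ebps(rib_best_paths):
    """Detect each /32 EBP prefix (within E000::/12) among the RIB's
    best paths and copy its FEC into FINDEX[], indexed by the low
    20 bits of the prefix's top 32 bits."""
    for prefix, fec in rib_best_paths.items():
        addr, plen = prefix.split("/")
        if plen != "32":
            continue
        # Top 32 bits of the prefix, e.g. "E000:0001" -> 0xE0000001
        top32 = int("".join(addr.split(":")[:2]), 16)
        if (top32 >> 20) != 0xE00:       # must lie within E000::/12
            continue
        FINDEX[top32 & 0xFFFFF] = fec    # e.g. EBP-0001 -> index 1

# ISP-A's EBP-0001 has its best path via "Interface 3":
install_ebps({"E000:0001::/32": "Interface 3"})
```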
With FLOWv6, it is required that all packets being handled in the
BGP core have their Flow Label set according to the following rules:
Set to 0 if the packet has not had its Flow Label set to a
particular value by any IFR.
Any non-zero value is assumed by all core routers (we
assume in this example they are all upgraded to FLOWv6
functionality) to represent the fact that this packet's
destination address is for a micronet which is currently
mapped to some EFR whose address is within a particular
EBP - where this EBP is directly specified by the value
of the Flow Label.
Having set the stage, we now provide an example packet flow, a
packet sent by a host HA to another host HB.
HA is on a conventional address in some ISP's BGP advertised prefix,
or in a conventional PI space end-user network.
HB is in Net-X's /48 micronet mentioned above:
4000:0050:7000::/48
HB's address is 4000:0050:7000:1234::33.
Net-X is currently using ISP-B's second site for Internet access,
and the address of the EFR incoming packets should be forwarded to
(via FLOWv6's Flow Label direct forwarding system, described below) is:
E000:0003:0000:0055::7
The packet is sent by HA and forwarded by its network's internal
routing system towards a border router, which also has IFR
functions. The IFR function recognises it as being addressed to
somewhere in the SPI (Scalable PI) address space, since all such
space is defined to be within a micronet - and since all micronets
are within MABs and all MABs within the prefix 4::/3, as is this
packet's destination address.
In our example, the IFR has no cached mapping information for this
address. A subsequent packet from HA to HB will undergo a less
complex process, due to the presence of cached mapping data in the
IFR's FIB.
When the packet is analysed by the FIB, the result is of the form:
This packet is addressed to a section of the address space
which is known to be covered by the FLOWv6 scheme, but the
FIB currently has no mapping information for this particular
address.
Therefore, hold the packet and query the routing processor
to ask for the mapping information. Later, when this
arrives, the packet will have its Flow Label set and then
will be forwarded to a BGP router in the core.
Subsequent packets matching the micronet which was
specified in the mapping reply will be handled by a
faster, FIB-only, process which sets the Flow Label
to the same value, and again forwards the packet to the
core.
This is one of four initial responses the FIB could produce. The
other three are listed here:
http://psg.com/lists/rrg/2008/msg02029.html
Briefly, they are:
Send the packet conventionally, with a normal FIB lookup
of its destination address
Use cached mapping information for this packet's
destination prefix to set the Flow Label, as above,
before forwarding the packet to the core.
Drop the packet, or process it via some slower
and more arduous mechanism - which is not needed for
FLOWv6.
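A minimal sketch of this four-way decision in the IFR's FIB (Python;
the cache format and function names are assumptions for illustration):

```python
def in_spi_space(top64):
    # All MABs lie within 4::/3, so the top 3 address bits are 010.
    return (top64 >> 61) == 0b010

def ifr_fib_action(top64, cache):
    """Decide the IFR FIB's initial response for a packet whose
    destination's top 64 bits are top64.  cache is a list of
    (micronet_base_top64, prefix_len, flow_label) entries."""
    if not in_spi_space(top64):
        return ("forward_conventionally", None)
    for base, plen, label in cache:
        if (top64 >> (64 - plen)) == (base >> (64 - plen)):
            return ("set_flow_label_and_forward", label)
    # No cached mapping: hold the packet and query the routing
    # processor.  (The fourth response - drop or slow-path the
    # packet - is not needed for FLOWv6.)
    return ("hold_and_query_mapping", None)

# HB's address 4000:0050:7000:1234::33, with Net-X's micronet cached:
cached = [(0x4000005070000000, 48, 0x00003)]
print(ifr_fib_action(0x4000005070001234, cached))
```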
Once in the core the packet is handled by one or more upgraded BGP
routers.
In our example, the IFR requests mapping information for the
packet's destination address:
4000:0050:7000:1234::33
Actually, since the mapping system's granularity is /64, the map
request is for the 64-bit value, in hex:
4000 0050 7000 1234
Within a few tens of milliseconds, the response from the local QSD
(full database query server) comes back to the effect:
[ The queried address is within the micronet:
[
[ 4000:0050:7000::/48
[
[ which is currently mapped to the EFR at:
[
[ E000:0003:0000:0055::7
[
[ Cache this response for 600 seconds.
The FLOWv6 section of the IFR's RIB caches this information, and
processes it into a form to be sent to the FIB:
{ Any incoming packet matching:
{
{ 4000:0050:7000::/48
{
{ should have its Flow Label set to:
{
{ (hex) 0 0003
{
{ and should then be handled by the
{ usual forwarding mechanism.
The RIB sends this to the FIB, and by one means or another the FIB
matches the stored packet to this new rule. (600 seconds later, the
RIB will tell the FIB to delete the above rule.)
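The rule and its 600-second lifetime could be sketched like this
(illustrative only - a real FIB would use hardware timer machinery,
not a Python dictionary):

```python
import time

class FlowLabelCache:
    """Micronet -> Flow Label rules pushed from the RIB to the FIB,
    each deleted again once its lifetime expires."""
    def __init__(self):
        self.rules = {}   # (base_top64, prefix_len) -> (label, expiry)

    def install(self, base, plen, label, ttl=600, now=None):
        now = time.time() if now is None else now
        self.rules[(base, plen)] = (label, now + ttl)

    def lookup(self, top64, now=None):
        now = time.time() if now is None else now
        for (base, plen), (label, expiry) in list(self.rules.items()):
            if expiry <= now:
                del self.rules[(base, plen)]   # rule has timed out
                continue
            if (top64 >> (64 - plen)) == (base >> (64 - plen)):
                return label
        return None

# Net-X's /48 micronet, mapped to Flow Label (hex) 0 0003:
cache = FlowLabelCache()
cache.install(0x4000005070000000, 48, 0x00003, now=0)
```

Looking up HB's address before the 600 seconds elapse returns 0 0003;
afterwards the rule has been deleted and the lookup falls back to a
fresh mapping query.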
Now the packet has its Flow Label set to (hex) 0 0003, and the FIB's
forwarding mechanism (enhanced to perform this additional FLOWv6
function) looks at its Flow Label, discovers it is 3, and uses this
to index into the array FINDEX[].
This produces the correct FEC value for this packet - the number
which will cause it to be sent out the interface which leads to the
BGP router which is the best path towards the prefix in which the
EFR is located.
Once it reaches that router, the same process happens:
Is the Flow Label != 0?
Yes: use it to index into FINDEX[] to retrieve FEC.
Forward according to this FEC value.
This process is repeated at each DFZ router the packet traverses,
until it reaches a BGP router at the border of the provider network
in which the EFR is located.
This will be very much faster and simpler than the usual process of
analysing up to 48 bits in the destination address with the
Tree-Bitmap algorithm.
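The contrast can be sketched as follows (Python; `normal_fib_lookup`
stands in for the router's conventional longest-prefix-match
machinery):

```python
def forward(packet, findex, normal_fib_lookup):
    """Per-hop forwarding in an upgraded core router: a non-zero
    Flow Label is a direct index into FINDEX[], bypassing the
    longest-prefix-match (e.g. Tree-Bitmap) lookup entirely."""
    if packet["flow_label"] != 0:
        return findex[packet["flow_label"]]   # single O(1) array read
    return normal_fib_lookup(packet["dst"])   # conventional LPM

findex = [None] * (2 ** 20)
findex[3] = "Interface 3"   # best path towards EBP-0003 (ISP-B, Site-2)
fec = forward({"flow_label": 0x00003, "dst": None}, findex,
              lambda dst: "conventional lookup")
```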
In this way, as long as the packet is handled by an upgraded BGP
router, it will be forwarded towards one of the border routers of
ISP-B's Site-2.
Note that the packet does *not* contain any address which refers to
the prefix advertised by ISP-B's Site-2:
EBP-0003 E000:0003::/32
The Flow Label was set just once by the IFR in the source site.
Once set, the packet is easily handled by (upgraded) BGP routers.
When the packet reaches the border router for ISP-B's Site-2, that
router performs a somewhat different operation, because the FEC
value in its FINDEX[3] selects an interface which does not point to
any BGP router in the core. This FEC value leads to some internal
router.
Because of three things:
1 - The next hop for this packet is internal, rather than
to a core BGP router
2 - This border router is a FPER (Flow Path Exit Router).
3 - The packet has a non-zero Flow Label.
this FPER router now performs a special operation. There are two
forms, depending on the local conditions.
The first form applies if Site-2 has more than one EFR, and the
packet must be forwarded to the correct EFR:
The FPER router sets the Flow Label to 0. (Or perhaps to
some other value which has a useful meaning inside its
internal routing system - more on this below.)
The FPER router performs a mapping lookup on the destination
address, just like the IFR did. The FIB needs to do this
itself, not involving the RIB, unless the mapping has not
already been cached in the FIB.
(In most cases, this FIB will already have the mapping
information cached, since the router will have been
continually receiving packets for micronets which
have been mapped to EFRs at this site. So this will
typically not involve any delay, communication
activity or RIB, router processor etc. activity.)
This mapping lookup produces the address of the EFR to
which the micronet is mapped. (The micronet which encloses
the packet's destination address.)
Now the FPER router needs to forward the packet to that
EFR. Perhaps the EFR is in fact this FPER router.
The second form is if there is only one EFR at this site, or if
there are multiple EFRs and all of them can handle the packets for
all the micronets which are mapped to any EFR address in this site.
In this case, there is no need for a mapping lookup or
cached mapping information. The packet is forwarded to
the one EFR - or to one of the many EFRs.
It is a private matter in the provider network how the FPER gets the
packet to the appropriate EFR. One approach might be to reserve a
number of the 2^20 possible Flow Label values to have significance
only outside the core: inside provider networks. Then, a system
similar to that just described can be used by internal routers to
change the Flow Label to some value which identifies a particular
EFR in the site. This way, the packet could be transported from HA
to HB, entirely via the use of the Flow Label, without rewriting any
other part of the packet, and without tunneling, encapsulation etc.
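The FPER's two forms might be sketched as follows (Python; the
function and parameter names are assumptions):

```python
def fper_handle(packet, site_efrs, mapping_lookup):
    """Handle a packet whose non-zero Flow Label brought it to this
    FPER, and whose next hop is internal.  site_efrs lists the EFR
    addresses at this site; mapping_lookup returns the EFR to which
    an address's micronet is currently mapped (normally answered
    from the FIB's own cache)."""
    packet["flow_label"] = 0   # or an internally significant value
    if len(site_efrs) == 1:
        # Second form: one EFR (or any EFR will do), so no mapping
        # lookup is needed.
        return site_efrs[0]
    # First form: several EFRs, so find which one this packet's
    # micronet is mapped to.
    return mapping_lookup(packet["dst"])
```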
Transition: non-upgraded networks
---------------------------------
The task of this transition arrangement is to ensure that packets
sent by hosts in networks without IFRs are all forwarded to an
OIFRD, where they can have their mapping looked up and their Flow
Label set appropriately. The same principles which apply to Ivip
OITRDs apply also to FLOWv6 OIFRDs:
They should be distributed widely around the Net.
They should be able to handle peak packet rates without
unreasonable losses.
Their locations should minimise the extent to which packets
take a longer overall path than they would without FLOWv6.
They will be paid for by the organisations which rent micronet
space to end-users.
Transition: non-upgraded core routers
-------------------------------------
FLOWv6 is only going to be useful once a substantial number,
probably a majority, of DFZ routers have the FLOWv6 upgrades. This
is a significant hurdle for deployment, although perhaps tunneling
could be used initially when only a few DFZ routers are upgraded.
There needs to be a way the system can work reliably even when some
percentage of routers are not upgraded - such as 20% or less.
The most important thing to ensure is that each upgraded BGP router,
including the border routers, never forwards to any non-upgraded
router a packet which has its Flow Label set. The non-upgraded
router is likely to ignore the Flow Label, and do a standard BGP FIB
operation on the destination address.
This undesirable situation would result in the packet being
forwarded towards the nearest OIFRD which is advertising the MAB
which encloses the destination address. There, according to the
above algorithms, the packet will have its Flow Label set again to
the same value it already has, and that Flow Label will be used to
forward it to a router which should take it towards the network
which has the EFR.
The packet could easily get into a loop and so be dropped once its
Hop Limit reaches zero.
Folks with BGP expertise will probably be able to suggest
better arrangements than this, but some possible techniques to
protect against this include:
Manually configure every upgraded BGP router not to accept
routes matching E000::/12 from neighbours which are not
upgraded.
and/or:
Manually configure every non-upgraded BGP router not to
accept (and therefore not to offer) any routes to any
neighbours if they match E000::/12.
There may still be problems with not enough upgraded routers in a
particular part of the core to handle the Flow Label forwarding of
packets.
Perhaps manually configured tunnels to some other nearby routers
would be a solution, but this raises various problems, including
ones to do with packet length.
PMTUD
-----
This proposal is at a very early stage of development, but it is
possible that there are no PMTUD problems with this approach.
The fact that packets do not get any longer is a major benefit
compared with map-encap systems. Solving those problems, including
making the best use of jumboframe paths in the DFZ, is quite
challenging:
http://www.firstpr.com.au/ip/ivip/pmtud-frag/
Assuming the Hop Limit is still decremented every time the packet is
handled by a router, Traceroute should still work fine through the
entire path, including the section where forwarding is controlled by
the Flow Label.
At any router in this part of the path, if the packet is too long
for the next-hop MTU, the router should be able to send a Packet Too
Big (PTB) message to the sending host. This is a major advantage
over map-encap schemes, where the source address may be that of the
ITR (not with Ivip, which uses the sending host's address) and where
the too-big packet is longer than and different from the packet sent
by the sending host - resulting in any PTB message not being
recognised by the sending host.
Any translation scheme (Six/One Router is the only one so far) would
have serious difficulties with PMTUD in the translated part of the
path, since the packet has different addresses from those it had when
it left the sending host. So even if the PTB were somehow sent back
to that host, a properly implemented PMTUD system on that host would
fail to recognise the PTB as relating to any packet this host sent.
TTR Mobility
------------
Any map-encap scheme, and Ivip in particular, can be adapted to
support a global mobility scheme with highly attractive
characteristics. A paper on this will appear soon. For now, the
descriptive material at:
http://www.firstpr.com.au/ip/ivip/#mobile
describes the Translating Tunnel Router approach to extending a
map-encap scheme for mobility.
It is not necessary to change the mapping every time the mobile node
gets a new care-of address. Typically a mapping change, to select a
new TTR, is only required when the care-of-address moves more than
about 1000km or so from wherever the current TTR is.
The TTR principles should apply in general to a system such as
FLOWv6. Instead of tunneling packets across the DFZ to the ETR-like
TTR, they would be forwarded according to the Flow Label.
However, the Flow Label approach won't work taking packets to and
from the mobile node and the TTR. So tunneling should be used for
this, as described in the above-mentioned material.
This raises some PMTUD problems. Fortunately, the TTR <--> MN
tunnel technology is not related at all to the map-encap scheme or
to the FLOWv6 system, and can be negotiated at set-up time between
the TTR and MN. This means that there does not need to be a single
fixed technology for this tunneling, enabling a variety of
techniques, innovation, and more localised potential solutions to
PMTUD.
Typically, those tunnels will be two-way and use the same techniques
as encrypted VPNs. PMTUD is much easier to handle over such two-way
tunnels than in a map-encap system, where an ITR has to get packets
to an ETR with which it has had no prior contact, and with which it
cannot reasonably engage in extensive communications.