
Re: [RRG] Thoughts on the RRG/Routing Space Problem - millions of micronets



Short version:

  We can be pretty relaxed about the proliferation of micronets,
  since the total burden of each such micronet should be a thousand
  or more times smaller than the current global cost
  of adding a prefix to the DFZ.  Nonetheless, I think there may
  need to be a small monetary charge for each change to a micronet's
  mapping.

  I also discuss:

    Implementation of ITR FIBs in the context of massive numbers
    of mapping changes, most of which are for addresses for which
    the ITR will never handle packets.

    ITRs which are full database ITRDs for some Mapped Address
    Blocks (MABs) and querying ITRCs for others.

  (MAB & micronet terminology is explained in:
   http://psg.com/lists/rrg/2007/msg00533.html )


Hi Russ and Stephen,

I understand you are concerned about the growth in the number of
separately handled pieces of address space ("micronets") in the
future ITR-ETR system.


Stephen, you wrote:

> This is the problem the DFZ sees today, yes, and the goal is
> to get all of that junk out of the routing (locator) tables.
> It's natural to assume all that junk will then inhabit the
> mapping tables -- and probably multiply like rabbits.  That
> means we've traded one problem for another, but we seem to
> be operating on the assumption that a mapping table is easier
> to scale than a routing table.

Russ, you wrote:

>> The EID-to-RLOC mapping can be far more granular since it
>> doesn't impact routing tables; you can give individual EIDs
>> different mappings if desired. This is far superior for
>> destination-based TE than what we have today, where you have to
>> apply BGP-based TE to an entire /24 or even /20.
> 
> Doesn't this increase the size of the Locator table, over time,
> to be precisely the size of the EID table? There appear to be
> several drivers in place to push more discrete routing into the
> fore--again, see below the sig for instances.
> 
> The net result is granularity ends up at the host level, or
> perhaps below, at the flow level. 

The concept of separate handling for distinct flows is not part of
the goals of the ITR-ETR schemes.  Their choice of which ETR to
tunnel the packet to depends solely on the packet's destination
address.  Some schemes may load-split to multiple ETRs, but as far
as I know, none of them do this on the basis of the characteristics
of the packet.

Since there is so much work on QoS and CoS, I suppose we should
expect ITRs one day to have to worry about this stuff.

ITRs are in the global Internet, and any prioritisation of one type
of packet (including according to destination or source addresses)
involves an allocation or withdrawal of router resources - and so
arguably needs to be paid for by someone.  (Unless the purpose is to
block bogus packets, or perhaps to discourage a P2P filesharing
system which threatens to overwhelm the Net...)  I think we are a
long way from the payment schemes which would allow end-users to pay
for defined QoS on the public global Internet.  Still, an ITR-ETR
scheme may need to cope with this in the future.

But perhaps you meant the ITR identifying flows and sending their
packets to different ETRs, which is not necessarily the same as
prioritisation or reserving router resources.


> I'm not certain mapping helps
> this--because mapping assumes that once you hit the leaf/transit
> divide, the granularity levels automatically decrease. I'm
> concerned this no longer true--the 80/20 rule is in it's last
> gasps, I think, and we have to figure out how to work in a world
> with 20/80, maybe, or 100/100, in some sense.
> 
> It's easy to say: "We'll map at the edge, and then aggregate the
> space we're mapping in to." Then along comes a situation where we
> need more granular traffic engineering than the mapping we're
> using, so we break the mapping space up into smaller chunks, and
> give each piece a more granular piece. This is of little cost to
> the provider who does it--it only increases the mapping table
> size by one--and gives them much more control over traffic flow
> to destinations for which traffic transits their network. There
> must be exceptions, and exceptions turn out to be the rule.

With all the ITR-ETR schemes, the cost to the whole system of adding
a micronet is far smaller than the cost of adding another prefix to
the DFZ.

One reason is that the new micronet only affects ITRs, and there are
assumed to be fewer ITRs than routers in the DFZ - or at least less
total traffic load on the ITRs than on the DFZ routers.  Even assuming
all traffic went through the ITR-ETR scheme, the "average" packet
flows through potentially one or more DFZ routers, then one ITR and
probably multiple DFZ routers - before reaching the ETR which is
pretty close to the destination host.

Another reason is that the cost to each ITR of another micronet
existing is very small compared to the cost to a DFZ router of having
to cope with another advertised prefix.

A "push" scheme which sends all mapping information to ITRs probably
involves the highest incremental total cost (to everyone) per extra
micronet and per change in mapping.

Mapping information about each distinct micronet, each time it
changes its mapping, would need to be sent to every ITR in the world.

Ivip mapping information for IPv4 is pretty compact:

  32 bits         Start of micronet
  16 to 32 bits   Length of micronet
  32 bits         ETR address (0 means drop the packet.)

So this is 12 bytes per change (with a 32 bit length field), plus
protocol overhead.  If there are
a million ITRDs and QSDs to send this to, then each change involves
an additional 20 Megabytes or so of traffic (including protocol
overhead).  The cost of this is not so high, but there probably
needs to be a charging system - so mapping is only changed when the
end-user has a reason strong enough to justify the expense.

There are likely to be some practical limits to this.  For instance,
sending the mapping information for every micronet in China to some
ITR in Patagonia seems inefficient, since the ITR generally handles
virtually no traffic to those micronets.  (Though Chinese tourists
will soon be all over the world, including probably Patagonia.)

In a "pull" scheme such as LISP-CONS or TRRP where the ITR caches
the mapping information, and requests it from some distant system
whenever it needs it, the cost burden of extra micronets is not so
obvious.  The only way it would impact the ITR would be if the
increased number of micronets resulted in greater numbers of
queries, dropped packets or cached mapping information.

If the ostensibly "pull" ITR-ETR scheme was intended to push news of
mapping changes to ITRs which are currently handling packets for the
micronets whose mapping has changed, then each such change imposes
quite a high cost on the ITR system.  TRRP proposes to do this -
though I think the currently proposed mechanism isn't complete or
robust.

Returning to a purely "pull" mapping system, in which all ITRs are
caching ITRs, making queries to some distant or local query server(s):

At Time A, an ITR is handling traffic to a range of individual
addresses in 11.22.33.0/24, which is all one micronet.  The first
time it asks for the mapping for one of these addresses, it gets a
reply which covers all 256 addresses.  Subsequently, when it
receives packets for other addresses in that /24, it can handle them
directly.  There are no more delayed or dropped packets and no
queries - all packets to this /24 are tunneled to the one ETR.

At Time B, when this /24 has been split into a dozen or so
micronets, as the packets come in, addressed to various addresses in
the /24, the ITR will need to make more queries than in the past,
and cache more mapping information, since each mapping reply only
applies to a small subset of the whole /24.  This also means more
delayed or dropped packets and probably more CPU time.  (See
discussion of FIBs later.)

So with a pull system such as this, I think the cost impact of more
micronets comes from the reduced commonality between the mapping
information for the destination addresses of the packets the ITR is
actually handling.  The ITR in Patagonia therefore has absolutely no
costs from millions of micronets in India or China until - as will
happen only occasionally - it handles packets addressed to such
distant micronets.

Even then, it only has an increased cost if there are packets to two
addresses which used to be in one micronet, but are now in two
separate micronets.
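
To make that caching behaviour concrete, here is a minimal Python
sketch of a purely "pull" ITR's cache, assuming each mapping reply
covers the whole micronet containing the queried address.  The names
MappingCache and query_server are hypothetical, not part of any of the
proposals:

  class MappingCache:
      """Toy cache for a purely 'pull' ITR: each mapping reply covers one
      whole micronet (start, length, ETR), so later packets to any address
      in that range need no further query."""

      def __init__(self, query_server):
          self.query_server = query_server   # callable: addr -> (start, length, etr)
          self.entries = []                  # cached (start, length, etr) tuples
          self.queries = 0

      def lookup(self, addr):
          for start, length, etr in self.entries:
              if start <= addr < start + length:
                  return etr                 # cache hit - tunnel directly
          self.queries += 1                  # cache miss - packet delayed or queued
          start, length, etr = self.query_server(addr)
          self.entries.append((start, length, etr))
          return etr

  # Time A: the whole /24 is one micronet, so one query covers 256 addresses.
  # Time B: the /24 is split into a dozen micronets, so up to a dozen queries,
  # more cached entries, and more delayed initial packets.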

The cost of each prefix in the DFZ is really heavy.  It creates
extra load on each DFZ router - probably more than 210k of them:

  http://psg.com/lists/rrg/2007/msg00253.html
  http://psg.com/lists/rrg/2007/msg00255.html
  http://psg.com/lists/rrg/2007/msg00257.html
  http://psg.com/lists/rrg/2007/msg00262.html

This load is proportional to the number of interfaces on each router,
and to the rate of change in how that prefix is advertised:

  http://psg.com/lists/rrg/2007/msg00630.html

Also, each change can result in BGP instability, such as path
hunting, which causes multiple messages to be received and sometimes
sent by each router.  (See the RAM list and the IDR WG lists in
early July.)

While a push or pull ITR-ETR system can cope with the changed
mapping quickly and directly, because it is specially built to do
so, the global BGP system relies on all the routers comparing notes
with their peers, with the new prefix's details spreading across the
DFZ from one router to the next, as each uses very limited information to
figure out how best to send packets to each prefix.


The total number of micronets in the system can be in the millions
or hundreds of millions - but pure push systems do have higher costs
as the number of micronets increases.  Likewise with more changes to
the micronets' mapping.

At least, with a push system, there is a much more straightforward
global distribution system than having it spread via the bandwidth-,
CPU- and memory-intensive process of BGP messages router-by-router.


People are getting nervous as the DFZ has to handle 250 thousand
prefixes.

I think we are planning ITR-ETR schemes so we will be relaxed about
them running with 250 million micronets.

If the new architecture means we can have 1000 or more micronets for
approximately the same cost (in terms of general burden on whoever
runs the Internet's main routers) as having an extra BGP prefix
today, I think this would be a successful outcome.  Maybe we can do
better.
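
Just to spell out the arithmetic behind that target (the 1/1000 figure
is the goal stated above, not a measured number):

  dfz_prefixes_today  = 250_000
  micronets_projected = 250_000_000
  relative_cost_per_micronet = 1 / 1000   # the "1000 or more times smaller" goal

  # Aggregate burden of the projected micronets, in "DFZ prefix equivalents":
  aggregate = micronets_projected * relative_cost_per_micronet
  print(aggregate)   # 250000.0 - roughly the same total burden as today's DFZ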

Stephen, you wrote:

> Ivip's goal of fast propagation is interesting, but the thought
> of millions of sites updating their EID mappings every few
> seconds based on server loads (and you know someone would try
> it...) scares me.

Despite each micronet being much less of a burden, I think there
needs to be some kind of charging scheme for micronets and changes
to their mapping.  This should at least deter "frivolous" use
(however defined - by those who run ITRs) and might ideally help
fund, to some extent, the ITRs or the push distribution / query system
which gets the mapping data to the ITRs.

Although Ivip and eFIT-APT have caching ITRs with local query
servers, there are potentially high costs with a full database ITR -
an ITRD in Ivip - whose full line-rate FIB is (ideally) kept totally
up to date with every change in the global mapping system.

It is costly enough having the ITR keep track of all mapping
changes, maintaining its own copy of the total mapping database.
Unfortunately, it is also costly to reprogram its FIB for every
received change in mapping.  Also, the FIB probably has some
capacity limitations in terms of the number of separate divisions in
the address space it can classify packets to, at full line rate.

Older FIB technologies, based on TCAM "memory" (actually massive
comparators and prioritisers), are power hungry and conceptually
simple.  They can classify a packet in a clock cycle (in principle),
but their update procedure is really messy.  Sometimes, a new rule
needs to be inserted in a way which requires moving many other
rules.  During this process, the TCAM can't be used for classifying
packets, so traffic is delayed or dropped.

TCAM isn't used in some (all?) new high end routers.  Instead, the
Tree-Bitmap algorithm, with lots of RAM and CPU resources, is used -
with each CPU chewing its way through the destination address,
navigating a tree structure in (slow, shared) DRAM, with probably
multiple read cycles per packet.  This is pretty frightening - it is
so labour intensive, and even worse for the longer addresses of
IPv6.  However, I understand that it can have a shorter update time
than TCAM - maybe without slowing or stopping packet forwarding.  In
particular, the worst case FIB update time is not much different from
the average, while TCAM's worst case time can be horrendous.
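
To illustrate why the per-packet tree walk in DRAM is so labour
intensive, here is a crude Python sketch of a one-bit-per-node
longest-prefix-match trie.  This is emphatically not the Tree-Bitmap
algorithm itself (which uses multi-bit strides and bitmaps to cut the
number of reads), just a way of showing that each lookup is a chain of
dependent memory reads:

  class TrieNode:
      __slots__ = ("children", "result")
      def __init__(self):
          self.children = [None, None]   # one node per address bit - far cruder
          self.result = None             # than Tree-Bitmap's multi-bit strides

  def insert(root, prefix, plen, result, width=32):
      node = root
      for i in range(plen):
          bit = (prefix >> (width - 1 - i)) & 1
          if node.children[bit] is None:
              node.children[bit] = TrieNode()
          node = node.children[bit]
      node.result = result

  def longest_match(root, addr, width=32):
      node, best, reads = root, None, 0
      for i in range(width):
          bit = (addr >> (width - 1 - i)) & 1
          if node.children[bit] is None:
              break
          node = node.children[bit]
          reads += 1                     # each step is a dependent memory read
          if node.result is not None:
              best = node.result
      return best, reads                 # more reads for deeper prefixes, and
                                         # worse again for 128 bit IPv6 addresses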


(I am assuming a newly engineered FIB, optimised to encapsulate the
packet to a single ETR, based solely on the packet's destination
address - as is the case with Ivip.  Current router FIBs are not
necessarily optimised for tunneling packets like this.  The other
ITR-ETR schemes involve more complex mapping data and the ITR making
decisions about which ETR of several to tunnel the packet to,
according to what it knows about reachability, and perhaps load
sharing.  Such a FIB would be more complex than the one required
for Ivip.)

Let's say the FIB is implemented in what is currently, AFAIK, the
biggest, fastest device available - the $80k Cisco CRS-1's MSC card.
That is a lot of money and power (375 watts) for a single printed
circuit!  However, it handles 40Gbps of incoming traffic.

The ingress FIB is implemented in a large slab (1 or 2 gigabytes) of
DRAM and the world's largest ASIC, with 188 32 bit 250MHz CPUs -
reportedly running the Tree-Bitmap algorithm.

I assume that updating the FIB has costs, so I want to find a way of
avoiding updating it except for addresses for which the ITR is
currently handling packets.

I propose that the FIB handle arbitrary packets by choosing one of
the following options, which it figures out according to its FIB data
(a minimal sketch follows the list):

  1 - Drops the packet.

  2 - Forwards it to a specified interface - or queue within that
      interface.  (This is the standard router FIB function - which
      would handle packets sent to ordinary BGP addresses - RLOCs in
      LISP terminology.  These would be ordinary traffic packets
      to ordinary non-mapped addresses, and packets tunneling
      encapsulated traffic packets to ETRs.)

  3 - Recognises that the packet is addressed to a micronet for
      which the FIB has already been configured.  The packet
      is encapsulated and tunneled to the correct ETR.  (This
      probably means the outer packet is presented again to the FIB,
      which will forward it as in 2 above.)  We want this to be the
      pathway for the great majority of packets the ITR needs
      to encapsulate.

  4 - Recognises that the packet is addressed to one of the MABs
      (prefixes which contain micronets) in the ITR-ETR scheme
      but that the FIB has not yet been configured with the
      correct ETR address for this micronet.  Probably, the
      specific micronet which the packet is addressed to would
      not be identified by the FIB.  The return result is:

         "We probably need to tunnel this packet to an ETR.
          Send the packet to the CPU (or maybe some fancy
          process in the FIB) so it can query the local copy
          of the mapping database (for an ITRD) - or query a
          query server (for an ITRC).  Then, the packet can be
          encapsulated and tunneled to that address, and that
          process will also update the FIB data so that in future,
          packets to this address will be tunneled directly by the
          FIB according to 3 above."
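
Here is a minimal Python sketch of that four-way decision, assuming
the FIB is modelled as a longest-match lookup over address ranges.
The names Action, handle and the itr methods are hypothetical,
purely for illustration:

  from enum import Enum

  class Action(Enum):
      DROP = 1             # option 1
      FORWARD = 2          # option 2: ordinary BGP-reachable (RLOC) address
      ENCAPSULATE = 3      # option 3: micronet already in the FIB - tunnel to ETR
      PUNT_TO_CPU = 4      # option 4: inside a MAB, mapping not yet in the FIB

  def handle(itr, packet):
      # The FIB's longest-match lookup returns one of the four actions plus
      # a detail: an interface for FORWARD, an ETR address for ENCAPSULATE.
      action, detail = itr.fib.longest_match(packet.dst)
      if action is Action.DROP:
          return
      if action is Action.FORWARD:
          itr.send(packet, interface=detail)
      elif action is Action.ENCAPSULATE:
          itr.send(itr.encapsulate(packet, etr=detail))
      elif action is Action.PUNT_TO_CPU:
          # CPU consults its full mapping copy (ITRD) or a query server (ITRC),
          # tunnels this packet, and updates the FIB so later packets to this
          # micronet are handled by option 3 directly.
          itr.cpu_resolve_and_update_fib(packet)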


This means the ITRD in Patagonia can dutifully collect the mapping
data for all the world, in the ordinary RAM of its CPU.  It will
only update its FIB when it needs to - "colouring in" the FIB tree
from its default "4" colour to have the specific details it needs to
quickly tunnel packets addressed to the micronet to which this packet
belongs.

So initially, the FIB might be in this state.  Here I am looking at
a small section at the bottom of a MAB which is 11.22.0.0/16:

   11.22.33.00    to   11.22.33.15   >> Tunnel to 55.66.77.88
   11.22.33.16    to   11.22.33.255  Handle with process 4
   11.22.34.00    to   11.22.34.255  >> Tunnel to 99.88.66.22

Then a packet arrives addressed to 11.22.33.32.

At this time, the relevant portion of the mapping database, as held
in CPU RAM in an ITRD or in a nearby full database Query Server is:

  11.22.33.00  length  16  >> Tunnel to 55.66.77.88
  11.22.33.16  length   1  >> Tunnel to 7.111.34.120
  11.22.33.17  length   1  >> Tunnel to 203.22.58.101
  11.22.33.18  length  14  XX Drop - no ETR.
  11.22.33.32  length   8  >> Tunnel to 147.32.115.201
  11.22.33.40  length  24  >> Tunnel to 23.49.176.227
  11.22.33.64  length  64  >> Tunnel to 156.2.45.128
  11.22.33.128 length 128  >> Tunnel to 43.94.110.33
  11.22.34.00  length 256  >> Tunnel to 99.88.66.22

The CPU tunnels the packet and updates the FIB - or maybe updates
the FIB and lets the FIB tunnel the packet.   After this, the FIB is
ready to handle further packets addressed to 11.22.33.32, or any of
the next 7 addresses, without further CPU involvement:

   11.22.33.00    to   11.22.33.15   >> Tunnel to 55.66.77.88
   11.22.33.16    to   11.22.33.31   Handle with process 4
   11.22.33.32    to   11.22.33.39   >> Tunnel to 147.32.115.201
   11.22.33.40    to   11.22.33.255  Handle with process 4
   11.22.34.00    to   11.22.34.255  >> Tunnel to 99.88.66.22
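
A minimal Python sketch of that "colouring in" step, using the mapping
database slice above.  The range-splitting code and the function names
are my own illustration, not a specification:

  PROCESS_4 = "Handle with process 4"

  def ip(s):
      a, b, c, d = (int(x) for x in s.split("."))
      return (a << 24) | (b << 16) | (c << 8) | d

  # CPU-RAM copy of the relevant slice of the mapping database: (start, length, ETR).
  mapping_db = [
      ("11.22.33.0",    16, "55.66.77.88"),
      ("11.22.33.16",    1, "7.111.34.120"),
      ("11.22.33.17",    1, "203.22.58.101"),
      ("11.22.33.18",   14, None),            # drop - no ETR
      ("11.22.33.32",    8, "147.32.115.201"),
      ("11.22.33.40",   24, "23.49.176.227"),
      ("11.22.33.64",   64, "156.2.45.128"),
      ("11.22.33.128", 128, "43.94.110.33"),
      ("11.22.34.0",   256, "99.88.66.22"),
  ]

  def micronet_for(addr):
      """Find the micronet covering 'addr' in the CPU's mapping database."""
      a = ip(addr)
      for start, length, etr in mapping_db:
          s = ip(start)
          if s <= a < s + length:
              return s, length, etr

  # FIB before the packet arrives: (start, end, result), with 'end' exclusive.
  fib = [
      (ip("11.22.33.0"),  ip("11.22.33.16"), "Tunnel to 55.66.77.88"),
      (ip("11.22.33.16"), ip("11.22.34.0"),  PROCESS_4),
      (ip("11.22.34.0"),  ip("11.22.35.0"),  "Tunnel to 99.88.66.22"),
  ]

  def colour_in(fib, addr):
      """Split the covering process-4 range so the micronet for 'addr' gets its
      own FIB entry; the rest of the range stays at the process-4 default."""
      s, length, etr = micronet_for(addr)
      new = []
      for start, end, result in fib:
          if result == PROCESS_4 and start <= s < end:
              if start < s:
                  new.append((start, s, PROCESS_4))
              new.append((s, s + length, f"Tunnel to {etr}" if etr else "Drop"))
              if s + length < end:
                  new.append((s + length, end, PROCESS_4))
          else:
              new.append((start, end, result))
      return new

  fib = colour_in(fib, "11.22.33.32")
  # The FIB now contains the three-way split shown in the table above.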

Then there need to be traffic counters (the CRS-1 MSC has 1M 64 bit
counters) so the CPU can check which micronets are still being used,
so some of them can be deleted from the FIB when no longer needed.


An ITRC would operate as described above - except that the CPU needs
to send a query to a nearby query server (somewhere in the same
network, not across the world) rather than consulting a local copy of
the mapping database.

So I am suggesting that ITRDs can minimise the costs of frequent and
generally non-useful updates to their FIB by only updating the FIB
to handle current traffic, but doing so very quickly, since a copy
of the full current database (a few seconds delayed from the various
centralised sources) is maintained in a compact form in ordinary CPU
memory in the same device.

The purpose of this is to retain the fast responsiveness, robustness
and conceptual simplicity of a full push system with an ITRD -
accepting the costs of the mapping data being sent to and processed
by the CPU - but without every such mapping change being a burden on
the FIB hardware.  That FIB hardware is expensive, chews a lot of power,
and is generally *busy*!

Another approach for maintaining ITR responsiveness while reducing
costs below those of a full push system is to have an ITRD which gets
full push mapping changes for some MABs and not for others.  So the
Patagonian ITR would get the full feed for MABs which are generally
used as destinations for packets sent from South America, but would
function as an ITRC - with some not-so-close query server (hopefully
closer than the other side of the world) - for MABs which generally
handle traffic to end-user networks in Siberia, India and China.
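
A minimal sketch of that hybrid arrangement, assuming a simple per-MAB
setting; the MAB labels, mode names and lookup functions here are
purely illustrative:

  # Per-MAB behaviour for a hybrid ITR: full-database ("push") for MABs this
  # ITR sends a lot of traffic to, caching/query ("pull") for the rest.
  mab_mode = {
      "MAB-A (South American destinations)": "push",
      "MAB-B (Siberian destinations)":       "pull",
      "MAB-C (Indian/Chinese destinations)": "pull",
  }

  def resolve_mapping(mab, addr, local_db, query_server):
      """Resolve a mapping locally for pushed MABs, or by querying a
      (possibly not-so-close) query server for the rest."""
      if mab_mode.get(mab) == "push":
          return local_db.lookup(addr)     # local full copy, kept current by the feed
      return query_server.query(addr)      # ITRC behaviour for this MAB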

I don't know how to structure a global mapping distribution system
in the Ivip ("a few seconds, ideally") style so there could be
arbitrary splits of the mapping data.  I will think about it once I
am happy with a system which simply gets the same data to every ITRD
and QSD.

  - Robin        http://www.firstpr.com.au/ip/ivip/

