[RRG] Re: TRRP Waypoint Routers



Hi Bill,

Thanks for your explanation:

  http://psg.com/lists/rrg/2008/msg00478.html

which helped me understand:

  http://bill.herrin.us/network/trrp-aapip.html

I now recognise that the TRRP Waypoint Router system bears no
resemblance to LISP-ALT.

You wrote:

> A waypoint is a combination ETR/ITR. It can accept packets with 
> any encapsulation that it advertises

I am not sure the Waypoint Router (WR) advertises anything - except
perhaps as noted later where it advertises its prefix in BGP.

It has an entry in the v4.trrp.arpa DNS structure (a domain name and
a record which can be returned by querying that domain name) which
tells ITRs that it accepts packets for a given prefix, with various
parameters for what guarantees it makes about delivering them, and
for how long the ITR should send "initial" packets to it.

> and will if necessary re-encapsulate them in a manner that the 
> next waypoint or final destination accepts.

I understand that the initial ITR treats the WR as it would an ETR -
it tunnels the traffic packets to it.  The WR typically tunnels the
traffic packet either to another WR or to an ETR (perhaps one device
could play both roles).

> Because there are perhaps a couple hundred top-level waypoints

I assume IPv4 in this discussion and that they would be for:

  1.0.0.0/8   2.0.0.0/8  ....  223.0.0.0/8

not counting 10, 127 etc.

> (compared to millions of final destination ETRs) and because 
> waypoint maps have a long timeout it's highly probable that the 
> ITR has already cached a valid waypoint.

OK. Any ITR which has been running for a while will already have
found out about the /8 WRs for the /8s it has been tunnelling
packets for.  WRs have very stable IP addresses and the ITR caches
their details for a long time.

This could cause trouble when you do move a WR.  I guess you would
keep one running on the old address for a week or whatever after it
was last mentioned in the v4.trrp.arpa DNS.
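The caching behaviour described above can be sketched as follows. This is a minimal sketch that assumes one waypoint per /8, keyed by the first octet, and a one-week lifetime. The class names, fields and the TTL value are all hypothetical and are not taken from the TRRP document.

```python
from __future__ import annotations

import ipaddress
import time
from dataclasses import dataclass


@dataclass
class WaypointEntry:
    """One cached /8 waypoint record (hypothetical structure)."""
    wr_address: str      # e.g. "148.129.75.8"
    mode: str            # e.g. "initial" or "limited"
    expires_at: float    # absolute clock time after which the entry is stale


class WaypointCache:
    """Long-lived cache of /8 waypoints, keyed by the destination's first octet."""

    def __init__(self, ttl_seconds: float = 7 * 24 * 3600, clock=time.monotonic):
        self.ttl = ttl_seconds   # assumed one-week lifetime
        self.clock = clock
        self._entries: dict[int, WaypointEntry] = {}

    def store(self, first_octet: int, wr_address: str, mode: str) -> None:
        self._entries[first_octet] = WaypointEntry(
            wr_address, mode, self.clock() + self.ttl)

    def lookup(self, destination: str) -> WaypointEntry | None:
        """Return the cached WR for the destination's /8, or None if absent or stale."""
        first_octet = int(ipaddress.IPv4Address(destination)) >> 24
        entry = self._entries.get(first_octet)
        if entry is None or entry.expires_at <= self.clock():
            # Missing or expired: the ITR has to run its search for this /8 again.
            self._entries.pop(first_octet, None)
            return None
        return entry
```

Because ITRs may keep an entry for a full TTL, a WR that has been moved needs to keep answering on its old address for at least one TTL after the old address was last published.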

> If it hasn't, the ITR has a search algorithm that guarantees it
> quickly will.

Yes, this search happens the first time the ITR handles a packet
addressed to a /8 it has not yet seen traffic for, but this will be
rare and in most cases will take 0.4 seconds or less.


>> Can you give a more concrete example of how these Waypoint 
>> Routers would be structured?
> 
> Sure. Let's say we have a source IP at 126.0.0.1 and he wants to
> talk to me in the swamp at 199.33.224.1. So, he sends the packet 
> out. Call it packet A. There is no BGP route that covers 
> 199.33.224.1,

Except perhaps as noted later regarding WRs advertising their
prefixes in BGP.

> so packet A follows 0.0.0.0/0 to the nearest TRRP ITR.

I assume this is an IGP default route, so the ITR is located inside
some end-user or ISP network.  If you have such ITRs advertising to
the routers of other ASes in the DFZ, then you will be supporting
traffic from non-upgraded networks, just as with Ivip's "anycast
ITRs in the core/DFZ" or LISP's "Proxy Tunnel Routers".


> The ITR doesn't yet have a map for 199.33.224.1. However, he does
>  have a waypoint map for 199.0.0.0/8: apparently the US
> Government has decided to be nice and offer an "initial" mode GRE
> waypoint as a public service at 148.129.75.8, which is within
> globally routable (BGP routed) space.

OK.  The ITR would have discovered this WR at 148.129.75.8 and some
information about it via a record it was sent after a query about
the domain name:

  8.waypoint.199.v4.trrp.arpa

which returned something like "00,wp,limited=5 80,g4,148.129.75.8".

> So, the ITR immediately encapsulates packet A in GRE and sends it
>  to 148.129.75.8. Then it initiates a lookup for 
> 1.224.33.199.v4.trrp.arpa so that it'll be able to send 
> subsequent packets directly.

OK.
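The two DNS names in this exchange follow a reverse-octet pattern. The helpers below are inferred only from the examples quoted in this thread, not from the TRRP specification, and their names are hypothetical. They simply build the /8 waypoint query name and the per-address map query name.

```python
import ipaddress

SUFFIX = "v4.trrp.arpa"  # zone suffix used in the examples above


def map_query_name(address: str) -> str:
    """Per-address map name, e.g. 199.33.224.1 -> 1.224.33.199.v4.trrp.arpa."""
    octets = str(ipaddress.IPv4Address(address)).split(".")
    return ".".join(reversed(octets)) + "." + SUFFIX


def waypoint8_query_name(address: str) -> str:
    """Name of the /8 waypoint record, e.g. 8.waypoint.199.v4.trrp.arpa."""
    first_octet = str(ipaddress.IPv4Address(address)).split(".")[0]
    return f"8.waypoint.{first_octet}.{SUFFIX}"
```

An ITR would issue the waypoint query once per /8 and then cache the result for a long time. It issues the map query for each new destination and uses the cached WR while that query is outstanding.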


> 148.129.75.8 doesn't have a MAP for 199.33.224.1 either. However,
>  I have a private waypoint set up for all of 199.33.224.0/23 in 
> "generous" mode at 71.246.241.146 (which is also within globally
>  routeable space). It accepts GRE as one of its formats. I have
> made arrangements with 148.129.75.8 to keep this knowledge in his
>  cache. Essentially, I push this knowledge to him.

My first critique is about security.  How can 148.129.75.8 know that
your WR 71.246.241.146 is authorised by you, the person to whom these
512 TRRP-mapped IP addresses (199.33.224.0/23) have in some way been
assigned?

Without some fancy security, an attacker could pretend their WR is
for your /23.

It would be OK if the address you push to 148.129.75.8 is the same
as one of the ETR addresses mentioned in at least one of your
micronets within that range (as it happens to be in this example),
because this is an independent way 148.129.75.8 can ensure that you
control, or authorise the use of, whatever is at this address
71.246.241.146.

Maybe the solution is to have your WR advertised in the DNS.  Then,
you either push this information to the /8 WR or allow it to trawl
through the DNS to find all the WRs for each subdomain (subset of
the /8 space) it finds.

But then it needs to be told if you move the /23 WR - so push is
probably the best approach, with the /8 WR verifying the pushed
information by checking it matches what it finds in the DNS when it
queries:

  24.waypoint.0.224.33.199.v4.trrp.arpa
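The two checks discussed above can be combined into a single test that the /8 WR runs on a push. The first check asks whether the pushed WR address is already an ETR of a micronet inside the pushed prefix. The second asks whether the DNS independently lists that WR. This is a minimal sketch. The resolver callable and the function name are hypothetical. The query name is passed in rather than derived, because the naming of sub-/8 waypoint records is not settled here.

```python
from __future__ import annotations

import ipaddress
from typing import Callable, Iterable


def accept_pushed_waypoint(
    prefix: str,
    wr_address: str,
    micronets: Iterable[tuple[str, list[str]]],
    dns_waypoints: Callable[[str], list[str]],
    query_name: str,
) -> bool:
    """Decide whether a /8 WR should accept a pushed (prefix, WR address) pair.

    micronets: (micronet prefix, ETR addresses) pairs from the mapping data.
    dns_waypoints: hypothetical resolver returning WR addresses published at query_name.
    """
    net = ipaddress.IPv4Network(prefix)
    wr = ipaddress.IPv4Address(wr_address)
    # Check 1: the WR is already an ETR of some micronet inside the pushed prefix.
    for micronet, etrs in micronets:
        if ipaddress.IPv4Network(micronet).subnet_of(net):
            if wr in (ipaddress.IPv4Address(e) for e in etrs):
                return True
    # Check 2: the DNS, queried independently, lists this WR for the prefix.
    return wr in (ipaddress.IPv4Address(a) for a in dns_waypoints(query_name))
```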

The ITR sends the initial packets to the /8 WR because it hasn't
cached this /23 WR information.  It doesn't query the DNS to see if
there is a /23 WR, because that would take as long as asking for the
ETR address.



> So, 148.129.75.8 keeps packet A in GRE and sends it on to my 
> waypoint at 71.246.241.146. He doesn't bother trying to look up 
> 1.224.33.199.v4.trrp.arpa because I've told him that my waypoint
>  operates in "generous" mode.

OK - your WR will accept all packets for your /23, with the idea
that ITRs only send them while they are awaiting a response to their
mapping request.


> My waypoint at 71.246.241.146 receives packet A. He's directly 
> attached to an authoritative DNS server that knows the current 
> TRRP map for 1.224.33.199.v4.trrp.arpa and probably already has 
> it in his cache. In this case, it happens to be directly 
> available (the waypoint was also a final ETR) so 71.246.241.146 
> decapsulates packet A and delivers it to 199.33.224.1.

So far so good.

> Around the same time, the DNS request for
> 1.224.33.199.v4.trrp.arpa from the original ITR reaches the
> authoritative DNS server and the reply starts making its way
> back.

When the response arrives, the ITR stops sending packets via
148.129.75.8 and instead uses the ETR address it got from the
response, in practice choosing one of what will typically be several
ETR addresses.

But what if your /23 is actually split into multiple micronets, and
some of their ETRs are nowhere near your WR?  The system would still
work, but could involve longer paths.


> Note that US Government could have been replaced with Money 
> Grubbing Company and "initial" mode could have been replaced with
>  "limited" mode. The differences would have been:
> 
> 1. The first ITR would have held a copy of packet A and would
> have sent a second copy once the map lookup for
> 1.224.33.199.v4.trrp.arpa succeeded.
> 
> 2. If I didn't pay MGC to pass my packets to me, he would have
> dropped packet A instead of sending it on to my waypoint. I'd
> have had to wait for the ITR's second copy.

Which the ITR would tunnel directly to your ETR as soon as it
received the mapping response.
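The difference between "initial" and "limited" mode at the first ITR can be modelled as follows. This is a minimal sketch. Mappings are keyed per destination address rather than per micronet, the DNS lookup itself is not modelled, and the ETR choice just takes the first address. All class and method names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PendingDestination:
    """ITR state for a destination whose map lookup is still outstanding."""
    held_packets: list[bytes] = field(default_factory=list)


class Itr:
    """Toy model of an ITR's first-packet handling."""

    def __init__(self, send):
        self.send = send                             # send(tunnel_endpoint, packet)
        self.mappings: dict[str, list[str]] = {}     # destination -> ETR addresses
        self.pending: dict[str, PendingDestination] = {}

    def forward(self, dest: str, packet: bytes, wr: str, mode: str) -> None:
        etrs = self.mappings.get(dest)
        if etrs:
            self.send(etrs[0], packet)               # mapping known: tunnel to an ETR
            return
        state = self.pending.setdefault(dest, PendingDestination())
        if mode == "limited":
            state.held_packets.append(packet)        # keep a copy in case the WR drops it
        self.send(wr, packet)                        # meanwhile, tunnel via the waypoint

    def on_mapping(self, dest: str, etrs: list[str]) -> None:
        self.mappings[dest] = etrs
        state = self.pending.pop(dest, None)
        if state:
            for packet in state.held_packets:        # "limited" mode: send the second copy
                self.send(etrs[0], packet)
```

In "initial" mode nothing is held, so the WR's copy is the only one. In "limited" mode the receiver may get the packet twice if the WR did forward it.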


> Note also that 148.129.75.8 would likely announce 199.0.0.0/8
> into the BGP table so that networks without TRRP ITRs could find
> their way to my TRRP ETR. 

OK - to the extent that this is true and workable, TRRP uses a
similar technique to Ivip's "anycast ITRs in the core/DFZ" and
LISP's "Proxy Tunnel Routers" to attract and tunnel packets sent
from networks without ITRs.

So far, you have mentioned only a single WR.

All you have to do is tell me that these WRs must be, or should
typically be anycast, so multiple such WRs doing the same job are
scattered around the Net, advertising the same prefix - and give me
a name for this technique (maybe "Anycast Waypoint Routers in the
DFZ") and I would say that TRRP is potentially incrementally
deployable, at least in this important respect, with ways of
ensuring relatively short paths and good load sharing between these WRs.

It would be somewhat different for TRRP to have one set of ITRs
around the Net anycasting one or more /8s and another set anycasting
another one or more /8s, but it is close enough to the basic Ivip
concept of each such ITR advertising all Ivip MABs for it to be
regarded as a close cousin.  I am not sure exactly how Proxy Tunnel
Routers would be organised - each one advertising every prefix which
encompasses space mapped by LISP, or some covering part of the space
and others covering other parts.

Ivip already has the concept of load sharing between multiple ITRs
by each one only advertising a fraction of the MABs (Mapped Address
Blocks), so your approach, if implemented with anycast, is the same
as one way Ivip could be used.

In principle, with multiple RUASes each handling a subset of MABs,
one could imagine each RUAS establishing its own set of anycast ITRs
in the DFZ, each set only advertising that RUAS's micronets.  It
could start this way, as RUASes competed to support their space with
well placed, high capacity ITRs.  In time, it would probably make
sense for them to form a consortium to run a single set of anycast
ITRs around the Net, or to subcontract their ITR needs to ITRs-R-Us
LLC who would have multiple sites and ITRs there which advertise the
MABs of all its client RUASes.


> How exactly we do this is an open
> issue, one of the unfinished things about the document... Any
> network which -does- have a TRRP ITR shouldn't insert that route
> into its FIB, or should locally override it from the ITR. How
> does it know to do so? It's my "holey routes" problem wrought
> large.

It is a pretty easy problem to solve if you only have one WR for
each /8.  Whether they are anycast or not, and whether or not you
have multiple IP addresses for each /8 WR (so the ITR can choose
between multiple physically separate WRs), as long as you only have
a few hundred of them and they are all handling /8s, then you can
afford to have a relatively static configuration item in every TRRP ITR:

   Ignore any /8 advertisements on the following list of /8
   prefixes for the purpose of deciding whether there is a normal
   BGP route to those prefixes.  Those are all actually routes to
   WRs, which we may choose to use for initial packets.
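That configuration item can be sketched as a route filter, assuming the ITR sees its BGP prefixes as a plain list. The function name and the single-entry list of waypoint /8s are illustrative only. Note that there is no default route in the DFZ, so a covering match here really means a normal BGP route.

```python
import ipaddress

# Hypothetical static list of /8s announced by waypoint routers.
WAYPOINT_8S = {ipaddress.IPv4Network(p) for p in ("199.0.0.0/8",)}


def has_native_route(dest: str, bgp_prefixes: list, waypoint_8s=WAYPOINT_8S) -> bool:
    """True if some BGP route other than a listed waypoint /8 covers dest."""
    addr = ipaddress.IPv4Address(dest)
    for p in bgp_prefixes:
        net = ipaddress.IPv4Network(p)
        if net in waypoint_8s:
            continue  # route to a WR: usable for initial packets, not a "normal" route
        if addr in net:
            return True
    return False
```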

However, an arrangement of a few hundred /8 WRs looks to me like it
won't scale.

My second critique is that you really can't have a single WR for
each /8, for a number of reasons.

1 - Too much load on a single WR.

      Fix with multiple machines at the one site, or by using
      anycast.

      Also fix with multiple WR addresses for the one /8, so
      that the ITR can choose any one and so spread the load.

      You could also split the system up into a much larger
      number of WRs, each for a longer prefix.  However,
      that makes it less likely each ITR will have the WR
      it needs already in its cache.  (ITRs could scan the
      DNS to a certain depth to find all WRs, say to
      /16 - and cache all them.  However, you need to keep
      them stable, or have the ITRs periodically scan the
      DNS quite often.)


2 - Single point of failure.

      Solvable with anycast and with multiple WR addresses - but
      how could an ITR know it was sending packets to a WR
      which was working, if it just got the address from
      DNS some time ago?  One of those going down would blackhole
      some subset of the initial packets to this /8.


3 - Long paths.

      The ITR is in Shanghai and so is the ETR it would choose
      to tunnel the packets to, but the ITR doesn't know this
      yet because the mapping response hasn't arrived.  (Maybe the
      nameservers haven't been fully anycast as I suggested in my
      message 488.)

      The /8 WR is in Washington DC, so the initial packets have
      to cross back and forth across the Pacific and the USA.

      This is slow and inefficient.

      Solution 1: Anycast the WRs widely.

      Solution 2: If anycast is not used, supply a number
                  of WR addresses and somehow enable the
                  ITR to figure out which is closest.

      Long paths mean long delays, larger costs of
      carrying traffic which really should just be going
      from one place to another in Shanghai, and a greater
      chance of the packet being dropped.

4 - A /8 WR may need to know an excessive number of ETRs or other
    WRs to send packets to.  The /8 could handle the TRRP mapped
    space for hundreds of thousands of end-users, and each one
    may have two or more micronets, which are currently mapped to
    ETRs in very different locations.  Then, each such micronet
    will need its own WR, unless you are prepared to live with
    overly long paths for initial packets.

    This is probably OK, but it requires some really snappy
    administration to make it secure.  Maybe the DNS checking
    system I mentioned is what you had in mind.
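Items 1 to 3 all point towards giving the ITR several WR addresses per /8 and letting it choose among them. The sketch below shows one such choice. It assumes the ITR probes each WR for liveness and round-trip time. The probing itself, the 10% tolerance and all names are hypothetical. Dead WRs are skipped, the lowest-RTT ones are preferred, and a hash of the destination spreads load across equally close WRs.

```python
from __future__ import annotations

import zlib
from dataclasses import dataclass


@dataclass
class WrCandidate:
    address: str
    rtt_ms: float | None   # last measured round-trip time, None if never measured
    reachable: bool        # result of the ITR's most recent liveness probe


def choose_wr(candidates: list[WrCandidate], dest: str) -> str | None:
    """Pick a WR for an initial packet, or None if every candidate is down."""
    alive = [c for c in candidates if c.reachable]
    if not alive:
        return None
    measured = [c for c in alive if c.rtt_ms is not None]
    if measured:
        best = min(c.rtt_ms for c in measured)
        # Treat anything within 10% of the best RTT as equally close (arbitrary).
        pool = sorted(c.address for c in measured if c.rtt_ms <= best * 1.1)
    else:
        pool = sorted(c.address for c in alive)
    # Stable hash of the destination, so one flow keeps using the same WR.
    return pool[zlib.crc32(dest.encode()) % len(pool)]
```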

Maybe the WR system could be defined somewhat differently,
sticking with the original concept of a single WR for a whole /8.

We assume that the entire mapping database is too big for any router
to cache - the whole basis of TRRP's or ALT's appeal is that it
scales endlessly.

However, if there is a /8 WR somewhere which gets all the early
packets from every ITR in the world, then that can operate simply as
a super ITR, and will have already cached the mapping for pretty
much any mapped address which has recently been receiving traffic.

Whether it is /8 or /12 or /16, you have a bunch of these things,
called WRs, and each is just a fast, large cache, ITR.  It doesn't
need the concept of end-user specific WRs - it just looks up and
caches the mapping like any other ITR.  The advantage is that it
will generally already have the mapping in cache, so it will be able
to tunnel the packet to the ETR immediately.
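A WR working as a large-cache ITR is essentially an LRU cache placed in front of the mapping lookup. This is a minimal sketch under simplifying assumptions. Entries are keyed per address rather than per micronet, there is no mapping TTL, and the resolver callable and class name are hypothetical.

```python
from collections import OrderedDict
from typing import Callable


class CachingWaypoint:
    """WR acting as a big-cache ITR: LRU cache of destination -> ETR address."""

    def __init__(self, resolve: Callable[[str], str], capacity: int = 1_000_000):
        self.resolve = resolve      # hypothetical mapping lookup (e.g. a DNS query)
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def etr_for(self, dest: str) -> str:
        if dest in self.cache:
            self.cache.move_to_end(dest)     # mark as most recently used
            self.hits += 1
            return self.cache[dest]
        self.misses += 1
        etr = self.resolve(dest)             # slow path: fetch the mapping
        self.cache[dest] = etr
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used entry
        return etr
```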

However, you really need lots of these around the Net, anycast.
Then, the anycasting dilutes the traffic seen by each one, which is
good in one way, but increases the chance that each one won't have
cached the mapping for the micronet when it needs it.
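The dilution effect can be put into rough numbers with a simple model. This is an illustration under stated assumptions, not a measurement. It assumes that packets to one micronet arrive as a Poisson process, that they are split evenly across the anycast sites, and that each site's cache entry expires after an idle timeout. The function name is hypothetical.

```python
import math


def cache_hit_probability(rate_per_s: float, ttl_s: float, sites: int) -> float:
    """Chance that a packet finds the micronet's mapping cached at its anycast site.

    With Poisson arrivals thinned evenly over `sites`, a hit occurs when the
    same site saw another packet to this micronet within the last ttl_s seconds.
    """
    per_site_rate = rate_per_s / sites
    return 1.0 - math.exp(-per_site_rate * ttl_s)
```

With illustrative parameters of one packet every 100 seconds to a micronet and a 10-minute idle timeout, a single site hits about 99.8% of the time. Each of 36 sites hits only about 15% of the time.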



To soup up the WR system as I understand you described it - 200 or
so /8 WRs - I think the most obvious thing to do is to anycast them
and to split them up to handle longer prefixes each (less address
space).

A suitably anycasted system could have a complete set of WRs at 36
locations around the Net, as I listed in message 488.  They might
as well be in the same locations as you have the complete set of
anycast nameservers for the trrp.arpa domain and subdomains.

Now, there's no real reason for an ITR to send both a mapping
request and the initial packet as separate things, because both
packets would go to the same anycast site.

You could simply have the ITR tunnel the initial packet to some
anycast address, knowing it will find its way to your nearest
anycast site, and be tunnelled quickly to your chosen WR (which you
chose the location of to be close to, or at, your ETR site).

The source address of the outer header tells your anycast server the
ITR's address and the destination address of the inner packet tells
it what IP address the ITR needs the mapping for.

This is just like APT's Default Mapper, but is anycast to a few
dozen sites around the Net.
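The anycast site can derive the whole mapping request from the tunnelled packet itself. This is a minimal sketch that assumes plain IPv4-in-IPv4 encapsulation (protocol 4) rather than GRE. The header builder exists only to make the example self-contained. The function names are hypothetical, and 198.51.100.1 stands in for the anycast address.

```python
import ipaddress
import struct


def ipv4_header(src: str, dst: str, proto: int, payload_len: int) -> bytes:
    """Build a minimal 20-byte IPv4 header (checksum left at zero; demo only)."""
    return struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + payload_len, 0, 0, 64, proto, 0,
        ipaddress.IPv4Address(src).packed, ipaddress.IPv4Address(dst).packed,
    )


def mapping_request_from_tunnel(packet: bytes) -> tuple:
    """Return (ITR address, address needing a mapping) from an IPv4-in-IPv4 packet."""
    if packet[0] >> 4 != 4:
        raise ValueError("outer header is not IPv4")
    outer_len = (packet[0] & 0x0F) * 4              # outer IHL, in bytes
    if packet[9] != 4:                              # protocol 4 = IPv4 in IPv4
        raise ValueError("not an IP-in-IP packet")
    itr = ipaddress.IPv4Address(packet[12:16])      # outer source = the ITR
    inner = packet[outer_len:]
    if inner[0] >> 4 != 4:
        raise ValueError("inner header is not IPv4")
    wanted = ipaddress.IPv4Address(inner[16:20])    # inner destination
    return str(itr), str(wanted)
```

For example, an inner packet from 126.0.0.1 to 199.33.224.1, tunnelled by an ITR at 192.0.2.10, yields `("192.0.2.10", "199.33.224.1")`. The anycast site would answer the first address with the mapping for the second.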

This is a hybrid push-pull global network of a few dozen Default
Mappers.  You are pushing the entire database, plus information
about WRs, to these few dozen sites.  The more sites you have the
greater the load sharing and the lower the delay time in getting
mapping to the ITRs.  Likewise the shorter the extra path length
travelled by initial packets.

It would overcome many of the delay and bottleneck problems of TRRP
as you currently propose it:

  Anycast DNS servers authoritative for each /8 of the address
  space, but probably separate sets of anycast servers at
  different sites depending on which RIR or whatever is running
  these DNS servers.

  DNS servers for the /24 prefixes are far more numerous and run
  by far more organisations than the few RIRs etc.  So presumably
  this means they are not anycast, and that delay times for
  this second request will often be pretty long.  Likewise, there is
  a greater risk of packet loss.

  220 or so sites, each hosting the WR for an entire /8.  Likewise,
  these bring scaling problems, long path problems etc., meaning more
  delay.


>> To what extent does your system resemble ALT, and to what
>> extent does my critique of ALT apply to your system?
> 
> Without having reviewed ALT in any detail, my best guess is that
> it doesn't share much besides the notion of a highly aggregated
> alternate path.

OK.  ALT automatically passes the packet up and down a hierarchy,
which is strongly aggregated with the unfortunate result that each
such router could be almost anywhere, so the overall path length for
the whole ITR to ETR trip could be very long indeed.

By "highly aggregated" in TRRP, you mean the /8 WR already knows the
exact IP addresses to send packets to for every micronet in its /8.


> ALT sounds like it might work with static tunnels or private
> lines using something close to standard BGP. On the down side, it
> would require a complex dance between thousands of operators to
> get it going. 

Yes.

> With Waypoints, the complex dance is at the RIR
> public policy level getting authorization to announce for that
> supernet. Once authority is obtained, they hook up at any old
> place and announce the prefix.

OK - it is very direct - /8 WR to /XX WR, for each micronet or for
whatever subset of the space the end-user chooses.

I assume that micronets for IPv4 can be as small as a single IP
address.  Will the RIRs be happy to run a WR which has theoretically
up to 2^24 separate destination WRs?

What about traffic volume levels?  Right now, RIRs probably charge
you and Google the same for X amount of address space.   But if
Google really gives their WRs a hammering, the RIRs should be
charging Google according to their higher traffic volume.

>> By your own description, the Waypoint Router path is "long" - 
>> compared to going direct in a tunnel to the ETR (the address of
>>  which is not known at this time).  Presumably this "long" path
>> will be faster than waiting for the mapping information to
>> arrive.
>> 
>> Do you have estimates for the delay times?
> 
> My SWAG is that the initial round trip will be 1.5 to 2 times the
>  normal round trip with some single-digit percentage taking long
> enough to recognize no gain versus bare TRRP.

I don't clearly understand this.

If you have a single WR for each /8, then some ITRs are going to be
on the other side of the Earth with respect to it, and so are the
ETRs they are trying to send packets to.  Worst case delay times
could be long unless you have an elaborate anycast network.
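Rough numbers for the Shanghai and Washington DC example from point 3 above can be computed as follows. This is a minimal sketch of the lower-bound propagation delay of the detour. It assumes great-circle paths and roughly 200 km per millisecond in fibre. Real paths are longer, so these are best-case figures, and the names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0
FIBRE_KM_PER_MS = 200.0   # light in fibre covers roughly 200 km per millisecond


def great_circle_km(a: tuple, b: tuple) -> float:
    """Haversine distance between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def one_way_ms(*points: tuple) -> float:
    """Lower-bound propagation delay along a chain of points."""
    km = sum(great_circle_km(p, q) for p, q in zip(points, points[1:]))
    return km / FIBRE_KM_PER_MS


SHANGHAI = (31.23, 121.47)
WASHINGTON_DC = (38.91, -77.04)
```

`one_way_ms(SHANGHAI, WASHINGTON_DC, SHANGHAI)` comes to roughly 120 ms for the ITR to WR to ETR trip, before any queuing or routing detours. The direct ITR to ETR path within Shanghai adds almost nothing.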

  Regards

    - Robin





--
to unsubscribe send a message to rrg-request@psg.com with the
word 'unsubscribe' in a single line as the message text body.
archive: <http://psg.com/lists/rrg/> & ftp://psg.com/pub/lists/rrg