RE: [RRG] Tunnel fragmentation/reassembly for RRG map-and-encaps architectures



Iljitsch,

Please see below for some follow up on this one from
a couple of weeks back: 

> -----Original Message-----
> From: Iljitsch van Beijnum [mailto:iljitsch@muada.com] 
> Sent: Friday, December 21, 2007 12:22 PM
> To: Templin, Fred L
> Cc: Routing Research Group list
> Subject: Re: [RRG] Tunnel fragmentation/reassembly for RRG 
> map-and-encaps architectures
> 
> On 20 dec 2007, at 19:52, Templin, Fred L wrote:
> 
> > Just to have disclosure for the current understanding
> > of this out on the list, the full proposal is given
> > below.
> 
> I think this is a good way to make the fragmentation issue as painless
> as possible IF we choose to accept it, with one important and a minor
> caveat.
>
> So at some point we should try to get consensus about whether we want
> to allow fragmentation or not. We may as well do that now...
> 
> The important caveat is: when do you send back a too big message to
> the original source? This is clearly what needs to happen if the
> original source performs path MTU discovery properly, which is
> fortunately still what most (right?) hosts do. The problem is that you
> can't simply send too bigs for all packets that are too big, because
> that way you would be performing a denial-of-service attack on hosts
> that don't honor these messages. And, more importantly, you'd probably
> be wasting local CPU cycles unless you have silicon for ICMP
> generation. Would simply rate limiting too bigs be a workable
> approach? I guess it would be if you don't drop the packet but
> fragment it, although that behavior wouldn't be expected by the source
> host (but hey, stranger things have happened in the name of the
> unreliable datagram). If you drop the packet, you'd probably do that
> for several in a row so TCP would be taking a nap for a while when
> that happens.

If we can set a BCP threshold for DF=1 packets above which
the original source is expected to implement RFC4821, then
another possibility is for the ITR to never send back a PTB
under any circumstances. This means we may have to do some
work to get medium-sized packets through the tunnel, but we
can admit the large and small ones unfragmented without
ever sending any PTBs.

I firmly believe we should establish a BCP that says
that original sources that send DF=1 packets of 1501+
bytes are strongly recommended to use RFC4821 (and may
get unpredictable results if they don't). 
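
To make the stance concrete, here is a minimal sketch in
Python; the 1500-byte threshold and the names are just my
assumptions for illustration, not a spec:

    # Sketch only: the proposed BCP stance at the ITR, with 1500
    # bytes assumed as the threshold value.
    RFC4821_THRESHOLD = 1500   # DF=1 packets above this imply RFC4821 at the source

    def itr_ptb_policy(inner_len, df_set):
        # The ITR never originates a PTB; it only notes what is
        # expected of the original source.
        source_expected_to_probe = df_set and inner_len > RFC4821_THRESHOLD
        send_ptb = False       # never, under any circumstances
        return send_ptb, source_expected_to_probe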
 
> > it sends 1500- original packets into the tunnel as
> > either 1-fragment or 2-fragment packets of no more
> > than (750+ENCAPS) bytes each.
> 
> This can probably be further optimized to be aligned with memory  
> buffers.

I don't see much value in the ITR trying to guess at a
memory alignment that might be favorable to all ETRs. That
is precisely why trailers are now historical; there was
only ever a small class of (now-obsolete) devices for
which they provided any real benefit. 

> We could even generalize this to: =< 1024 or > 2048 are never  
> fragmented, 1024 - 2047 byte packets are split into a packet with the
> first 1024 bytes of the original packet and a second packet with the
> remaining bytes.

This brings up an unspoken assumption - that the interdomain
core is composed wholly of links configured with MTUs that
are significantly larger than 1KB. Personally, I don't find
that to be an unreasonable assumption; BCP documents like
RFC3819 only talk about setting small MTUs on slow links,
and I don't think we would find many (any?) of those in the
interdomain core. So, can it be reasonably assumed that the
interdomain core is made up of links with MTUs significantly
larger than 1KB (e.g., ~1500 bytes or larger)? I sure
hope so...

Back to the numbers you've mentioned above, I think I like
them and here is why. ITR->ETR tunnels are in-the-network
tunnels and as such there may be many hops between the
original source and the ITR. Within those hops, there may
be additional encapsulations (IPSec tunnels, L2TP, etc.)
such that a 1500 byte packet sent by the original source
might grow to something like ~1750 by the time it reaches
the ITR. So, the 1-2KB fragmentation region allows for up
to ~500 bytes of additional encapsulation overhead between
the original source and the ITR. Similarly, there may be
many hops between the ITR and ETR and possibly additional
encapsulations on the path between the two. So, the 1KB
maximum fragment size allows ~500 bytes of additional
encapsulation between the ITR and ETR in case there might
be a "degenerate" 1500 MTU Ethernet on the path.
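
As a quick back-of-the-envelope check of those budgets
(round numbers only; the constants below are just the ones
from the discussion above):

    ETHERNET_MTU = 1500   # assumed "degenerate" core link MTU
    SOURCE_PKT   = 1500   # largest packet the original source sends
    FRAG_CEILING = 2048   # upper edge of the fragmentation region
    MAX_FRAG     = 1024   # per-fragment ceiling at the ITR

    # headroom for IPSec/L2TP/etc. between the source and the ITR
    pre_itr_budget = FRAG_CEILING - SOURCE_PKT   # 548 bytes, i.e. ~500
    # headroom for encapsulation between the ITR and the ETR
    itr_etr_budget = ETHERNET_MTU - MAX_FRAG     # 476 bytes, i.e. ~500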

So, we could:

  - admit 1KB- packets into the tunnel unfragmented
    and trust that they will make it through due to
    the ubiquitous deployment of Ethernet or larger
    sized MTUs
  - fragment (1KB+, 2KB-) packets into 2-fragment
    outer packets until such time that probing has
    determined that they can be sent as 1-fragment
    outer packets
  - admit 2KB+ packets into the tunnel unfragmented
    and let them sink or swim on their own
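
In rough Python, and purely as a sketch (the trivial
encapsulation and the probing flag are placeholders of
mine), the above policy combined with your 1024-byte split
rule would look something like:

    ENCAPS = b"OUTER-HDR"        # stand-in for the real outer header

    def encapsulate(chunk):
        return ENCAPS + chunk

    def itr_admit(inner_pkt, probing_says_one_fragment_ok=False):
        n = len(inner_pkt)
        if n <= 1024 or n >= 2048:
            # small packets should fit anywhere; large packets
            # sink or swim on their own
            return [encapsulate(inner_pkt)]
        if probing_says_one_fragment_ok:
            # probing has shown a 1-fragment outer packet gets through
            return [encapsulate(inner_pkt)]
        # otherwise, the first 1024 bytes plus the remainder, sent
        # as two outer packets for the ETR to reassemble
        return [encapsulate(inner_pkt[:1024]),
                encapsulate(inner_pkt[1024:])]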

This means that, in order to probe our way out of the
(1KB+, 2KB-) pain region, we really need the links in the
core to support MTUs significantly larger than just a small
delta above 1500. So, why not just go straight to Gig-E and
enable jumbo frames on all links in the core? ("Baby
jumbos" of 4KB or larger are probably OK, too.)
 
> (Making the packets roughly equal size minimizes the chances of  
> subsequent fragmentation and making the first no smaller than the  
> second avoids the first one from arriving first when the two fragments
> travel over parallel paths, both issues that I don't think are all
> that important here.)

I like the idea of making the two fragments roughly
equal sized. I'm not sure I get the point about making
the first fragment larger than the second?
 
> > In both cases, the assumption is that original
> > sources that send 1501+ packets are also doing something
> > like RFC4821. This should appear in a BCP document.
> 
> If we use a few bits in the encapsulation header to ask the  
> decapsulator to send ACKs for selected packets (see my message to Dino
> earlier this week, and probably one later today) then the encapsulator
> could do its own RFC 4821 based on the data packets. I think that
> would be extremely useful.

This is something I have considered many times and have
always come to the conclusion that the benefits are not
necessarily as obvious or significant as they might seem.
First, you need extra bits in the encapsulation header
that go with every packet. Second, you need to have a
nonce such that the ITR can tell that the ACK is legitimate.
(Something like an MD5 digest of the original packet returned
in the ACK might be OK.) Furthermore, you may not always
have a large data packet lying around when you need one.

In that case, there needs to be an explicit probe packet
type that is not a data packet. RFC4821 suggests using an
adjunct protocol (e.g., ICMP echo request, traceroute,
sprite-mtu) for the purpose of probing. But as you say,
tunneling proposals that use UDP encapsulation could include
a probe packet type inline as part of the UDP encapsulation.
That would have the benefit of making sure that the probe
packets get the same treatment as the data packets.
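
For what it's worth, here is a small sketch of the
digest-as-nonce idea; the one-byte header and the flag are
entirely hypothetical, and only the "return a digest of the
probed packet in the ACK" part comes from the discussion
above:

    import hashlib
    import os
    import struct

    PROBE_FLAG = 0x01        # hypothetical "please ACK this packet" bit

    def make_probe(size):
        # Pad an explicit probe packet out to the size being tested;
        # the ITR keeps md5(body) to validate the eventual ACK.
        body = os.urandom(size)
        nonce = hashlib.md5(body).digest()
        return struct.pack("!B", PROBE_FLAG) + body, nonce

    def ack_is_legitimate(ack_digest, nonce):
        # The ETR echoes md5(body) in its ACK; the ITR compares.
        return ack_digest == nonce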

Can we say that we are reaching consensus on some of the
above? Can we get Dino and company to build us routers that
can cope with the 2-fragment cutting and pasting in the
short term, which we hope will rarely be needed as the core
transitions to Gig-E?

Thanks - Fred
fred.l.templin@boeing.com 
