RE: MTU stuff, was Re: [RRG] LISP-NERD reachability and MTU detection

To: "Iljitsch van Beijnum" <iljitsch@muada.com>, "Dino Farinacci" <dino@cisco.com>
Subject: RE: MTU stuff, was Re: [RRG] LISP-NERD reachability and MTU detection
From: "Templin, Fred L" <Fred.L.Templin@boeing.com>
Date: Tue, 18 Dec 2007 08:05:14 -0800
Cc: "Routing Research Group list" <rrg@psg.com>
In-reply-to: <F004A384-2347-4CCF-8D5E-481EDD0C62D4@muada.com>
References: <EAB3BF96-D438-459E-A753-F9D72B1FE5B6@muada.com> <EC1BB972-6F21-4CBC-B827-BB1840C25AE8@cisco.com> <99E69FED-D637-4F47-985E-6AB0DCB0B8E0@cisco.com> <54ADF2DD-8FFC-45BD-9C56-BD548A91528B@cisco.com> <39C363776A4E8C4A94691D2BD9D1C9A1029EDD10@XCH-NW-7V2.nw.nos.boeing.com> <F004A384-2347-4CCF-8D5E-481EDD0C62D4@muada.com>

Iljitsch and Dino, 

> -----Original Message-----
> From: Iljitsch van Beijnum [mailto:iljitsch@muada.com] 
> Sent: Tuesday, December 18, 2007 4:14 AM
> To: Templin, Fred L; Dino Farinacci
> Cc: Routing Research Group list
> Subject: MTU stuff, was Re: [RRG] LISP-NERD reachability and 
> MTU detection
> 
> On 17 dec 2007, at 20:27, Templin, Fred L wrote:
> 
> > Key considerations are: 1) 1500 bytes has become the
> > "magic number" expected by applications
> 
> Applications??

OK; make that "expected by the original source" (w/o
specifying at which layer the 1500 byte assumption is
embedded).

> > 2) 1280 bytes is
> > the "magic number" specified for IPv6, and 3) fragmentation
> > at the TFE MUST be kept to a minimum in order to avoid
> > reassembly misassociations at the TFE. Of these, IMHO 3) is
> > the dominating consideration followed distantly by 1). ( 2)
> > is the hard lower bound for IPv6, and we can't change that.)
> 
> > In particular, I want to see a requirement that TNEs MUST NOT
> > configure a fragmentation threshold larger than 1500 bytes
> > for the packets they admit into the tunnel.
> 
> I don't think the (main) problem is packets larger than 1500 bytes. If
> you generate those, you pretty much know what you're doing (or soon
> will).

By "fragmentation threshold", I mean the size below which
we will allow (grudgingly!) some minimal outer packet
fragmentation at the ITR, and above which we will not.
As below, 1501+ packets with DF=1 will *not* be fragmented
by the ITR. 1500- packets *may* be fragmented by the ITR
and/or a network middlebox, but we would prefer not to
fragment them either.
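
The threshold rule described above can be sketched roughly as
follows. This is only an illustration, not text from sprite-mtu or
any LISP draft: the function name `itr_action`, its parameters and
the returned labels are all hypothetical, and the handling of 1501+
packets with DF=0 (fragmenting the inner packet so that the final
destination reassembles) is an assumption drawn from the "gross
fragmentation" remark later in this message.

```python
FRAG_THRESHOLD = 1500  # bytes; the "fragmentation threshold" discussed here


def itr_action(inner_len, df_set, encap_overhead, tunnel_mtu):
    """Return a label for what a hypothetical ITR does with one packet.

    inner_len      -- size of the original (inner) packet in bytes
    df_set         -- True if the original packet has DF=1
                      (IPv6 packets would be treated as DF=1)
    encap_overhead -- bytes of encapsulation the ITR adds (assumed known)
    tunnel_mtu     -- the ITR's current estimate of the MTU toward the ETR
    """
    if inner_len + encap_overhead <= tunnel_mtu:
        # Fits in one piece: no fragmentation needed at all.
        return "send_whole"
    if inner_len <= FRAG_THRESHOLD:
        # Grudgingly fragment the *outer* packet; the ETR reassembles.
        return "fragment_outer"
    if df_set:
        # Never fragment 1501+ packets with DF=1; report back instead.
        return "drop_and_send_ptb"
    # Assumption: 1501+ with DF=0 is fragmented as an inner packet, so
    # reassembly happens at the final destination, not at the ETR.
    return "fragment_inner"


# Example: a 1500-byte DF=1 packet plus 20 bytes of encapsulation
# over a 1500-byte path falls below the threshold.
print(itr_action(1500, True, 20, 1500))
```

The only size-dependent branch is the comparison against
FRAG_THRESHOLD, so moving or erasing that line later would be a
one-constant change in a model like this.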

> The issue is when tunnel overhead over a 1500-byte path breaks  
> the 1500-byte assumption that is created by the fact that people  
> filter ICMP too big messages without bothering to disable path MTU  
> discovery.

Yes.
 
> > Specific transitions I would like to see include:
> 
> >  1) Require that all TFEs configure an EMTU_R that is no
> >     smaller than 2KB and at least as large as the smallest
> >     EMTU_R of all underlying links over which the TFE is
> >     configured. (IMHO 2KB is a good number because it
> >     allows for a 1500 byte fragmentation threshold at the
> >     TNE yet allows room for additional encapsulations
> >     on the path.)
> 
> If the reassembly happens in the destination host this shouldn't be an
> issue in practice because of the TCP MSS option, if it happens in a
> middlebox we can mandate a number, and 2048 seems like a conservative
> one, or we can specify a way for the destination to let the source
> know what the number is.

Keep in mind that when DF=0 in the original packet there
may be *two* levels of reassembly; a tiny bit of reassembly
at the ETR and a potentially larger amount of reassembly at
the final destination. But, the most we will ever ask the
ETR to reassemble is 1500 bytes.
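
For a rough sense of the headroom a 2KB EMTU_R leaves above a
1500-byte fragmentation threshold, here is a quick arithmetic sketch.
The header sizes are the standard minimum IPv4 (20), IPv6 (40) and
UDP (8) header lengths; the encapsulation stacks tried and the helper
name `fits_in_emtu_r` are illustrative assumptions, not something
specified in this thread.

```python
EMTU_R = 2048          # proposed minimum reassembly buffer at the ETR (2KB)
FRAG_THRESHOLD = 1500  # largest inner packet the ETR is asked to reassemble

IPV4_HDR = 20  # minimum IPv4 header, no options
IPV6_HDR = 40  # fixed IPv6 header
UDP_HDR = 8


def fits_in_emtu_r(encap_headers, inner_len=FRAG_THRESHOLD, emtu_r=EMTU_R):
    """True if inner_len plus all encapsulation headers fits in emtu_r."""
    return inner_len + sum(encap_headers) <= emtu_r


# Bytes left over for encapsulation headers added along the path.
headroom = EMTU_R - FRAG_THRESHOLD
print(headroom)

# One IPv6+UDP encapsulation, then a second nested IPv4 encapsulation.
print(fits_in_emtu_r([IPV6_HDR, UDP_HDR]))
print(fits_in_emtu_r([IPV6_HDR, UDP_HDR, IPV4_HDR]))
```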

> >  2) Require that all links transition to adopting IEEE
> >     802.3as Ethernet Frame Size expansion, or better yet
> >     Gigabit Ethernet Jumboframes.
> 
> There is already a large amount of equipment out there that does "baby
> jumbos" which should be enough to allow encapsulation of a 1500 byte
> packet without problems,

That's good.

> but there's also still a lot of 100 Mbps and  
> some 1 Gbps equipment out there that can only do 1500 or 1504.

That's not so good.

> I  
> believe that a new effort like this allows us to require people to  
> upgrade their MTUs, something that's pretty much impossible to do at  
> any other time, so I would be in favor of doing so.

Agree; maybe we can get Dino to help give a push in this
direction.
 
> >  3) Require that all original sources that send packets
> >     of 1501 bytes or larger with DF=1 also implement
> >     RFC4821.
> 
> Not really an issue, in my opinion. If you send large packets you  
> either need to implement RFC 4821

You (i.e., the original source) can control this.

> or you need to make sure that you
> hit a 1500-byte hop that reliably sends you too bigs before you enter
> the big bad internet.

The original source can't control this.

> If either of these are impossible (and assuming
> TCP MSS clamping isn't an option) you can't realistically have an MTU
> larger than 1500 bytes.

No, but we can go for a BCP that says: "original sources
that send 1501+ packets with DF=1 are strongly recommended
to implement RFC4821". IMHO, there is a BCP to be written
based on sprite-mtu and also on these discussions which
can help bail us out of many of the issues.

> On 18 dec 2007, at 2:36, Templin, Fred L wrote:
> 
> > Adding a means for the ITR to discover the ETR's EMTU_R
> > is something I have proposed in numerous earlier efforts,
> > and also something I have considered for sprite-mtu. But
> > AFAICT, we really don't want the ETR to be reassembling
> > fragmented outer packets any larger than 1500 bytes;
> > instead, the ITR should send packets larger than 1500
> > bytes in one piece and/or send back a PTB if they are
> > too big.
> 
> Fair enough.
> 
> However, encoding a specific packet size that triggers different  
> behavior makes me uncomfortable.

Until all network gear everywhere gets upgraded, we have
to draw a line somewhere and IMHO 1500 bytes is the right
place. The line can be drawn in pencil and should fade
over time. If 10 years from now a network engineer reads
in a history book that the Internet once had a cell size
of 1500 bytes and scoffs in disbelief, then we will have
done our jobs properly.
 
> > So, IMHO all that needs to be known about the ETR is the
> > binary as to whether it can reassemble up to 1500 bytes
> > or not. If we say that all ETR's must be able to
> > reassemble up to 2KB (enough to cover the 1500 byte
> > packet plus any additional encapsulation overhead)
> > then maybe there isn't all that much to be gained by
> > an explicit EMTU_R discovery exchange?
> 
> Well, if you don't want to reassemble the EMTU_R would be moot, and

Remember that the recommendation is to grudgingly allow a
tiny bit of fragmentation at the ETR. The reason for this
is that we can manage and dampen any fragmentation that
occurs at the ETR but we cannot manage and dampen any
fragmentation that goes all the way through to the final
destination. (Also, in the IPv6/IPv4 case, there is no
option to allow IPv4 fragmentation through to the IPv6
destination in the first place.) 
 
> pretty much also if you only want to reassemble packets that hover
> around the magic 1500-byte mark because obviously any real-world
> device that's going to be created will be able to support that size if
> it supports reassembly in the first place. Still, mentioning a
> specific size, such as 2048, in that case would probably be useful.

Good; more BCP material.

> On 18 dec 2007, at 0:01, Dino Farinacci wrote:
> 
> > I am not advocating that the ETR reassemble here. I want to make  
> > that clear.
> 
> Ok. That is a reasonable position.

But if you let it *all* go through to the final destination,
then RFC4963 hits home. To be more precise, the ITR should
let gross fragmentation go through to the final destination,
and then the original source can be rightfully blamed if
anything goes wrong. But, it should work with the ETR to
mitigate tiny fragmentation to uphold the principle of
least surprise for original sources that expect to see 1500.

At this point, I would like to suggest that Dino (and maybe
also you, Iljitsch) have another look at sprite-mtu and send
comments, since many of these points are already addressed
in that document.
 
> >> You can't fragment IPv6 packets or IPv4 packets with DF=1.
> 
> > Right, you have to obey the protocol spec. So packets will get  
> > dropped with DF=1. And people turn off ICMP messages as well.
> 
> In my opinion, building devices that can't forward 1500-byte packets  
> without fragmentation and deploying them in ISP networks is a
> non-starter*. You ruled out reassembly by ETRs so this means that we
> either have to compress the encapsulation overhead to 0 bytes (=  
> translation) or we have to require larger MTUs in the entire path  
> between any ITR and any ETR.

Working toward larger MTUs in the entire path is IMHO the
right and proper thing to do - more BCP material. But, we
won't get there overnight.
 
> * You could have ITRs that can't handle 1500 bytes if those are under
> the control of the source site because then the source site can make
> sure that the too bigs the ITR generates are acted upon. But if there
> are _some_ ITRs that need to send 1500+ byte packets then _all_ ETRs
> must support this, too.

I don't necessarily agree that all ETRs need to *support*
1501+ packets, if that is what you mean to say. ITRs
should send 1501+ with the understanding that they are
at risk of loss due to MTU restrictions. It would be
helpful for the ITR to be able to probe the path to the
ETR (and for the ETR to respond to probes), but that does
not mean that the ETR necessarily has to accept anything
larger than 1500.  
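
One way an ITR might probe the path to the ETR is a simple binary
search over probe sizes, in the spirit of RFC4821 (which does not
mandate any particular search strategy). The sketch below is an
assumption-laden illustration: `probe` is a hypothetical callable
that returns True when the ETR acknowledges a probe of the given
size, 1500 is assumed to already work, the 9000-byte upper bound is
arbitrary, and the path MTU is assumed to stay stable during the
search.

```python
def search_path_mtu(probe, known_good=1500, upper=9000):
    """Return the largest size in [known_good, upper] that probe() accepts.

    probe      -- hypothetical callable: probe(size) -> bool, True if the
                  ETR responded to a probe packet of `size` bytes
    known_good -- a size assumed to work without probing
    upper      -- largest size worth trying (arbitrary assumption)
    """
    best = known_good
    lo, hi = known_good + 1, upper
    while lo <= hi:
        mid = (lo + hi) // 2
        if probe(mid):
            best = mid      # mid got through; try something larger
            lo = mid + 1
        else:
            hi = mid - 1    # mid was lost; try something smaller
    return best


# Example with a simulated path whose real limit is 4470 bytes.
print(search_path_mtu(lambda size: size <= 4470))
```

A lost probe only lowers the estimate here; it never implies that
the ETR has to accept or reassemble anything larger than 1500.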
 
> > So what's the difference if packets get lost doing a mapping lookup
> > (everyone is so sensitive to packet drops there) but for MTU
> > discovery purposes it's okay to drop packets?
> 
> Depends on how many packets get dropped. But the fundamental  
> difference is that between dropping the first packet or a later one.  
> With the first packet, TCP doesn't know if the other side is reachable
> and it doesn't have an RTT estimate yet, so recovering from that is a
> lot slower. Also, if PMTUD is properly deployed, the packet that was
> too big will be immediately resent after receiving the too
> big message.
> 
> > Do you think 1500 byte MTU links will still be around say 5 years
> > from now? Maybe it's time to clean up some links on the network. I'm
> > sure vendors can provide incentive to do this.  ;-)

Yes, let's push for this - more BCP material. We won't get
there overnight, but we should strive to get there asap.

At this point, I am seeing at least two and possibly three
documents:

  1) a BCP telling what updates we would like to see for
     networking equipment and link MTUs. Can be taken from
     excerpts of sprite-mtu and from numerous list postings
     and off-list emails.
  2) sprite-mtu updated based on IETF70 comments and with
     the BCP materials removed.
  3) Possibly also Iljitsch's document, but it would have to
     go as experimental due to ND changes.

Fred
fred.l.templin@boeing.com
 
> Well, you work for a vendor. You guys ship tons of product that can
> handle 1500+ byte MTUs (and some that can't) but AFAIK, in each and
> every case, ethernet interfaces on routers have their MTU set to 1500
> by default.
> 
> I did get some good feedback when I presented my variable MTU subnet  
> draft in Chicago but not much after that. I'm going to see if I can  
> get it published as an experimental RFC anyway. Hopefully, that way we
> really can get rid of those 1500-byte MTUs in the next five years.
> (But I'm not holding my breath.)
> 
> >> We have both the potential to do very quite things (trigger broken
> >> PMTUD)
> 
> I was going for "quite harmful"
> 
> >> and very useful things (give people an incentive to deploy  
> >> jumboframes, create the first MTU-robust tunneling mechanism) here
> >> so we should aim to get things right the first time rather than
> >> repeat the mistakes made with RFC 1191.
> 
> > When you think it is right, it will change. It's been a continual  
> > moving target with multiple moving parts for 20 years. You can never
> > be right.
> 
> Maybe you can't ever be right, but that doesn't mean you can't be more
> wrong than usual.  :-)
> 
