
[RRG] LISP PMTU & fragmentation problems



Short version:    lisp-06's new material on Path MTU limits includes
                  some text on how to resolve the problems if they
                  are deemed to be bad enough to need a solution
                  within LISP.  I can't understand this text in any
                  way which makes practical sense.

                  Also, it would be good to publish the research
                  which indicates that 1500 byte MTU limits are
                  relatively rare.

                  If the problems of PMTU are deemed not worth
                  solving within LISP, then LISP would be deployed
                  on the assumption that all transit links are
                  capable of a PMTU much higher than 1500 bytes.

                  This would tend to constrain the locations of ITR
                  and ETR functions to be at or near border routers,
                  in order that they have unfettered access to jumboframe
                  capable links to the core of the Internet.

                  This would seem to be a major restriction on the
                  ability of operators to place ITRs and ETRs wherever
                  they like.

                  Likewise, it would seem to reduce the number of
                  devices which could do ITR or ETR functions and
                  thereby lead to bottlenecks and to these devices
                  needing to be large and expensive.


Hi Dino and Team,

Below is the text from draft-farinacci-lisp-06 which concerns path
MTU limits, packets being too long due to encapsulation overhead,
fragmentation etc.

=======

5. Tunnelling Details

   ...

   Since additional tunnel headers are prepended, the packet becomes
   larger and in theory can exceed the MTU of any link traversed from
   the ITR to the ETR.  It is recommended, in IPv4 that packets do not
   get fragmented as they are encapsulated by the ITR.  Instead, the
   packet is dropped and an ICMP Too Big message is returned to the
   source.

   Based on informal surveys of large ISP traffic patterns, it appears
   that most transit paths can accommodate a path MTU of at least 4470
   bytes.  The exceptions, in terms of data rate, number of hosts
   affected, or any other metric are expected to be vanishingly small.

   To address MTU concerns, mainly raised on the RRG mailing list, the
   LISP deployment process will include collecting data during its pilot
   phase to either verify or refute the assumption about minimum
   available MTU.  If the assumption proves true and transit networks
   with links limited to 1500 byte MTUs are corner cases, it would seem
   more cost-effective to either upgrade or modify the equipment in
   those transit networks to support larger MTUs or to use existing
   mechanisms for accommodating packets that are too large.

   For this reason, there is currently no plan for LISP to add an
   additional, complex mechanism for implementing fragmentation and
   reassembly in the face of limited-MTU transit links.  If analysis
   during LISP pilot deployment reveals that the assumption of
   essentially ubiquitous, 4470+ byte transit path MTUs, is incorrect,
   then LISP can be modified prior to protocol standardization to add
   support for one of the proposed fragmentation and reassembly schemes.
   Note that one simple scheme is detailed in Section 5.4.


5.4. Dealing with Large Encapsulated Packets

   In the event that the MTU issues mentioned above prove to be more
   serious than expected, this section proposes a simple and stateless
   mechanism to deal with large packets.  The mechanism is described as
   follows:

   1.  Define an architectural constant S for the maximum size of a
       packet, in bytes, an ITR would receive from a source inside of
       its site.

   2.  Define L to be the maximum size, in bytes, a packet of size S
       would be after the ITR prepends the LISP header, UDP header, and
       outer network layer header of size H.

   3.  Calculate: S + H = L.

   When an ITR receives a packet of size greater than L on a site-facing
   interface and that packet needs to be encapsulated, it resolves the
   MTU issue by first splitting the original packet into 2 equal-sized
   fragments.  A LISP header is then pre-pended to each fragment.  This
   will ensure that the new, encapsulated packets are of size (S/2 + H),
   which is always below the effective tunnel MTU.

   When an ETR receives encapsulated fragments, it treats them as two
   individually encapsulated packets.  It strips the LISP headers then
   forwards each packet to the destination host of the destination site.
   The two fragments are reassembled at the destination host into the
   single IP datagram that was originated by the source host.

   This behavior is performed by the ITR when the source host originates
   a packet when the DF field of the IP header is set to 0.  When the DF
   field of the IP header is set to 1, or the packet is an IPv6 packet
   originated by the source host, the ITR will drop the packet when the
   size is greater than L, and sends an ICMP Too Big message to the
   source with a value of S, where S is (L - H).

   This specification recommends that L be defined as 1500.

=======

I will deal first with the section 5.4 material.  I will assume a header
size of 36 bytes for IPv4.  This is 20 bytes for the IPv4 header, 8 for
the UDP header and 8 for the LISP Locator Reach Bits and nonce.  In fact,
you need to allow more than this, depending on whether you need 4 bytes
for up to 31 Locator Reach Bits, 8 bytes for 62 etc.  Similarly, for IPv6
I assume a 56 byte overhead: 40 bytes for the IPv6 header, plus the same
8 bytes each for the UDP and LISP headers.
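
To make the arithmetic explicit, here is a minimal sketch of how I am
deriving H and S from the draft's recommended L of 1500.  The header sizes
are my assumptions from the paragraph above, not values stated in the draft,
and the function names are purely illustrative.

```python
# Header sizes in bytes, as assumed in this message (not taken from the draft).
OUTER_IP = {"ipv4": 20, "ipv6": 40}
UDP_HEADER = 8
LISP_HEADER = 8  # Locator Reach Bits plus nonce; may be larger in practice


def encap_overhead(family: str) -> int:
    """Return H, the bytes an ITR prepends for the given outer address family."""
    return OUTER_IP[family] + UDP_HEADER + LISP_HEADER


def max_inner_size(tunnel_mtu: int, family: str) -> int:
    """Return S = L - H, the largest traffic packet that still fits after encapsulation."""
    return tunnel_mtu - encap_overhead(family)


if __name__ == "__main__":
    for family in ("ipv4", "ipv6"):
        print(family, "H =", encap_overhead(family),
              "S =", max_inner_size(1500, family))
```

With more Locator Reach Bits the LISP header grows, so H rises and S falls
by the same amount.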

Jumboframes can be up to 16110 bytes, according to a cavalcade of MTU
values compiled by Joe St Sauver:

  http://darkwing.uoregon.edu/~joe/jumbo-clean-gear.html

According to:

  http://www.cisco.com/warp/public/121/mtu_atm.html

the figure of 4470 bytes was chosen to match FDDI and HSSI.


My understanding of what you wrote in 5.4 is:

IPv4 with Don't Fragment set to 0:
----------------------------------

> This specification recommends that L be defined as 1500.

> 2.  Define L to be the maximum size, in bytes, a packet of size S
>     would be after the ITR prepends the LISP header, UDP header, and
>     outer network layer header of size H.
>
> 3.  Calculate: S + H = L.

L = 1500
H =   36
S = 1464

> 1.  Define an architectural constant S for the maximum size of a
>     packet, in bytes, an ITR would receive from a source inside of
>     its site.

S is 1464 bytes.  But an ITR could receive a packet of any length from a
source inside its site.  So this sentence makes no proper sense to me.


>  When an ITR receives a packet of size greater than L on a site-facing
>  interface and that packet needs to be encapsulated, it resolves the
>  MTU issue by first splitting the original packet into 2 equal-sized
>  fragments.  A LISP header is then pre-pended to each fragment.

I guess you meant 'S' (1464) rather than 'L' (1500).

>  This
>  will ensure that the new, encapsulated packets are of size (S/2 + H),
>  which is always below the effective tunnel MTU.

This last sentence would only be true (assuming PMTU to the ETR of 1500
bytes, whereas it could be somewhat lower than this) if the ITR had an
admission limit of 2908 bytes. I am ignoring Ethernet headers and maybe
some other details here, so these figures may not be precise.

 Traffic packet:

    2908 bytes = 20 bytes IPv4 header + 2888 bytes payload.


 Fragments into two packets:

    1464 bytes =     20 bytes IPv4 header
                 + 1444 bytes payload


 Add LISP outer header:

    1500 bytes =     36 bytes outer IPv4 + UDP + LISP headers
                 +   20 bytes inner IPv4 header
                 + 1444 bytes payload
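
The figures above can be checked with a short sketch.  It assumes a 20 byte
inner IPv4 header with no options, H = 36 and L = 1500, and the function
names are mine.  One detail I glossed over: IPv4 requires the payload of
every fragment except the last to be a multiple of 8 bytes, so a real ITR
could carry slightly less than 2908 bytes.  The align8 flag shows the
difference.

```python
from typing import List, Tuple

INNER_IPV4_HEADER = 20  # inner IPv4 header without options (assumption)
H = 36                  # outer IPv4 + UDP + LISP overhead assumed in this message
L = 1500                # encapsulated size limit recommended by the draft


def fragment_payload_budget(align8: bool = False) -> Tuple[int, int]:
    """Largest payloads the first and second fragments can carry within L."""
    budget = L - H - INNER_IPV4_HEADER
    # A non-final IPv4 fragment must carry a multiple of 8 payload bytes.
    first = (budget // 8) * 8 if align8 else budget
    return first, budget


def admission_limit(align8: bool = False) -> int:
    """Largest traffic packet that two encapsulated fragments can carry."""
    first, second = fragment_payload_budget(align8)
    return INNER_IPV4_HEADER + first + second


def encapsulated_sizes(packet_len: int) -> List[int]:
    """Encapsulated sizes after an even two-way split (alignment ignored)."""
    payload = packet_len - INNER_IPV4_HEADER
    first = (payload + 1) // 2
    second = payload - first
    return [H + INNER_IPV4_HEADER + part for part in (first, second)]


if __name__ == "__main__":
    print(admission_limit(), admission_limit(align8=True))
    print(encapsulated_sizes(2908), encapsulated_sizes(2909))
```

A packet one byte over the limit already produces an encapsulated fragment
longer than L.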

The problem is that this paragraph:

>  When an ITR receives a packet of size greater than L on a site-facing
>  interface and that packet needs to be encapsulated, it resolves the
>  MTU issue by first splitting the original packet into 2 equal-sized
>  fragments.  A LISP header is then pre-pended to each fragment.

doesn't specify an upper limit on the size of packets which will be handled
in this way.

Since you only allow two fragments, I think there needs to be some upper
limit on the admitted packet size.

However, all this assumes that the ITR has a 1500 byte PMTU to the ETR.
In many cases, the PMTU will be a lot higher.  So the above algorithm does
not allow the ITR to send longer packets without fragmentation.

I think any solution needs to enable larger than ~1500 byte traffic
packets to be encapsulated and sent without fragmentation, assuming the
PMTU to the ETR allows this.  This will be complex, since the ITR needs to
discover this PMTU, and can't do this instantly.  Fred Templin and I have
approaches to this:

  http://www.firstpr.com.au/ip/ivip/pmtud-frag/
  http://tools.ietf.org/html/draft-templin-seal

I want to update my proposal somewhat, based on some of Fred's ideas.


IPv6 or IPv4 with DF = 1
------------------------

>  ...  the ITR will drop the packet when the size is greater than L, and
>  sends an ICMP Too Big message to the source with a value of S, where S
>  is (L - H).

This makes no sense to me.  It would make more sense if this occurred when
the packet length was greater than S (1464 bytes for IPv4 or 1444 for IPv6).

But you still have the problem that the system makes no use of longer
PMTUs, which will frequently exist, to the ETR.
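
A rough sketch of the two readings, assuming L = 1500 and the 36 and 56 byte
overheads I used above.  The names are illustrative only and nothing here is
taken from the draft beyond the comparison itself.  It shows why comparing
against L lets through packets which no longer fit once encapsulated, while
comparing against S does not.

```python
from dataclasses import dataclass
from typing import Optional

L = 1500                                  # limit recommended by the draft
OVERHEAD = {"ipv4": 36, "ipv6": 56}       # H, as assumed in this message


@dataclass
class Verdict:
    action: str               # "encapsulate" or "drop"
    icmp_mtu: Optional[int]   # MTU reported in the ICMP Too Big, if dropped


def draft_literal_drops(packet_len: int) -> bool:
    """The draft as written: drop only when the packet is longer than L."""
    return packet_len > L


def itr_verdict(packet_len: int, family: str) -> Verdict:
    """The reading I suggest: drop when the packet is longer than S = L - H."""
    s = L - OVERHEAD[family]
    if packet_len > s:
        return Verdict("drop", s)
    return Verdict("encapsulate", None)


if __name__ == "__main__":
    # A 1480 byte IPv4 packet with DF = 1 passes the draft's test, yet it
    # would be 1516 bytes once encapsulated.
    print(draft_literal_drops(1480), 1480 + OVERHEAD["ipv4"])
    print(itr_verdict(1480, "ipv4"))
```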

I think there is no way around the thorny task of the ITR having to
discover the PMTU to the ETR.  While it is doing that, there remains the
problem of what to do with packets which would need to be fragmented in
order to be delivered if the PMTU were (as must be assumed) only 1500 bytes.



Research and upgrading transit networks
---------------------------------------

>  Based on informal surveys of large ISP traffic patterns, it appears
>  that most transit paths can accommodate a path MTU of at least 4470
>  bytes.

It would be good for this research to be made public.

> The exceptions, in terms of data rate, number of hosts affected, or any
> other metric are expected to be vanishingly small.

Again, this needs to be justified with arguments and observations.  By the
time a map-encap scheme could be implemented, we can expect gigabit
Ethernet to be more widely deployed.  Also, if there is some forewarning
that the new map-encap scheme doesn't work well with PMTU of 1500 or so
bytes, then this might be an incentive for people to upgrade their
systems.  However, it would also be an incentive not to use the map-encap
scheme.

>  To address MTU concerns, mainly raised on the RRG mailing list, the
>  LISP deployment process will include collecting data during its pilot
>  phase to either verify or refute the assumption about minimum
>  available MTU.

I don't think a pilot scheme would be very useful for this, since it would
only involve a small number of ASes.  A better approach would be a
thorough survey of all links between ASes.

Still, even if you can prove that all links have a PMTU of 4470 or
whatever, your plan of ignoring PMTU and fragmentation problems still
leaves the ITR with the task of handling traffic packets which would be
short enough if sent directly on the transit link (meaning that some hosts
will do this, and can do it fine for non-EID destinations), but will be
too long after the addition of the LISP header.

The ITR really needs to solve this problem, since if a packet is dropped
en-route to the ETR due to some PMTU limit, the sending host will not hear
about it.  I wrote a long piece on the list last year about why it is
impractical for the ITR to handle ICMP PTB messages coming back from the
tunnel so as to generate valid PTB messages to the sending host from which
the traffic packet originated.

You could, perhaps, assume some fixed PMTU of 4470 or whatever, but then
in the future, when some or many transit links support 64k byte packets or
the like, you are still going to be stuck limiting traffic packets to 4470
minus the length of the LISP header.

>  If the assumption proves true and transit networks
>  with links limited to 1500 byte MTUs are corner cases, it would seem
>  more cost-effective to either upgrade or modify the equipment in
>  those transit networks to support larger MTUs or to use existing
>  mechanisms for accommodating packets that are too large.

OK - who is going to pay for truckloads of new routers or line cards?

This is before LISP is introduced, so no-one is benefiting yet.

More likely, LISP would be resisted or rejected due to its incompatibility
with a small but significant subset of transit links.

I still think your preferred approach involves a heavy cost of
inflexibility and bottlenecks at border routers - because you force all
ITRs and ETRs to be on (in effect) gigabit links to the outside world.


   - Robin              http://www.firstpr.com.au/ip/ivip/

