Re: [RRG] Sprite & IPTM while PMTU probing is in progress

To: Routing Research Group list <rrg@psg.com>
Subject: Re: [RRG] Sprite & IPTM while PMTU probing is in progress
From: Robin Whittle <rw@firstpr.com.au>
Date: Thu, 29 Nov 2007 23:35:55 +1100
Cc: "Templin, Fred L" <Fred.L.Templin@boeing.com>
In-reply-to: <39C363776A4E8C4A94691D2BD9D1C9A1029EDC5F@XCH-NW-7V2.nw.nos.boeing.com>
Organization: First Principles
References: <474CCDC0.6070604@firstpr.com.au> <39C363776A4E8C4A94691D2BD9D1C9A1029EDC5F@XCH-NW-7V2.nw.nos.boeing.com>
User-agent: Thunderbird 2.0.0.9 (Windows/20071031)

Hi Fred,

Thanks for your response.  You wrote, in part:

> On- and off-list discussions have explored the idea of requiring
> a 2KB EMTU_R on all ETRs and accommodating all 1500-byte and
> smaller packets; even if fragmentation is needed and the inner
> DF=1. This would uphold the "principle of least surprise" to the
> SH, but the issue comes in knowing the EMTU_R of the ETR and in
> avoiding excessive fragmentation.

I don't clearly understand this.  Iljitsch has not been able to
convince me - or anyone else AFAIK - that it will be practical to
insist on massive hardware upgrades broad enough for the wide range
of locations where I think ITRs and ETRs need to be placed. (Host to
host, Man - Coast to coast. http://www.firstpr.com.au/ip/ivip/tv-ad/)

Maybe some people envisage a more restricted range, or are more
upbeat about the prospects of a global jumboframes and gigabit
Ethernet upgrade.

For the purposes of discussing sprite-mtu and IPTM, I assume that
this upgrade is not practical.


> Since the ITR cannot know the EMTU_R of the ETR a priori unless
> there is some spec that says: "all ETRs MUST configure an EMTU_R
> of at least X bytes", the ITR should not simply fragment the
> outer packets (or, allow the network to fragment them) since they
> could black-hole. 

You and most other people on this list have much more experience in
these matters than I do, but I don't see why fragmenting packets for
a few seconds will lead to a "black hole".

I recognise that fragmentation involves several costs:

1 - More work for the routers at each end.

2 - Two packets rather than one being handled by all
    routers en-route.

3 - More data being sent, due to an extra packet's worth of
    overhead.

4 - Most seriously, I think, a greater chance of loss of the total
    packet due to its delivery depending on two packets, rather than
    one.

    Each packet has some risk of being lost, and the risk is
    a little higher than would otherwise be the case, since the
    two packets take up more time on the wire and more router
    resources than the single original packet.

5 - Fragmentation failures due to 16 bit ID wraparound.  You
    mentioned this being a problem with IPv4 at high packet rates
    and longer delays: the 16 bit ID counter wraps, fragments of
    different packets end up with the same ID, and the wrong
    fragments are reassembled together.

Taking point 5 first, in message 608, quoting Iljitsch, I asked
whether this could be largely resolved by a shorter time window for
reassembly in the ETR:

 - - - -

> I don't have any references, but in short, the issue is that you
> have a 16 bit ID space with a reassembly timeout of something
> like a few minutes. This means you can only send 65536 packets
> during that "few minute" window or you'll incorrectly reassemble
> fragments from different packets if you lose a fragment. This is
> especially problematic if the fragmented packets belong to a
> tunnel because in that case the IP source/dest addresses are
> always the same.

In a fresh system such as an ITR-ETR scheme, perhaps a workaround
for this would be to set the maximum reassembly time at the ETR to
something very much shorter, such as 1, 2 or 3 seconds?

That could be a pain where the ETR is an additional function in a
server or router with a pre-existing TCP/IP stack.  Then it would
probably not be possible to shorten the time just for the ITR to ETR
packets.

 - - - -

Can anyone comment on the prospects of this being an acceptable
solution?

In an ITR-ETR scheme, we probably don't want to sit around waiting
for packets which somehow get lost and arrive more than a second
late via some circuitous route through Manangatang.
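
To put rough numbers on this, here is a minimal sketch of the packet
rate at which the 16 bit ID would wrap within one reassembly window.
It assumes all fragmented packets between one ITR and one ETR share a
single ID counter, and it uses 120 seconds as a stand-in for "a few
minutes".  The function name is mine, purely for illustration.

```python
IP_ID_SPACE = 2 ** 16  # the IPv4 Identification field is 16 bits


def max_safe_packet_rate(reassembly_timeout_s: float) -> float:
    """Packets per second one ITR-ETR pair can send before the ID
    wraps inside a single reassembly window."""
    return IP_ID_SPACE / reassembly_timeout_s


# 120 s is an assumed value for "a few minutes"; 1-3 s are the
# shorter windows suggested above.
for timeout in (120, 3, 2, 1):
    rate = max_safe_packet_rate(timeout)
    print(f"{timeout:>4} s window -> {rate:>8.0f} packets/s")
```

On these assumptions, cutting the window from minutes to a second or
two raises the rate at which wraparound becomes a risk by about two
orders of magnitude.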

I will read:

  IPv4 Reassembly Errors at High Data Rates
  http://tools.ietf.org/html/rfc4963

The other points don't, to me, constitute a "black hole" or a
serious problem.  They seem to involve marginally greater efforts,
marginally less efficiency and marginally greater chances of the
final packet not being delivered.

To me, this is a far better thing to do, for a few seconds or so
into a potentially long flow, than dropping packets.
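
As a rough illustration of point 4, the sketch below computes the
chance of losing the whole packet when it is carried as several
fragments.  It assumes each fragment is lost independently with the
same probability, and it ignores the slightly higher per-packet risk
I mentioned above.  The 1% figure is only an example value.

```python
def whole_packet_loss(p_fragment_loss: float, fragments: int = 2) -> float:
    """Probability that at least one fragment is lost, so the
    original packet cannot be reassembled."""
    return 1.0 - (1.0 - p_fragment_loss) ** fragments


p = 0.01  # assumed per-packet loss rate, for illustration only
print(f"unfragmented:  {whole_packet_loss(p, 1):.4f}")
print(f"two fragments: {whole_packet_loss(p, 2):.4f}")
```

For small loss rates the result is a little under double the
single-packet figure, which is what I mean by "marginally greater".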


> There are also other factors to consider, including that the ITR
> may not have ultimate control over the setting of the ip_id. And,
> the ETR may not be able to receive non-initial fragments in the
> first place. (These factors can be mitigated by the placement of
> the ITR and ETR in some use cases, however.)

I don't clearly understand these points.  Can you explain them further?


>> A "long" inner packet is one which, once encapsulated, would
>> exceed the ITR's current best estimate of the PMTU.  This would
>> initially be a default such as 1280 bytes.
>> 
>> If this default value is above the minimum for the protocol,
>> eg. 576 for IPv4, then this value of PMTU to the "core" of the
>> net must be available to every ITR and ETR and would be part of
>> the specification of the ITR-ETR scheme.
> 
> The pathMTU cannot be known a priori; all that can be known a
> priori is the EMTU_R of the ETR if there is a specification for
> the minimum size.

Yes.  I envisage some value around 1280 - at least something so that
a 1500, 1520 etc. byte packet would only be fragmented into two packets.


>> The default value would be replaced by a higher value once the
>> probe process was complete.  The pattern would be something
>> like this, assuming the SH's initial idea of PMTU to the
>> Destination Host was 1500.  I will assume an ENCAPS overhead of
>> 20 bytes (as with IPv4 Ivip, though other ITR-ETR schemes have
>> higher overheads) and that all ITRs and ETRs are located so
>> they have an MTU of at least 1280 from the DFZ.
> 
> Do you mean to say "pathMTU" or "linkMTU"? In terms of "pathMTU",
> do we need to consider links with configurable linkMTUs that
> might have either mis-configured or overly- conservative values?

I don't understand this clearly enough to respond at present.

(My example is not in quotes - just your response.)

   Inner      ITR action on           SH's idea  DH gets
   packet     outer packet            of PMTU
   length     following encapsul-     to DH
              ation of inner packet

                                      1500


   200       Send outer packet -      1500       The packet
             the length is less than
             1280.

  1500       Fragment packet and      1500       The packet
             commence probing                    (Less efficient
                                                 and more error-
                                                 prone tunnel with
                                                 2 packets instead
                                                 of 1, but this is
                                                 only for a few
                                                 seconds, I hope.)

  1500       Fragment packet and      1500       The packet
             continue probing

             ... etc.

> This could black hole and appear as congestion-related loss to
> the SH.

As noted above, I foresee only marginal efficiency and reliability
impact, where you see unacceptable packet losses.

             Probing complete:
             PMTU to ETR decided
             to be 1460.

  1400       Send outer packet -      1500       The packet
             the length is <= 1460.
             (This length would not
             necessarily be sent - it
             is just to show that the
             ITR will now send longer
             packets without frag-
             mentation than before.)

  1500       Drop the packet and send
             the SH a PTB message
             with value 1440.         1440       Nothing, but the
                                                 ITR is usually
                                                 close to the SH,
                                                 and it doesn't
                                                 take long for...


> Depending on the placement of the ITR, this PTB might not make it
> back to the SH.

If there is a filter blocking PTB messages from the ITR to the SH,
then I think ordinary, non-RFC4821, PMTUD would be clobbered anyway
- with or without an ITR-ETR scheme.  This situation would not
persist, I think, once an ITR-ETR scheme was widely deployed, as I
discussed in:

  http://psg.com/lists/rrg/2007/msg00636.html


  1440      Send the outer packet -  1440        The packet
            the length is <= 1460                (Now the tunnel
                                                 is handling
                                                 optimal length
                                                 packets.)
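
The IPTM rows in the table above can be sketched as a single decision
function.  This is only my reading of the example, not a
specification: the function name and return values are made up for
illustration, ENCAPS is the 20 byte IPv4 Ivip overhead assumed above,
and 1280 is the assumed default PMTU before probing completes.

```python
ENCAPS = 20          # encapsulation overhead assumed in the example
DEFAULT_PMTU = 1280  # ITR's estimate before probing has completed


def iptm_action(inner_len, pmtu_estimate, probing_complete):
    """Return (action, ptb_value) for one inner packet."""
    outer_len = inner_len + ENCAPS
    if outer_len <= pmtu_estimate:
        return ("send", None)
    if not probing_complete:
        # Too long for the default estimate: fragment the outer
        # packet and keep probing.  No PTB goes to the SH yet.
        return ("fragment", None)
    # PMTU is now known: drop and tell the SH the largest inner size.
    return ("drop+ptb", pmtu_estimate - ENCAPS)


print(iptm_action(200, DEFAULT_PMTU, False))
print(iptm_action(1500, DEFAULT_PMTU, False))
print(iptm_action(1400, 1460, True))
print(iptm_action(1500, 1460, True))
print(iptm_action(1440, 1460, True))
```

The five calls correspond to the five rows of the table, in order.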

>> This pattern would continue unless the ITR, with periodic
>> probing, decides that the PMTU is less than 1460 (it might do
>> this quickly if it got a PTB message from a router in a new,
>> more MTU-challenged, path to the ETR), and if the SH sends a
>> packet which would be too big for the new lower value of PMTU.
>> Then the ITR would send another PTB message to the SH,
>> with a lower value than 1440.

> There is not strictly any periodic probing needed to detect 
> pathMTU reductions, since the data packets serve as virtual 
> probes. The data packets will be lost and might be considered by
> the SH as congestion-related loss if the PTB can't be translated
> by the ITR and sent back to the SH. 

In my IPTM proposal, the ITR would only get PTB messages from
routers in the tunnel if the outer source address was that of the
ITR.  This would be the case if IPTM was applied to LISP, eFIT-APT
or TRRP.  Ivip uses the sending host's address in the outer header,
so the ITR would never get a PTB message if the PMTU to the ETR
suddenly became lower than what it had assumed, based on previous
explicit probes acknowledged by the ETR.

This would cause a black hole for all packets which, once
encapsulated, were longer than the new, lower, PMTU between the ITR
and ETR.  To guard against this, the ITR needs to periodically probe
an ETR to which it is continually sending long packets.  There could
be other techniques, such as an ITR-ETR protocol which enables the
ITR to receive acknowledgement of packets it sends, but this gets
pretty complex, and I am hoping to avoid such things.

> But, the ITR will be able to return the correct PTB when the SH
> retransmits. This is the same as for sprite-mtu.

Once the ITR has a correct idea of the new, lower, PMTU to the ETR,
it can drop (IPTM) or send (sprite-mtu) the packet, and generate a
PTB to the SH when the SH next sends a packet which, once
encapsulated, would be too long for that new PMTU.


>> Alternatively, occasional probing by the ITR might discover a
>> higher value of PMTU to this ETR, and the SH could discover
>> this increase by trying its luck with a larger packet - and
>> either having it accepted, or rejected with a PTB containing
>> the new higher value, minus 20.
> 
> SHs that don't implement RFC4821 will have to wait for a long
> time before trying a larger packet (RFCs 1191 and 1981 say 10min,
> I believe). 

Indeed.  This is why my critique of sprite-mtu seems to be
important.  As I understand it, the SH first gets a rather low value
in the PTB message from the ITR, at a part of the exchange at which
an IPTM ITR would have simply fragmented the packet without any PTB,
and commenced or continued probing the ETR.

If it takes at least 10 minutes for a non-RFC4821 compliant host to
try sending larger packets, then this is a long time for the
communication to be restricted to the shorter packets.

I am not assuming widespread adoption of RFC4821 at any time.  It
looks really complex to implement, involving applications and the
TCP layer communicating with a new function in the OS in ways which
were not originally part of the protocol stack.  Writing all this
code, for marginal immediate benefit, and then trying to debug it in
all its possible combinations of applications, live network settings
etc. sounds really, really complex.


> SHs that implement RFC4821 can retry end-to-end probing more 
> frequently than that, since loss of a probe does not
> expose data to silent loss.

Yes.


>> I think Sprite will fragment outer packets, as does IPTM, 
>> irrespective of the DF flag of the inner packet.
> 
> That is true.

OK.


>> The criteria for which IPv4 outer packets are fragmentable is
>> complex (5.6.4).
> 
> That text borrows from the tunnel MTU and fragmentation 
> discipline set forth in RFC4213, Section 3.2. But, I don't know
> what you mean by complex?

What I meant to write was that when I tried to read this and the
rest of the ID in order to answer the questions I tried to pose in
my second example, I couldn't understand it clearly enough to
envisage how your proposal would operate.

My questions were something like:

  Does the ITR first send a rather low value in the PTB message (for
  the first "long" packet)?

  While probing is taking place, does the ITR generate PTBs for
  packets which are too long for its current, too low, PMTU
  estimate (and send the packet too, without fragmentation), or
  does it fragment them, like IPTM?

I think the answers are:

  Yes.

  The ITR generates PTBs and sends the packets holus bolus.


>> I am not sure how Sprite handles "large" outer packets while it
>> is probing.  Does it fragment them as IPTM does?  Or does it do
>> the following, which is the same as what it does for a "long"
>> packet once the PMTU has been reliably ascertained:
>> 
>> (5.6.5) "... admits the packet but also sends a PTB message
>> ..."
> 
> It does the latter; this borrows from RFC2003, Section 5.

OK, without carefully reading RFC2003, which I would if I had more
time right now, I understand that the ITR both sends the complete
outer packet, without fragmentation, and sends a PTB message to the SH.

>> It seems strange to me to send the packet (unfragmented, I
>> assume) while also sending back a PTB message to the sending
>> host.  Wouldn't this cause needless traffic and/or confusing
>> signals to the SH if the outer packet does in fact arrive at
>> the ETR and therefore the inner packet is delivered to the
>> destination host?
> 
> To the SH, it would appear that there is a router on the path
> returning inaccurate information. This can happen already today,
> since routers can be misconfigured, and spoofed PTBs can be sent
> from any node in the network.

It still seems strange, confusing and inefficient to me.


> SHs that implement RFC4821 should not have a problem 
> deconflicting the (suspect) PTB information from (authentic) 
> end-to-end feedback from the DH, but should benefit from the PTB
> info when the actual data is not delivered to the DH.

Yes, but I am assuming that none, or few, hosts will implement
RFC4821 any time soon.


> ITRs can help the situation by sending sprites of, e.g., 1500
> bytes into the tunnel early in the process so that most if not
> all SHs that use the tunnel will see a 1500 byte or larger MTU.

Does "early in the process" mean when only shorter packets have so
far needed to be tunneled to the ETR?

If so, then the ITRs could be generating large volumes (in bytes) of
probe packets in response to only small traffic flows, and to some
flows which never in fact require PMTU knowledge, since the flows
never actually use long packets.


>> Here I will assume IPv4 only, with 1280 bytes for the default
>> PMTU for every ETR the ITR has not yet probed.  I will also
>> assume an encapsulation overhead of 20, although this would
>> typically be higher for Sprite and non-Ivip ITR-ETR schemes.
> 
> I don't understand "higher for sprite-mtu"?

This was a low-key aside.  In my second example, trying to explain
how I thought sprite-mtu might work, I kept the same 20 byte
encapsulation overhead I used in my first example, which was for
IPTM, assuming Ivip's 20 byte overhead.

Other ITR-ETR schemes, and I guess most other tunneling schemes
sprite-mtu would be applied to, have higher encapsulation overheads,
I think.


>> If the ITR sends a PTB message to the SH when the first packet
>> (or multiple packets) length exceeds the default PMTU value and
>> then, after probing, decides the PMTU is 1480, then I am
>> concerned that the SH would get contradictory values in these
>> PTB messages.
>>
>> At first the SH would be told to send packets no longer than
>> (1280 - 20 = 1260) and later, it would be told to send packets
>> no longer than (1480 - 20 = 1460).
>
> Note: in the next draft version I would like to rewrite the
> second bullet of Section 5.6.4 as:
>
> o  for IPv4/*/IPv4 tunnels, 'pathMTU' is less than MIN(EMTU_R,
>    1280+ENCAPS) bytes and the inner IPv4 packet is no larger than
>    MIN(EMTU_R-ENCAPS, 1280).

The current version is:

  o  for IPv4/*/IPv4 tunnels, 'pathMTU' is less than MIN(EMTU_R,
     1280) bytes and the inner IPv4 packet is no larger than
     MIN(EMTU_R, 1280) minus ENCAPS.  (When EMTU_R for the TFE is
     not known, 576 bytes must be assumed.)

OK.  My eyes are glazing over right now.  I will probably be able
to understand this in its full context in the future.
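
To help compare the two wordings, here is a small sketch which just
evaluates the two thresholds in each version of the bullet.  The
function names are mine, the 2048 byte EMTU_R is an assumed value
taken from the "2KB EMTU_R" idea quoted earlier, and ENCAPS = 20 is
the overhead I used in my examples.  It says nothing about how the
rest of Section 5.6.4 uses these values.

```python
def current_bullet(emtu_r, encaps):
    """(pathMTU threshold, largest inner packet), current wording."""
    limit = min(emtu_r, 1280)
    return limit, limit - encaps


def proposed_bullet(emtu_r, encaps):
    """(pathMTU threshold, largest inner packet), proposed wording."""
    return min(emtu_r, 1280 + encaps), min(emtu_r - encaps, 1280)


ENCAPS = 20
for emtu_r in (2048, 576):  # 576 is the fallback when EMTU_R is unknown
    print(emtu_r, current_bullet(emtu_r, ENCAPS),
          proposed_bullet(emtu_r, ENCAPS))
```

With a large EMTU_R, the inner packet limit moves from 1260 to 1280
under the proposed wording.  With EMTU_R = 576, both wordings give
the same values.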


>> But if the SH took notice of the first PTB message, it probably
>> wouldn't send any longer packets which would trigger the
>> second. So, if my understanding of Sprite is correct, the SH
>> would experience something like this:

   Inner      ITR action on           SH's idea  DH gets
   packet     outer packet            of PMTU
   length     following encapsul-     to DH
              ation of inner packet

                                      1500


   200       Send outer packet -      1500       The packet
             the length is less than
             1280.

> Probing can start here also.

As I noted above, at this point the ITR can't reasonably assume that
it will need PMTU information to the ETR.  It is just a short packet.

Perhaps smart ITRs could have heuristics to help them decide when to
start probing earlier, based on prior statistics (of all sorts of
aspects of packets that the ITR might analyse) of shorter packets
often being followed by longer packets.  But that is not something I
would require of an IPTM ITR.


  1500       Send packet and                     Probably nothing
             commence probing                    - however, if the
             Send PTB with value                 packet was a little
             1260                     1260       shorter, such as
                                                 1440, then it may
                                                 arrive at the ETR
                                                 in one piece,
                                                 despite the SH
                                                 being told it was
                                                 too big.

> With the note above, the returned size would be 1280; not 1260.
> Also, I'm not sure as to the "probably nothing" as linkMTUs
> increase above 1500 (perhaps someone could send the IEEE
> reference that proposes the increase for 802 linkMTUs).

OK - I guess in the future it might be worth trying longer than 1500
byte packets to the ETR.

My current IPTM proposal is not aimed at optimising the
opportunities for doing so.  I will think about this whenever I
revise it.

             SH breaks the message
             into smaller packets
             and retries:

  1260       Send packet and          1260       The packet
             continue probing

             ... etc.

             Probing complete:
             PMTU to ETR decided
             to be 1460.

> By probing, do you mean by the ITR or by the SH? 

I meant the ITR sends sprite probes to the ETR.

> I am assuming that SHs will begin using RFC4821 and will
> probe the path for themselves independent of any probing
> done by the ITR.

I am not assuming hosts will be any different than they are today.
As far as I know, few, if any, implement RFC4821.

If you are assuming this, I think it would be good to make it an
explicit condition you are designing sprite-mtu to function within.


  1260       Send outer packet -      1260       The packet
             the length is <= 1460.

  1260       SH would probably keep              The packet -
             sending packets of                  but more and
             length <= 1260.  Unless             shorter
             the SH was pushy, it                packets than
             would never discover the            the ITR-ETR
             PMTU it could use was               tunnel can
             in fact 1440.                       handle.

> IMHO, SHs that use RFC4821 can be "pushy" within reason.

Yes, but I think it will be a long time before there are many such
hosts.
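
My second example can be sketched the same way.  This reflects only
my current understanding of sprite-mtu (the ITR admits the long
packet and also sends a PTB), using the draft's current wording so
the first PTB carries 1260.  The names are made up for illustration,
and the SH is assumed to be a non-RFC4821 host which always sends
packets as long as its PMTU allows and only ever lowers that value
when it gets a PTB.

```python
ENCAPS = 20  # same overhead as in my IPTM example


def sprite_action(inner_len, pmtu_estimate):
    """Return (action, ptb_value) as I understand sprite-mtu."""
    if inner_len + ENCAPS <= pmtu_estimate:
        return ("send", None)
    # Long packet: admit it unfragmented and also send a PTB.
    return ("send+ptb", pmtu_estimate - ENCAPS)


def run_flow(pmtu_estimates, sh_pmtu=1500):
    """Feed one full-sized packet per ITR estimate and track the SH."""
    history = []
    for estimate in pmtu_estimates:
        action, ptb = sprite_action(sh_pmtu, estimate)
        history.append((sh_pmtu, action, ptb))
        if ptb is not None:
            sh_pmtu = min(sh_pmtu, ptb)  # the SH never raises its PMTU
    return history, sh_pmtu


# Default estimate of 1280 while probing, then 1460 once complete.
history, final_pmtu = run_flow([1280, 1280, 1460, 1460])
for row in history:
    print(row)
print("SH ends up using", final_pmtu, "although 1440 would fit")
```

On these assumptions the SH stays at 1260 after the first PTB, even
once the ITR knows the tunnel could carry 1440 byte inner packets.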



> I would like to add one other note about the 1280. That number
> comes from the SHOULD in RFC4213, Section 3.2.1. The reason I am
> taking the SHOULD is that there can be additional encapsulations
> on the path between the ITR and ETR (e.g., tunnel-mode IPsec) and
> we don't want to cause fragmentation for those, either. If the
> ITR and ETR are arranged such that there will be no additional 
> encapsulations on the path (and the ITR has a way of knowing
> this) then the spec could use the RFC4213, Section 3.2.1
> "configuration knob" to push the 1280 up to as much as 1480 or
> perhaps even 1500. I would not want to go any higher than this,
> since it could involve excessive fragmentation resulting in
> undetected data corruption. 

OK - I won't try to chase this up right now, but thanks for the
pointers.

> Maybe I should add something about this to the spec?

Yes, I think the more explanation of where numbers like 1280 come
from, the better.

Thanks again for this detailed discussion.


  - Robin     http://www.firstpr.com.au/ip/ivip/pmtud-frag/



