
RE: [RRG] Sprite & IPTM while PMTU probing is in progress



Robin,

I am going to top-post and hit what I think are the high
points; please let me know if I have missed anything:

1) My primary concern with what you are describing is that
by setting the source address of the tunneled packet to that
of the SH (and not that of the ITR) you are effectively
disabling all ICMP error messages (not just the PTBs) for
both the SH and the ITR. The ITR will never get ICMP feedback
from routers within the tunnel, and the SH will get ICMP
feedback that it does not recognize. The lack of ICMP feedback
may have negative implications from a protocol perspective
for both the ITR and the SH. And, I do not know this for sure,
but I believe many if not most SH's will view the unrecognizable
ICMPs as an attack, e.g., warning bells may go off at least,
and more onerous defense mechanisms may kick in in the extreme.
This concern may extend to Ivip in general, and not just the
pathMTU handling.

2) Secondly, while there are routers within the tunnel that
are sending meaningless PTBs (since neither the SH nor ITR
will get them) the tunnel will "brown"-hole. You are not
going to be able to probe the ETR frequently enough to
avoid periods of either silent loss or unmitigated
fragmentation depending on the setting of the DF bit in the
outer IP header. If you tried, the probing overhead would
be excessive and then you still would never be able to probe
frequently enough to avoid all brown-outs (I classify
unmitigated fragmentation as a brown-out even if the
fragments arrive at the ETR).

3) You have not characterized the expected behavior of the
(modified) traceroute used for probing. If there is a path
of N routers, then an unmodified traceroute would need to
send up to N big packets to determine whether they are
reaching the ETR (and, the loss of a big packet at any hop
would result in a significant delay to detect). Also, raw
pings don't work because sending a large echo request causes
an equally-large echo reply to come back - and, the pathMTU
is not necessarily symmetric in the forward and reverse
directions. 

4) Both sprite-mtu and your approach operate in the
presence of fragmentation, but sprite-mtu explicitly
manages fragmentation while yours just lets it fragment.
Sprite-mtu could include a "configuration knob" (per
RFC4213) to set a higher fragmentation threshold than
1280 based on knowledge of the ETR's EMTU_R, but this
would be used at the peril of reassembly misassociations
and potential data corruption if no mitigations were in
place. (BTW, the sprite-mtu trailer is added only when
the tunnel is fragmenting and is there for the purpose
of detecting packet splicing errors.)

Again, please let me know if I missed covering any of
the points you believe are important.

Fred
fred.l.templin@boeing.com

> -----Original Message-----
> From: Robin Whittle [mailto:rw@firstpr.com.au] 
> Sent: Thursday, November 29, 2007 8:59 PM
> To: Routing Research Group list
> Cc: Templin, Fred L
> Subject: Re: [RRG] Sprite & IPTM while PMTU probing is in progress
> 
> Hi Fred,
> 
> You wrote:
> 
> >> I don't clearly understand this.  Iljitsch has not been able to
> >> convince me - or anyone else AFAIK - that it will be practical to
> >> insist on massive hardware upgrades broad enough for the wide range
> >> of locations where I think ITRs and ETRs need to be located.
> >> (Host to host, Man - Coast to coast.
> >> http://www.firstpr.com.au/ip/ivip/tv-ad/)
> > 
> > Sorry for asking, but do you have a PR department that
> > writes this stuff up? :^}
> 
> It is more a case of adding some colour to keep me awake.
> Routing and addressing is pretty dry, I think.  The LISP folks
> may have had the same experience - it seems they were dreaming
> of Iphones.
> 
> 
> >> Maybe some people envisage a more restricted range, or are more
> >> upbeat about the prospects of a global jumboframes and gigabit
> >> Ethernet upgrade.
> >>
> >> For the purposes of discussing sprite-mtu and IPTM, I assume that
> >> this upgrade is not practical.
> > 
> > IMHO, the H/W upgrades will occur independently of any
> > tunnel MTU handling. But, the tunnel MTU handling should
> > be capable of taking advantage of the larger MTUs when
> > they become available.
> 
> I agree.  I will try to ensure that IPTM can make use of much
> larger MTUs.
> 
> 
> >>> Since the ITR cannot know the EMTU_R of the ETR a priori unless
> >>> there is some spec that says: "all ETRs MUST configure an EMTU_R
> >>> of at least X bytes", the ITR should not simply fragment the
> >>> outer packets (or, allow the network to fragment them) since they
> >>> could black-hole. 
> >>
> >> You and most other people on this list have much more experience in
> >> these matters than I do, but I don't see why fragmenting packets for
> >> a few seconds will lead to a "black hole".
> > 
> > If the ETR does not have a large-enough EMTU_R, I think
> > it would have to drop the packet silently.
> 
> Yes, but this doesn't seem pertinent to my critique of sprite-mtu
> that it would be better to:
> 
> 1 - As IPTM does, fragment the outer packet into two smaller
>     ones, which will pass through the tunnel OK.  Hmm, this
>     embodies an assumption that the outer packet is no bigger
>     than 1500 bytes or so, and that all ITRs and ETRs are to
>     be located in places where they have at least 1280 (or
>     similar) byte MTUs to the DFZ.
> 
> 2 - As sprite-mtu does, try sending the entire outer packet
>     (assuming it fits in the MTU limit of the next hop)
>     while also sending a PTB to the SH, with a value
>     designed to ensure the SH sends only packets which, once
>     encapsulated, will fit within the ITR's current estimate
>     of the PMTU to the ETR.  Initially, this estimate is a
>     low default value, since the ETR probing is not yet
>     complete.
> 
> I have been focusing on hosts which assume they can send
> packets up to 1500 bytes long, and trying to make the ITR-ETR
> system handle these, initially, with fragmentation, but then
> use a PTB (or as many as are required before the SH changes
> its ways) to get the SH to send somewhat smaller packets.
> 
> I need to think more about the future, where hosts might
> send really large packets, many times bigger than 1500 bytes,
> while the ITR still has to assume a figure of ~1280 for the
> MTU to an ETR it hasn't yet probed.
> 
> When fragmentation means many more than 2 fragments, your
> objection to fragmentation makes more sense to me.
> 
> Here I will consider jumboframes to be 16114 bytes, according
> to Iljitsch's: http://psg.com/lists/rrg/2007/msg00628.html .
> Jumboframes could be longer, which is more difficult still.
> 
> With IPTM as it stands and a jumboframe-inclined SH, which
> sends ~16114 byte packets, the ITR would fragment it into 11
> packets.  This greatly increases inefficiencies and the chance
> of one being lost - assuming the PMTU to the ETR, and the ETR
> itself, was ready for (16114 + ENCAPS) byte packets.
> 
> My primary goal is to handle existing hosts - which are assumed
> generally not to be jumbo-inclined, and which I guess generally
> put out packets of ~1500 bytes, according to their local link
> MTUs - in an environment with widespread ITR-ETR use, over links
> which may have slightly sub-1500 MTUs themselves, causing the
> ITR-ETR tunneling overhead to limit the payload size to some
> value ~20 to ~100 bytes below 1500.
> 
> I think IPTM does a better job of this than sprite-mtu - because
> I think IPTM's fragmenting of larger packets into two is a better
> course of action than sending a PTB (and also sending the single
> large outer packet).
> 
> A secondary goal - which will be really important in the long
> run - is to craft IPTM so it works well in a world where
> many links, but not all, handle jumboframes, and where hosts
> tend to send jumboframes.
> 
> One approach is to do something to IPTM in a fixed way to
> create happy outcomes in both scenarios.  Another is to
> make some requirement of hosts in the future which are
> inclined to send jumboframes - I would rather avoid this.
> 
> Another is to plan for some overall change to the behavior
> of IPTM devices in the future.  This is assuming that there
> isn't a single way of optimising for both scenarios,
> so at some point, when it is worthwhile, we switch over to
> an alternative mode of operation, different config settings
> etc. which is worse for the old scenario, but better for
> the jumboframes scenario.
> 
> 
> >> I recognise that fragmentation involves several costs:
> >>
> >> 1 - More work for the routers at each end.
> > 
> > That brings up another question; were you expecting to
> > fragment the packet in the ITR's stack before sending,
> > or just send with DF=0 and let the network fragment?
> > The literature and recent discussion consensus seems
> > to greatly favor the former.
> 
> Hmm ... I was intending the ITR to fragment the packet.
> 
> This was to avoid a situation where there are two or more
> MTU bottlenecks in the path:
> 
>   ITR sends 1 x 6000 byte packet.
> 
>   R1 (seeing MTU to R2 is 3000 bytes) fragments this to
>   3 x 2020 ((6000 / 3) + 20) byte packets.
> 
>   R2 (seeing MTU to R3 is 1500 bytes) fragments these to
>   6 x 1030 ((2020 / 2) + 20) byte packets.
> 
> IPTM would have assumed a ~1280 byte PMTU, until it had
> completed probing the ETR, so it would have fragmented
> the packet to 5 x 1220 byte packets ((6000 / 5) + 20).
> 
> This is marginally more efficient than the above, except
> for where the single packet can travel most of the distance
> before fragmentation.
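
The comparison above can be checked with a rough sketch. It assumes standard IPv4 fragmentation (a 20-byte header without options, and every fragment payload except the last being a multiple of 8 bytes), so the fragment sizes come out uneven rather than as the equal splits used in the example. The function names `fragment` and `through_path` are made up for illustration and are not from any ITR implementation.

```python
IP_HEADER = 20  # bytes; assumes an IPv4 header with no options


def fragment(total_len, mtu, header=IP_HEADER):
    """Return the total lengths of the fragments one packet is split into."""
    if total_len <= mtu:
        return [total_len]
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    max_payload = (mtu - header) // 8 * 8
    payload = total_len - header
    sizes = []
    while payload > 0:
        chunk = min(max_payload, payload)
        sizes.append(chunk + header)
        payload -= chunk
    return sizes


def through_path(total_len, mtus):
    """Fragment a packet hop by hop across successive MTU bottlenecks."""
    packets = [total_len]
    for mtu in mtus:
        packets = [size for p in packets for size in fragment(p, mtu)]
    return packets


# Network fragmentation at R1 (MTU 3000) and then at R2 (MTU 1500).
network = through_path(6000, [3000, 1500])
# The ITR fragmenting once, up front, to the assumed 1280-byte default.
itr = fragment(6000, 1280)

print(len(network), sum(network))  # packets and total bytes on the last link
print(len(itr), sum(itr))
```

Under these assumptions the network-fragmented case ends up as 7 packets rather than 6, because each 3000-byte-MTU fragment leaves a small tail fragment at the 1500-byte hop, while the ITR-fragmented case is 5 packets.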
> 
> Thanks for pointing this out.
> 
> 
> >> 2 - Two packets rather than one being handled by all
> >>     routers en-route.
> > 
> > Or more than 2, if the network also fragments.
> 
> Yes, but for packets which start off around 1500 bytes, and if
> we make a "1280 PMTU to the DFZ" rule for placing all ITRs and
> ETRs, we can assume that the packets will only be fragmented
> once into two.
> 
> 
> >> 3 - More data being sent, due to an extra packet's worth of
> >>     overhead.
> >>
> >> 4 - Most seriously, I think, a greater chance of loss of the total
> >>     packet due to its delivery depending on two packets, rather
> >>     than one.
> >>
> >>     Each packet has some risk of being lost, and the risk is
> >>     a little higher than would otherwise be the case, since the
> >>     two packets take up more time on the wire and more router
> >>     resources than the single original packet.
> >>
> >> 5 - Fragmentation failures due to 16 bit ID wraparound.  You
> >>     mentioned this being a problem with IPv4 and high packet rates
> >>     with longer delays, causing the wrong packets to be reassembled,
> >>     due to the 16 bit counter wrapping and them having the same
> >>     sequence number.
> >>
> >> Tackling point 5 first, in message 608, quoting Iljitsch, I asked
> >> whether this could be largely resolved by a shorter time window for
> >> reassembly in the ETR:
> >>
> >>  - - - -
> >>
> >>> I don't have any references, but in short, the issue is that you
> >>> have a 16 bit ID space with a reassembly timeout of something
> >>> like a few minutes. This means you can only send 65536 packets
> >>> during that "few minute" window or you'll incorrectly reassemble
> >>> fragments from different packets if you lose a fragment. This is
> >>> especially problematic if the fragmented packets belong to a
> >>> tunnel because in that case the IP source/dest addresses are
> >>> always the same.
> >>
> >> In a fresh system such as an ITR-ETR scheme, perhaps a workaround
> >> for this would be to set the maximum reassembly time at the ETR to
> >> something very much shorter, such as 1, 2 or 3 seconds?
> >>
> >> That could be a pain where the ETR is an additional function in a
> >> server or router with a pre-existing TCP/IP stack.  Then it would
> >> probably not be possible to shorten the time just for the ITR to
> >> ETR packets.
> >>
> >>  - - - -
> >>
> >> Can anyone comment on the prospects of this being an acceptable
> >> solution?
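
To put rough numbers on the wraparound concern, here is a minimal sketch of the arithmetic. It assumes the simplest possible model: one tunnel (a single source/destination pair), one ID consumed per packet, and a risk of mis-association as soon as the 16-bit ID space is reused within the reassembly timeout. The 120-second figure stands in for "a few minutes" and the 1500-byte packet size is likewise only an illustration.

```python
ID_SPACE = 2 ** 16  # distinct values of the 16-bit IPv4 Identification field


def max_safe_rate(timeout_s, packet_bytes=1500):
    """Highest packet rate (and Mbit/s) that does not reuse an ID in the window."""
    pps = ID_SPACE / timeout_s
    mbps = pps * packet_bytes * 8 / 1e6
    return pps, mbps


for timeout in (120, 3, 2, 1):
    pps, mbps = max_safe_rate(timeout)
    print(f"timeout {timeout:>3} s: {pps:8.0f} packets/s, {mbps:7.1f} Mbit/s")
```

With these assumptions a 120-second timeout allows only about 546 packets per second through the tunnel before IDs repeat, while a 2-second timeout allows 32768 packets per second.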
> > 
> > During some of the discussions (can't recall whether it
> > was on- or off-list) some Linux code was shown that uses an
> > out-of-order upper bound in terms of number of reassemblies
> > outstanding instead of a shortened timer when deciding whether
> > to purge an incomplete reassembly. That seemed like the right
> > way to go to me.
> 
> OK.  This sounds easier to implement than timers.  Hopefully
> someone can remember some information about this.
> 
> 
> 
> > Setting a shorter timer might not be such a good idea for
> > reasons outlined in RFC1122, Section 3.3.2. In some earlier
> > efforts, I delved into the RFC1122 suggestion of managing
> > round-trip times: 
> > 
> >   + It has been suggested that a cache might be kept of
> >   + round-trip times measured by transport protocols for
> >   + various destinations, and that these values might be used
> >   + to dynamically determine a reasonable reassembly timeout
> >   + value.  Further investigation of this approach is
> >   + required.
> > 
> > My conclusion was that a protocol between the ITR and ETR
> > could be devised to do this, but it was far too complicated
> > for practical purposes and had too many things that could
> > go wrong.
> 
> It is just the ETR which has to be changed.  This raises
> questions of whether the ETR is some special new device
> (unlikely) or a conventional PC-like host, which would
> typically rely on standard OS packet reassembly, which
> could be hard to alter.  A server specifically programmed
> to be an ETR might have tweaked networking code in its OS
> for this purpose.
> 
> Routers may or may not be amenable to tweaking their packet
> reassembly operations just because the router is functioning,
> in part, as an ETR.
> 
> 
> >> In an ITR-ETR scheme, we probably don't want to sit around waiting
> >> for packets which somehow get lost and arrive more than a second
> >> late via some circuitous route through Manangatang.
> > 
> > There is also the consideration of whether we want to make
> > this delay-tolerant, e.g., for multiple satellite hops,
> > interplanetary communications, etc.
> > 
> >> I will read:
> >>
> >>   IPv4 Reassembly Errors at High Data Rates
> >>   http://tools.ietf.org/html/rfc4963
> > 
> > OK.
> > 
> >> The other points don't, to me, constitute a "black hole" or a
> >> serious problem.  They seem to involve marginally greater efforts,
> >> marginally less efficiency and marginally greater chances of the
> >> final packet not being delivered.
> >>
> >> To me, this is a far better thing to do, for a few seconds or so
> >> into a potentially long flow, than dropping packets.
> >>
> >>
> >>> There are also other factors to consider, including that the ITR
> >>> may not have ultimate control over the setting of the ip_id. And,
> >>> the ETR may not be able to receive non-initial fragments in the
> >>> first place. (These factors can be mitigated by the placement of
> >>> the ITR and ETR in some use cases, however.)
> >>
> >> I don't clearly understand these points.  Can you explain 
> >> them further?
> > 
> > If the ITR is located behind a NAT or translating firewall,
> > the ip_id could be re-written. I believe a common approach
> > is to rewrite with a randomly-chosen 16-bit value. When
> > there are many SH's behind the same NAT/firewall and using
> > the same ITR, this could present a problem.
> 
> In Ivip, ITRs are never behind NAT.  I assume the same is
> true of other ITR-ETR schemes.
> 
> > About non-initial fragments, I am told that some NATs/firewalls
> > simply pass the first fragment and drop all others. Problem
> > being that non-initial fragments do not include the transport
> > layer header.
> 
> I don't fully understand this, but I think it doesn't matter
> if we know ITRs are never behind NAT etc.
> 
> 
> >>>> A "long" inner packet is one which, once encapsulated, would
> >>>> exceed the ITR's current best estimate of the PMTU.  This would
> >>>> initially be a default such as 1280 bytes.
> >>>>
> >>>> If this default value is above the minimum for the protocol,
> >>>> eg. 576 for IPv4, then this value of PMTU to the "core" of the
> >>>> net must be available to every ITR and ETR and would be part of
> >>>> the specification of the ITR-ETR scheme.
> >>>
> >>> The pathMTU cannot be known a priori; all that can be known a
> >>> priori is the EMTU_R of the ETR if there is a specification for
> >>> the minimum size.
> >>
> >> Yes.  I envisage some value around 1280 - at least something so
> >> that a 1500, 1520 etc. byte packet would only be fragmented into
> >> two packets.
> > 
> > Unless some new spec comes along, the only values for
> > EMTU_R that can be assumed are 576 bytes for IPv4 and
> > 1500 bytes for IPv6.
> 
> The new specification would be part of the ITR-ETR scheme's RFC.
> For instance:
> 
>    ITRs and ETRs must be located so that their PMTU to the
>    DFZ is at least 1280 bytes.
> 
> The number would need to be carefully chosen.  I use 1280
> because it is substantially below 1500 - and so able to
> accommodate several layers of tunneling.
> 
> The trick would be to make the number reasonably high, to
> reduce the number of initial packets the IPTM ITR would need
> to fragment, or worry about at all in terms of PMTU, without
> unnecessarily restricting the location of ITRs and ETRs.
> 
> 
> >>>> The default value would be replaced by a higher value once the
> >>>> probe process was complete.  The pattern would be something
> >>>> like this, assuming the SH's initial idea of PMTU to the
> >>>> Destination Host was 1500.  I will assume an ENCAPS overhead of
> >>>> 20 bytes (as with IPv4 Ivip, though other ITR-ETR schemes have
> >>>> higher overheads) and that all ITRs and ETRs are located so
> >>>> they have an MTU of at least 1280 from the DFZ.
> >>>
> >>> Do you mean to say "pathMTU" or "linkMTU"? In terms of "pathMTU",
> >>> do we need to consider links with configurable linkMTUs that
> >>> might have either mis-configured or overly-conservative values?
> >>
> >> I don't understand this clearly enough to respond at present.
> > 
> > May not have been worded very well. A while back, I think
> > it was Iljitsch who indirectly pointed out that any link
> > for which an MTU can be manually configured can also be
> > misconfigured. For example, what if an operator means to
> > set a linkMTU of 4500, but his finger slips and he ends
> > up setting only 450? Flows that use that link would
> > experience the inefficiency - but, they should still work. 
> 
> I don't think a major architectural addition like an ITR-ETR
> scheme with its sprite-mtu or IPTM or whatever PMTU management
> system should be required to cope gracefully with such
> misconfiguration.
> 
> 
> >>    Inner      ITR action on           SH's idea  DH gets
> >>    packet     outer packet            of PMTU
> >>    length     following encapsul-     to DH
> >>               ation of inner packet
> >>
> >>                                       1500
> >>
> >>
> >>    200       Send outer packet -      1500       The packet
> >>              the length is less than
> >>              1280.
> >>
> >>   1500       Fragment packet and      1500       The packet
> >>              commence probing                    (Less efficient
> >>                                                  and more error-
> >>                                                  prone tunnel with
> >>                                                  2 packets instead
> >>                                                  of 1, but this is
> >>                                                  only for a few
> >>                                                  seconds, I hope.)
> >>
> >>   1500       Fragment packet and      1500       The packet
> >>              continue probing
> >>
> >>              ... etc.
> >>
> >>> This could black hole and appear as congestion-related loss to
> >>> the SH.
> >>
> >> As noted above, I foresee only marginal efficiency and reliability
> >> impact, where you see unacceptable packet losses.
> > 
> > It's not so much the loss I am concerned with; IMHO,
> > sustained and unmitigated fragmentation is dangerous
> > even for short periods of time.
> 
> My feeling - and you have much more experience in this than I do -
> is that short bursts of fragmentation into two packets is a better
> choice than instantly telling the sending host to send much shorter
> packets than it probably can send.
> 
> But this is with my assumption the SH is only putting out packets
> about 1550 bytes long.  It doesn't hold if hosts are pushing out
> jumboframe packets over 2000 bytes or so.
> 
> 
> >>              Probing complete:
> >>              PMTU to ETR decided
> >>              to be 1460.
> >>
> >>   1400       Send outer packet -      1500       The packet
> >>              the length is <= 1460.
> >>              (This length would not
> >>              necessarily be sent - it
> >>              is just to show that the
> >>              ITR will now send longer
> >>              packets without frag-
> >>              mentation than before.)
> >>
> >>   1500       Drop the packet and send
> >>              the SH a PTB message
> >>              with value 1440.         1440       Nothing, but the
> >>                                                  ITR is usually
> >>                                                  close to the SH,
> >>                                                  and it doesn't
> >>                                                  take long for...
> >>
> >>> Depending on the placement of the ITR, this PTB might not make it
> >>> back to the SH.
> >>
> >> If there is a filter blocking PTB messages from the ITR to the SH,
> >> then I think ordinary, non-RFC4821, PMTUD would be clobbered anyway
> >> - with or without an ITR-ETR scheme.  This situation would not
> >> persist, I think, when an ITR-ETR scheme was widely deployed, as I
> >> discussed in:
> >>
> >>   http://psg.com/lists/rrg/2007/msg00636.html
> > 
> > See RFC2923 for a discussion of PTB filtering implications
> > for classical PMTUD. 
> 
> I think I have read this, but I will read it again.
> 
> 
> >>   1440      Send the outer packet -  1440        The packet
> >>             the length is <= 1460                (Now the tunnel
> >>                                                  is handling
> >>                                                  optimal length
> >>                                                  packets.)
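
The ITR behaviour walked through in the table can be summarised in a minimal sketch. It only restates the example's own numbers (a 20-byte ENCAPS overhead, a 1280-byte default, a probed PMTU of 1460); the class `Itr`, its method names and the returned action strings are hypothetical, and real probing, timers and fragmentation are left out.

```python
ENCAPS = 20          # encapsulation overhead assumed in the example (IPv4 Ivip)
DEFAULT_PMTU = 1280  # assumed PMTU estimate until probing of the ETR completes


class Itr:
    """Hypothetical per-ETR state for an IPTM-style ITR."""

    def __init__(self):
        self.pmtu = DEFAULT_PMTU
        self.probed = False

    def probe_complete(self, pmtu):
        """Record the PMTU to the ETR found by explicit probing."""
        self.pmtu = pmtu
        self.probed = True

    def handle(self, inner_len):
        """Return (action, ptb_value) for an inner packet of inner_len bytes."""
        outer_len = inner_len + ENCAPS
        if outer_len <= self.pmtu:
            return ("send", None)
        if not self.probed:
            # Too long for the default estimate: fragment and keep probing.
            return ("fragment", None)
        # Probing is done: drop, and tell the SH the largest inner packet
        # that still fits once encapsulated.
        return ("drop+ptb", self.pmtu - ENCAPS)


itr = Itr()
print(itr.handle(200))    # short packet before probing completes
print(itr.handle(1500))   # long packet while probing
itr.probe_complete(1460)
print(itr.handle(1400))
print(itr.handle(1500))
print(itr.handle(1440))
```

The PTB value of 1440 is simply the probed PMTU of 1460 minus the 20-byte overhead, which is why the SH's next packets of 1440 bytes pass without fragmentation.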
> >>
> >>>> This pattern would continue unless the ITR, with periodic
> >>>> probing, decides that the PMTU is less than 1460 (it might do
> >>>> this quickly if it got a PTB message from a router in a new,
> >>>> more MTU-challenged, path to the ETR), and if the SH sends a
> >>>> packet which would be too big for the new lower value of PMTU.
> >>>> Then the ITR would send another PTB message to the SH,
> >>>> with a lower value than 1440.
> >>>
> >>> There is not strictly any periodic probing needed to detect 
> >>> pathMTU reductions, since the data packets serve as virtual 
> >>> probes. The data packets will be lost and might be considered by
> >>> the SH as congestion-related loss if the PTB can't be translated
> >>> by the ITR and sent back to the SH. 
> >>
> >> In my IPTM proposal, the ITR would only get PTB messages from
> >> routers in the tunnel if the outer source address was that of the
> >> ITR.  This would be the case if IPTM was applied to LISP, eFIT-APT
> >> or TRRP.  Ivip uses the sending host's address in the outer header,
> >> so the ITR would never get a PTB message if the PMTU to the ETR
> >> suddenly became lower than what it had assumed, based on previous
> >> explicit probes acknowledged by the ETR.
> > 
> > I believe this represents a departure from some of the other
> > proposals. 
> 
> It is, for reasons including:
> 
>   1 - Making it relatively easy for the ETR to enforce the
>       ISP's filtering which rejects incoming packets with
>       source addresses from its own network - extending
>       this protection to the inner packet, by refusing to
>       forward to the destination host any packet where the
>       inner source address is different from the outer
>       source address.
> 
>       http://tools.ietf.org/html/draft-whittle-ivip-arch-00#section-14.1
> 
>   2 - Enabling the SH to perform traceroute, with a somewhat
>       modified traceroute program.
> 
> but this choice of outer source address = sending host address is
> also tied in with some other decisions in which Ivip is different
> from the other proposals to date.
> 
> 
> > Are you suggesting that the SH will always be
> > globally addressable from within the core? So, there is no
> > locator/id split? No public addressing for the ITR and
> > private addressing for the SH? No IP version mismatches
> > between the SH and ITR?
> 
> The SH may be behind NAT, but since the ITR is between the NAT and
> the DFZ, the ITR regards the SH's address as that of the NAT box.
> 
> SHs can be on ordinary addresses, receiving packets directly via
> the current BGP system, or they can be on Ivip-mapped addresses,
> receiving packets via ITRs and an ETR.  In Ivip, the mapped
> address blocks are all advertised in BGP, but with an "anycast"
> arrangement where multiple (hundreds of thousands, for instance)
> ITRDs (full database ITRs) advertise each mapped address block,
> causing packets from sending hosts to go to the nearest such ITRD.
> 
> I don't think of Ivip as a locator-ID separation protocol.  It
> probably is, but I see it as more of an additional system of
> plumbing - to do a much finer, and more rapidly flexible in time,
> job of splitting up the address space and getting packets to
> ETRs near the destination hosts than is practical with BGP.
> 
> 
> > In any event, assuming that the outer source is that of
> > the SH (and not the ITR), then the PTBs coming from within
> > the tunnel and delivered directly to the SH will forever
> > report a too-large MTU because the SH has no way of knowing
> > that an ITR on the path will be inserting ENCAPS overhead.
> > Even worse, the data included in the PTB will include the
> > ENCAPS inserted by the ITR and will be unrecognizable
> > to the SH.
> 
> Yes, so a correctly implemented SH should ignore them - so
> these PTBs will cause no trouble.
> 
> The SH needs to get a PTB from the ITR, which is what IPTM
> is intended to achieve.
> 
> >> This would cause a black hole for all packets which, once
> >> encapsulated, were longer than the new, lower, PMTU between the ITR
> >> and ETR.  To guard against this, the ITR needs to periodically
> >> probe an ETR to which it is continually sending long packets.
> >> There could be other techniques, such as an ITR-ETR protocol which
> >> enables the ITR to receive acknowledgement of packets it sends, but
> >> this gets pretty complex, and I am hoping to avoid such things.
> > 
> > Unless the period for probing is very short, there is
> > opportunity for a self-sustaining denial-of-service to
> > the SH. 
> 
> I don't see how there could be a DoS attack as such.
> 
> Unless I add something to IPTM to have the ITR receive
> acknowledgements, say every 10 to 60 seconds, from the ETR,
> then the ITR has no way of knowing that the tunnel is
> dropping packets, due to the PMTU to the ETR becoming
> lower than the value discovered during probing.
> 
> Another approach is to have more frequent ITR probing of
> the ETR, as you suggest.
> 
> I intend IPTM to be usable by other schemes than Ivip.  For
> those schemes, the IPTM ITR would usually be able to detect
> that the tunnel PMTU had been reduced, since the ITR would
> get the PTB messages from one of the routers in the tunnel.
> 
> However, this is not robust, because there could be a filter
> between the ETR and ITR which might prevent the PTB packets
> from reaching the ITR.
> 
> IPTM for Ivip doesn't care about any filtering of PTB
> messages between the ITR and the ETR.  It relies entirely
> on explicit probes from the ITR to the ETR.
> 
> 
> >>> But, the ITR will be able to return the correct PTB when the SH
> >>> retransmits. This is the same as for sprite-mtu.
> >>
> >> Once the ITR has a correct idea of the new, lower, PMTU to the ETR,
> >> it can drop (IPTM) or send (sprite-mtu) the packet, and generate a
> >> PTB to the SH when the SH next sends a packet which, once
> >> encapsulated, would be too long for that new PMTU.
> > 
> > This loses out on the opportunity to use the packets
> > sent into the tunnel as probes to detect MTU increases.
> > And, if you are expecting your ITR to periodically
> > re-probe to detect MTU increases in time to satisfy
> > MTU-probing SH, it is going to require excessive probing
> > overhead and may not get the job done in time to avoid
> > black-holing. 
> 
> I am not too fussed about how rapidly the ITR detects an
> increase in PMTU to the ETR.  Delay in discovering this only
> means a period of lost opportunity for increased efficiency.
> 
> Delay in discovering a shorter PMTU is a much more serious
> problem, because that leads to black holes of complete
> packet loss.
> 
> Of course, if RFC4821 is widely implemented, the SH will
> discover the packet loss promptly and try shorter packets.
> 
> I think RFC4821 is basically a good thing.  IPTM doesn't
> get in its way, and in this particular packet loss situation
> which IPTM as currently defined would create, RFC4821 would
> make this limitation of IPTM not much of a problem.
> 
> However, IPTM doesn't require or (I think) otherwise benefit
> from RFC4821.
> 
> 
> >>>> Alternatively, occasional probing by the ITR might discover a
> >>>> higher value of PMTU to this ETR, and the SH could discover
> >>>> this increase by trying its luck with a larger packet - and
> >>>> either having it accepted, or rejected with a PTB containing
> >>>> the new higher value, minus 20.
> >>>
> >>> SHs that don't implement RFC4821 will have to wait for a long
> >>> time before trying a larger packet (RFCs 1191 and 1981 say 10min,
> >>> I believe). 
> >>
> >> Indeed.  This is why my critique of sprite-mtu seems to be
> >> important.  As I understand it, the SH first gets a rather low
> >> value in the PTB message from the ITR, at a part of the exchange at
> >> which an IPTM ITR would have simply fragmented the packet without
> >> any PTB, and commenced or continued probing the ETR.
> > 
> > That depends on what you consider to be "low". Depending
> > on the placement of the ITR, "low" will only go as low
> > as 1280 but may go as large as, e.g., 1480. 
> 
> Ignoring for the moment IPTM's current difficulty in coping
> nicely with SHs and tunnel paths which handle jumboframes, my
> critique of sprite-mtu is one of the relative efficiency,
> between, for instance:
> 
>  1 - IPTM fragmenting the first long packets while it probes
>      the ETR, and then settling down to a nice ca. 1460
>      packet length from the SH, after the ITR sends a PTB
>      message to it.
> 
>  2 - Sprite-mtu, as I understand it, first sending a PTB of
>      ca. 1260, and the SH using this for the next 10 minutes
>      before it is allowed to try sending a longer packet.
> 
> How high can we make this ~1280 default figure?  The higher
> the better, as long as we don't unnecessarily restrict the
> placement of ITRs and ETRs.
> 
> However, since many or most flows of information last less
> than 10 minutes, and since most SHs (in my view) will not
> be using RFC4821, this means that with sprite-mtu, most
> traffic would be stuck with the default value MTU, rather
> than whatever higher value the ITR discovers with a few
> seconds of probing.
> 
> I think sprite-mtu only works well if the SH uses RFC4821.
> 
> An RFC4821 SH won't be negatively affected by the ITR both
> forwarding its packet and sending it a PTB message (with a
> PMTU value ca. 1260 or so, as per my example - or perhaps
> (576 - ENCAPS)), because if the packet it sent was not too
> long for the tunnel (once encapsulated) then it would get
> end-to-end RFC4821 confirmation of delivery, and so ignore
> the PTB message.
> 
> If it took notice of the initial PTB message(s), an RFC4821
> SH is still able to fight its way out of this inefficient
> situation (sending packets of ~1260 or less bytes) by
> trying a longer packet length in a shorter time than allowed
> for non RFC4821 compliant hosts.
> 
> 
> >> If it takes at least 10 minutes for a non-RFC4821 compliant host to
> >> try sending larger packets, then this is a long time for the
> >> communication to be restricted to the shorter packets.
> > 
> > SHs are therefore advised to begin implementing RFC4821.
> > Deployment is incremental and involves only the SH.
> 
> I think this makes the acceptable behavior of the ITR-ETR
> scheme (at least in terms of solving the PMTUD problem
> which bedevils every such scheme) dependent on host changes.
> 
> This is at odds with my vision of incremental deployability.
> 
> I wrote about why I think RFC4821 is a very demanding host
> change which I think is unlikely to be widely adopted in
> the time frame in which an ITR-ETR scheme needs to be
> introduced, which I see as being within 5 years.
> 
> Does anyone argue that RFC4821 adoption in desktops and
> servers is actually going to be widespread in the next
> 5 or 10 years?  I know it is a new RFC, but what is the
> current status of implementation, in operating systems
> (TCP) and in applications?  Applications can only do it
> if the OS supports it.
> 
> 
> >> I am not assuming widespread adoption of RFC4821 at any time.  It
> >> looks really complex to implement, involving applications and the
> >> TCP layer communicating with a new function in the OS in ways which
> >> were not originally part of the protocol stack.   Writing all this
> >> code, for marginal immediate benefit, and then trying to debug it
> >> in all its possible combinations of applications, live network
> >> settings etc. sounds really, really, complex.
> > 
> > I disagree; active end-to-end involvement in MTU determination
> > is important for the long term.
> 
> I agree it is important in the long term, but is it
> really going to happen in the next few years?
> 
> Sufficiently for it to be widely enough deployed that
> those SHs which don't have it and therefore don't work
> well with sprite-mtu, will experience persistent PMTU
> difficulties as a result of the introduction of the
> ITR-ETR scheme?
> 
> If there are a significant number of hosts (maybe as
> few as 10% or less) which are not up-to-speed with
> RFC4821, in the early days of introducing an ITR-ETR
> scheme (I hope 2010 or 2011), then I fear this would
> have the effect of making addresses mapped by
> the ITR-ETR scheme suck - which would create an
> impenetrable barrier to the introduction of the scheme,
> and therefore doom our beloved Internet to eternal
> twilight and doom.  (Or boost the fortunes of Cisco
> et al. as everyone buys new routers to cope with
> millions of DFZ routes - the demand for IPv4 space
> in smaller increments is sure to drive growth in
> advertised prefixes up precipitously in the next
> 5 to 10 years.)
> 
> ...
> 
> >>>> It seems strange to me to send the packet (unfragmented, I
> >>>> assume) while also sending back a PTB message to the sending
> >>>> host.  Wouldn't this cause needless traffic and/or confusing
> >>>> signals to the SH if the outer packet does in fact arrive at
> >>>> the ETR and therefore the inner packet is delivered to the
> >>>> destination host?
> >>>
> >>> To the SH, it would appear that there is a router on the path
> >>> returning inaccurate information. This can happen already today,
> >>> since routers can be misconfigured, and spoofed PTBs can be sent
> >>> from any node in the network.
> >>
> >> It still seems strange, confusing and inefficient to me.
> > 
> > I disagree; there is value in sending the packet into the
> > tunnel on at least three levels: 1) it serves as a virtual
> > probe so that the ITR can detect MTU restrictions further
> > down the tunnel, 2) the packet may be an MTU probe of the
> > SH, 3) Packet delivery ratio may benefit in some use cases. 
> 
> I can see this value, although IPTM at present has no way of
> using encapsulated traffic packets as probes.  I would have
> to add an ITR-ETR protocol for that to be available - and
> that would involve either adding a special header or trailer
> to the encapsulated packet (so the ETR could uniquely
> identify it), or devising some robust scheme by which the ETR
> could tell the ITR exactly which packets it received, based
> entirely on the naturally occurring characteristics of the
> encapsulated packets.  Either option involves the ITR in
> a bunch of record keeping and communications guff with the
> ETR - and I am trying to keep this really lightweight.
> 
> 
> >>> SHs that implement RFC4821 should not have a problem 
> >>> deconflicting the (suspect) PTB information from (authentic) 
> >>> end-to-end feedback from the DH, but should benefit from the PTB
> >>> info when the actual data is not delivered to the DH.
> >>
> >> Yes, but I am assuming that none, or few, hosts will implement
> >> RFC4821 any time soon.
> > 
> > Incrementally deployable; touches SH only; realizing
> > larger MTUs gives incentive for deployment.
> >  
> >>> ITRs can help the situation by sending sprites of, e.g., 1500
> >>> bytes into the tunnel early in the process so that most if not
> >>> all SHs that use the tunnel will see a 1500 byte or larger MTU.
> >>
> >> Does "early in the process" mean when only shorter packets have so
> >> far needed to be tunneled to the ETR?
> >>
> >> If so, then the ITRs could be generating large volumes (in bytes)
> >> of probe packets in response to only small traffic flows, and to
> >> some flows which never in fact require PMTU knowledge, since the
> >> flows never actually use long packets.
> > 
> > There is nothing mandated here, and implementations will
> > be evaluated on the merits of their probing strategies. 
> 
> I think an RFC for sprite-mtu or IPTM should give some guidance
> on when to probe, but allow the ITR to make its own decisions
> and/or be configured to suit local conditions.
> 
> 
> >>>> Here I will assume IPv4 only, with 1280 bytes for the default
> >>>> PMTU for every ETR the ITR has not yet probed.  I will also
> >>>> assume an encapsulation overhead of 20, although this would
> >>>> typically be higher for Sprite and non-Ivip ITR-ETR schemes.
> >>> I don't understand "higher for sprite-mtu"?
> 
> >> This was a low-key aside.  In my second example, trying to explain
> >> how I thought sprite-mtu might work, I kept the same 20 byte
> >> encapsulation overhead I used in my first example, which was for
> >> IPTM, assuming Ivip's 20 byte overhead.
> >>
> >> Other ITR-ETR schemes, and I guess most other tunneling schemes
> >> sprite-mtu would be applied to, have higher encapsulation
> >> overheads, I think.
> > 
> > The ENCAPS overhead is orthogonal to the use- or non-use
> > of sprite-mtu.
> 
> I was just explaining my continuance of 20 as the ENCAPS value
> from one example to the next.  However, I understand that
> sprite-mtu does involve extra trailers on some traffic
> packets.
> 
> 
> >>>> If the ITR sends a PTB message to the SH when the first packet
> >>>> (or multiple packets) length exceeds the default PMTU value and
> >>>> then, after probing, decides the PMTU is 1480, then I am
> >>>> concerned that the SH would get contradictory values in these
> >>>> PTB messages.
> >>>>
> >>>> At first the SH would be told to send packets no longer than
> >>>> (1280 - 20 = 1260) and later, it would be told to send packets
> >>>> no longer than (1480 - 20 = 1460).
> >>>
> >>> Note: in the next draft version I would like to rewrite the
> >>> second bullet of Section 5.6.4 as:
> >>>
> >>> o  for IPv4/*/IPv4 tunnels, 'pathMTU' is less than MIN(EMTU_R,
> >>>    1280+ENCAPS) bytes and the inner IPv4 packet is no larger than
> >>>    MIN(EMTU_R-ENCAPS, 1280).
> >>
> >> The current version is:
> >>
> >>   o  for IPv4/*/IPv4 tunnels, 'pathMTU' is less than MIN(EMTU_R,
> >>      1280) bytes and the inner IPv4 packet is no larger than
> >>      MIN(EMTU_R, 1280) minus ENCAPS.  (When EMTU_R for the TFE is
> >>      not known, 576 bytes must be assumed.)
> >>
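
The two wordings of this bullet can be compared side by side with a small sketch. It only evaluates the two conditions as quoted; what the bullet gates is defined in Section 5.6.4 of the draft and is not restated here. The function names are hypothetical, and requiring a known EMTU_R for the proposed wording is an assumption, since the quoted rewrite does not say what to do when it is unknown:

```python
# Sketch of the two quoted conditions for IPv4/*/IPv4 tunnels.
# Function names are hypothetical; only the arithmetic comes from the text.
UNKNOWN_EMTU_R = 576   # current wording: assumed when EMTU_R is not known


def current_rule(path_mtu, inner_len, encaps, emtu_r=None):
    """Current wording: both limits are based on MIN(EMTU_R, 1280)."""
    if emtu_r is None:
        emtu_r = UNKNOWN_EMTU_R
    limit = min(emtu_r, 1280)
    return path_mtu < limit and inner_len <= limit - encaps


def proposed_rule(path_mtu, inner_len, encaps, emtu_r):
    """Proposed wording: ENCAPS moves inside the MIN() terms."""
    return (path_mtu < min(emtu_r, 1280 + encaps)
            and inner_len <= min(emtu_r - encaps, 1280))


if __name__ == "__main__":
    # With EMTU_R = 1500 and ENCAPS = 20 the proposed wording admits
    # inner packets up to 1280 bytes rather than 1260.
    for inner in (1260, 1270, 1280, 1281):
        print(inner,
              current_rule(1270, inner, 20, emtu_r=1500),
              proposed_rule(1270, inner, 20, emtu_r=1500))
```

With a large EMTU_R the practical difference is that the proposed wording lets a full 1280 byte inner packet qualify, where the current wording stops at 1280 minus ENCAPS.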
> >> OK.  My eyes are glazing over right now.
> > 
> > Reaching that point also myself...
> 
> Indeed.  Is anyone else keeping up with this epic correspondence?
> 
> 
> >>              SH breaks the message
> >>              into smaller packets
> >>              and retries:
> >>
> >>   1260       Send packet and          1260       The packet
> >>              continue probing
> >>
> >>              ... etc.
> >>
> >>              Probing complete:
> >>              PMTU to ETR decided
> >>              to be 1460.
> >>
> >>> By probing, do you mean by the ITR or by the SH? 
> >>
> >> I meant the ITR sends sprite probes to the ETR.
> > 
> > Do you mean sprite-mtu probes and not traceroutes? One
> > thing I do not understand about your proposal is how
> > you expect the traceroutes to be efficient and converge
> > within a reasonable amount of time? Also, there is no
> > guarantee of getting the ICMPs back from the network
> > middleboxes. Am I missing something?
> 
> This example is to explore my understanding of sprite-mtu.
> 
> Above, I mean that the ITR uses special probe packets to the
> ETR, called "sprites".  I don't recall traceroute being a
> part of your proposal.
> 
> In IPTM, the ITRs or ETRs don't do traceroute.  With Ivip's
> "outer source address = sending host address" approach,
> I propose that a modified traceroute program should be able
> to trace all routers, including those in the tunnel, assuming
> the ITR replicates the TTL value when encapsulating.  This
> would involve moderate changes to the traceroute code to
> recognise the ICMP packets which come back from the tunnel,
> which are for the outer packet, not the inner as sent by
> the traceroute program.  Such a traceroute program would be
> able to depict which routers were in the tunnel, and also
> determine the ETR address, from the ICMP messages coming
> back from routers in the tunnel.  I have only looked at
> this quickly, so perhaps there are problems with this
> proposal.
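
As a very rough sketch of the change such a traceroute program would need, the following classifies the IPv4 header quoted inside a returned ICMP error. It assumes IPv4-in-IPv4 encapsulation (protocol number 4) and that the outer source address is the SH's, so the ICMP message reaches the traceroute program at all. The function names are hypothetical and no real traceroute implementation is being described:

```python
import socket
import struct

IPPROTO_IPIP = 4   # assumed encapsulation: IPv4-in-IPv4


def build_header(src, dst, proto, ttl=1, total_len=60):
    """Hypothetical helper: a bare 20 byte IPv4 header for illustration."""
    return struct.pack("!BBHHHBBH4s4s", 0x45, 0, total_len, 0, 0,
                       ttl, proto, 0,
                       socket.inet_aton(src), socket.inet_aton(dst))


def classify_icmp_quote(quoted, probe_dst):
    """Decide where an ICMP error came from, given the header it quotes.

    quoted    -- bytes of the IPv4 header quoted in the ICMP message
    probe_dst -- destination address the traceroute probe was sent to
    Returns (kind, etr_address); etr_address is None unless in-tunnel.
    """
    if len(quoted) < 20:
        raise ValueError("quoted IPv4 header is truncated")
    proto = quoted[9]                         # protocol field
    dst = socket.inet_ntoa(quoted[16:20])     # destination address
    if dst == probe_dst:
        # The router saw the probe as the SH sent it: not in the tunnel.
        return ("outside-tunnel", None)
    if proto == IPPROTO_IPIP:
        # The router saw the outer packet, whose destination is the ETR.
        return ("in-tunnel", dst)
    return ("unrelated", None)


if __name__ == "__main__":
    outer = build_header("192.0.2.1", "203.0.113.9", IPPROTO_IPIP)
    print(classify_icmp_quote(outer, "198.51.100.7"))
```

A real program would also have to match the message to a particular probe, which this sketch does not attempt.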
> 
> Enabling traceroute from the SH, all the way through
> the ITR-ETR tunnel, is not possible with other ITR-ETR
> schemes unless the ITR performs heroics which I think are
> prohibitively demanding.  Traceroute is a valuable
> debugging tool and as far as I know (someone confirmed
> this on-list) no applications rely on traceroute, so this
> would be purely an administrative and debugging benefit.
> 
> 
> >>> I am assuming that SHs will begin using RFC4821 and will
> >>> probe the path for themselves independent of any probing
> >>> done by the ITR.
> >>
> >> I am not assuming hosts will be any different than they are today.
> >> As far as I know, few, if any, implement RFC4821.
> > 
> > If hosts see the value of larger MTUs, they will begin
> > to deploy RFC4821. (Or, if vendors see the value, they
> > will begin to push out RFC4821 in automated S/W updates.)
> > IMHO, PMTUD cannot be efficiently and correctly handled
> > within network middleboxes alone; end-to-end involvement
> > is needed as well.
> 
> As noted above, I think it is a great idea, but it is
> very complex and requires applications and OS to work
> together.  So the OS framework and TCP stuff needs to be
> done before an application could be made to work with it.
> 
> I have no idea to what extent this is being done, but it
> sounds very complex and likely to happen slowly at best.
> 
> 
> >> If you are assuming this, I think it would be good to make it an
> >> explicit condition you are designing sprite-mtu to function within.
> > 
> > sprite-mtu works independently of RFC4821, but end systems
> > benefit from using RFC4821.
> 
> But without RFC4821 in the SH, as far as I know, the SH
> will be constrained for 10 minutes at least to the PMTU
> value it gets in the first PTB message sent by the ITR.
> 
> This is for a lower value than the ITR would send after a
> few seconds probing.
> 
> So I think an RFC4821 compliant SH is necessary for your
> system to deliver communications with packet lengths
> longer than the low, default, value the ITR must assume
> at first.
> 
> If my understanding is correct, I think it would be good
> to state this in some way in later versions of your ID.
> 
> 
> >>   1260       Send outer packet -      1260       The packet
> >>              the length is <= 1460.
> >>
> >>   1260       SH would probably keep              The packet -
> >>              sending packets of                  but more and
> >>              length <= 1260.  Unless             shorter
> >>              the SH was pushy, it                packets than
> >>              would never discover the            the ITR-ETR
> >>              PMTU it could use was               tunnel can
> >>              in fact 1460.                       handle.
> >>
> >>> IMHO, SHs that use RFC4821 can be "pushy" within reason.
> >>
> >> Yes, but I think it will be a long time before there are many such
> >> hosts.
> > 
> > Why not?
> 
> I already discussed the complex software requirements for
> implementing RFC4821 in hosts, including the establishment of
> new paths of two-way communication between applications and
> whatever part of the OS network code is responsible for
> RFC4821.
> 
> I think that requires new standards in OS calls, and I guess
> ideally some OS independent way of writing applications
> which handles this stuff.
> 
> 
> >>> Maybe I should add something about this to the spec?
> >> Yes, I think the more explanation of where numbers like 1280 come
> >> from, the better.
> > 
> > OK.
> 
>   Cheers
> 
>     - Robin
> 
> 
