Re: [RRG] Path MTU Discovery: a new approach



Hi Bill,

Thanks for your summary, which is correct in many respects - pretty
good going if you only read the list message, rather than the page
itself:

  http://www.firstpr.com.au/ip/ivip/pmtud-frag/

> 1. Have the ITR maintain an "uncertainty zone" for sizes of packets
> that can be sent to a given ETR. The uncertainty zone is bounded by a
> size previously determined to be smaller than or equal to the actual
> PMTU (LPME) and a size previously determined to be larger than the
> actual PMTU (UPME).

Yes.


> 2. The ITR encapsulates and transmits packets smaller than LPME
> normally. 

Yes, except the ITR should probably send a few such packets with
RPD2 (BTW, if anyone can think of a better acronym  ...) to explore
the possibility that the Real PMTU is now lower than LPME or higher
than UPME.  These need to be rate limited.  Most of the time, there
will be no such change from month-to-month, but sometimes there will be.


> It rejects packets larger than UPME immediately with a too-big
> message.

Yes, except occasionally, when it uses one as an exploratory
probe to detect whether the Real PMTU has risen above UPME.  If the
packet is not delivered, then it sends a PTB to the SH as you
describe, with MTU value equal to UPME.  (If it gets a PTB from the
tunnel, the MTU in that PTB is used to set an upper limit on UPME.)

The only packets which are always rejected with a PTB are those
which, once encapsulated, would exceed the MTU of the interface the
ITR uses to send packets to this ETR.
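
A minimal sketch of the size checks discussed so far, under stated
assumptions: the names `Action` and `classify` are hypothetical, all
lengths are encapsulated lengths, a packet exactly equal to UPME is
treated as known too big (following the definition of UPME in point 1),
and the rate-limited exploratory probes are left out.

```python
from enum import Enum


class Action(Enum):
    NORMAL = "encapsulate normally"
    RPD2 = "send with RPD2"
    PTB = "reject with Packet Too Big"


def classify(encap_len, lpme, upme, if_mtu):
    """Choose the handling for a packet of encapsulated length encap_len.

    lpme:   a size known to be <= the Real PMTU
    upme:   a size known to be >  the Real PMTU
    if_mtu: MTU of the interface the ITR uses towards this ETR
    """
    if encap_len > if_mtu:
        # Always rejected: the packet cannot even leave the ITR.
        return Action.PTB
    if encap_len <= lpme:
        return Action.NORMAL
    if encap_len >= upme:
        # Assumed boundary: UPME itself is already known to be too big.
        return Action.PTB
    # Strictly inside the Zone of Uncertainty.
    return Action.RPD2
```

For example, with LPME = 1400, UPME = 1480 and an interface MTU of 1500,
a 1450 byte packet falls in the zone and would be sent with RPD2.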


> 3. If the packet size is in the uncertainty zone, encapsulate it with
> RPD2 instead of the normal encapsulation and hold the original packet
> until the ETR responds. This encapsulation consists of two packets:
> one in the uncertainty zone and one smaller than LPME. 

Actually, the small one will not only be smaller than LPME, it will
be way smaller than some figure like 1200 bytes, which we assume can
be sent from any ITR to any ETR without PMTU problems.

> If successfully transmitted, the ETR will reassemble the two packets into 
> one before passing them on.

Yes - if the ETR receives the big Packet B and at least one small
Packet A.

This is true except for the just-mentioned occasional exploratory
probe packets of length longer than UPME or shorter than LPME.


> 4. The ETR is required to respond to the ITR with information about
> all communications associated with RPD2, in addition to delivering the
> packets. By comparing the ETR's response to the RPD2 messages with the
> RPD2 messages it sent, the ITR can narrow the uncertainty zone until
> LPME and UPME meet.
> 
> Please correct any part of that I misunderstood.

There are a few other points.

1 - Packet B, the large one, is sent with its outer header's source
    address set to the ITR's address.  This is true in all instances
    of RPD2, including Ivip.  In Ivip, the Packet As are sent with
    their outer source address being that of the SH.

2 - Therefore if Packet B gets to a router in the ITR --> ETR tunnel
    with an outgoing MTU which is too small for it, the ITR will
    receive a Packet Too Big message.  (Except if the Packet B or
    the PTB packet are dropped for some random reason, or if the PTB
    is blocked by a filter.  A BCP will say: Don't put your ITRs and
    ETRs behind such filters.)

3 - When the ITR gets a PTB from the tunnel, or is told by the ETR
    that Packet B didn't arrive within a reasonable, but short,
    time-frame (maybe after trying twice), it sends a PTB back to the
    Sending Host (SH) - so the SH will try again, with a smaller
    packet, and no data should be lost to the application.

4 - If the ITR simply gets nothing back from the ETR, it might try again.
    I am not sure what the ITR would do then, but I don't think it
    should be adjusting down its UPME variable, or sending PTBs to
    the SH, just because it can't get a report of any kind from the
    ETR.  This is probably a temporary glitch.  If it is permanent,
    then there's no point in sending a PTB anyway, since the data
    will never get to this ETR, at least via this ITR.

Also, the ITR always* learns something truthful when it uses RPD2 to
send a packet with a length within the Zone of Uncertainty.

*  This is not counting extreme cases where two attempts at sending
   the sets of packets do not result in the ITR receiving a report
   from the ETR - but that would be a case of at least temporarily
   very poor reachability between the two, so we can't expect
   anything better.
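
The four outcomes in points 2 to 4 can be sketched roughly as follows.
This is only an illustration under assumptions, not the procedure from
the pmtud-frag page: `TunnelState`, `on_probe_result` and the outcome
strings are hypothetical names, and reading a tunnel PTB with MTU m as
"sizes above m are too big" (so UPME is capped at m + 1) is one possible
interpretation.

```python
from dataclasses import dataclass


@dataclass
class TunnelState:
    lpme: int  # a size known to fit
    upme: int  # a size known to be too big


def on_probe_result(state, probe_len, outcome, ptb_mtu=None):
    """Update the zone after an RPD2 probe; return True if the ITR
    should send a PTB to the Sending Host."""
    if outcome == "delivered":
        # The ETR reported that Packet B arrived: probe_len fits.
        state.lpme = max(state.lpme, probe_len)
        return False
    if outcome == "ptb_from_tunnel":
        # Assumed reading: a hop MTU of ptb_mtu rules out larger sizes.
        state.upme = min(state.upme, ptb_mtu + 1)
        return True
    if outcome == "reported_missing":
        # The ETR got a Packet A but never Packet B (after retrying).
        state.upme = min(state.upme, probe_len)
        return True
    if outcome == "no_report":
        # Nothing at all from the ETR: probably a temporary glitch,
        # so leave the zone alone and send no PTB.
        return False
    raise ValueError("unknown outcome: " + outcome)
```

The "no_report" branch deliberately changes nothing, matching point 4:
the absence of any report says nothing about the Real PMTU.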


> Two questions, one note:
> 
> Question #1: How does the ITR determine that its old PMTU estimate has
> been invalidated, either because of a route change or because
> individual packets are being transmitted along multiple channels each
> with a different PMTU?

There needs to be some low rate of exploratory probing using RPD2
sending of some packets shorter than LPME and longer than UPME.


> If I understand you, packets are not transmitted with RPD2 unless the
> ITR believes the size falls in the uncertainty zone, 

Yes, except for the occasional exploratory shorter and longer packets.

> and not transmitted with the ITR's source IP address regardless,

The long Packet B of RPD2 is always sent with the outer header's
source address being that of the ITR.

> so the ITR has no real hope of seeing normal too-big complaints.
> So how does it ever decide that its estimated PMTU is no longer
> valid?

Ivip's ordinary encapsulation of traffic packets (IP-in-IP) has the
outer header's source address set to the SH's address.  So the ITR
gets no PTB from them, and a properly implemented RFC 1191 SH would
not recognise the PTB either.

A SH which was looking out for this kind of PTB could detect it, but
I haven't explored this and am determined not to make any part of
Ivip dependent on host changes - other perhaps than a souped up
traceroute program.

Occasional shorter and longer exploratory probe packets, with direct
reports from the ETR will detect changes in the Real PMTU outside
LPME to UPME - but not as fast as if the normally encapsulated
traffic packets had the ITR's address as their source *and* the ITR
could store enough state to securely validate PTB messages they cause.

A non-Ivip ITR, or some other device using this IPTM - RPD2
procedure probably could use the ordinary encapsulation to detect
the Real PMTU getting shorter than it currently assumes.  The trick
would be to only cache the information for a handful of the longest
packets.  There's no point in caching stuff for the shorter ones
while longer ones are being sent, close to or at the limit set by
LPME.
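
One way to picture "only cache the information for a handful of the
longest packets" is a small bounded min-heap.  This is a sketch under
assumptions: `LongestPacketCache` and `packet_id` are hypothetical (the
identifier could be whatever the ITR needs in order to match the header
bytes quoted in a PTB), and a real cache would also need to expire old
entries.

```python
import heapq


class LongestPacketCache:
    """Remember only the `capacity` longest recently sent packets."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self._heap = []  # min-heap of (length, packet_id)

    def record(self, length, packet_id):
        entry = (length, packet_id)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            # Evict the shortest cached packet in favour of a longer one.
            heapq.heapreplace(self._heap, entry)

    def matches(self, length, packet_id):
        """True if a PTB quoting this packet refers to a cached entry."""
        return (length, packet_id) in self._heap
```

Shorter packets never displace longer ones, so the state stays small
while traffic runs close to the limit set by LPME.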

Relying on securely checked PTBs is a pretty good way of finding out
that the Real PMTU has got shorter than LPME.  Using one or more
non-arrivals of the long probe packet at the ETR is not quite as
reliable, since this could occasionally occur due to bad luck with
packet loss.  It would be bad to lower LPME in a spurious way, due
just to non-arrival of the probe packet (rather than the gutsier way
of getting a real PTB).  This would result in the ITR sending a PTB
to the SH with a lower than needed MTU value.  The SH would then be
bound to use that value to limit its packet size for the next ten
minutes.  This is bad, but not disastrous - it is just a loss of
efficiency, rather than a loss of data or of connectivity.

Relying on a report from the ETR that a long packet did arrive OK is
the best way of detecting that the Real PMTU is higher than UPME.
The mere absence of PTBs is not as reliable, since they could be
dropped randomly (or the probe packet dropped randomly before it hit
the PMTU limiting router) - or perhaps the PTBs could be blocked by
ICMP filters which violate the BCP recommendation.

IPTM - RPD2 can do its job reliably without PTBs from the tunnel,
but if they are there, that is better.  The ITR has to be able to
get the PTBs it generates to SH, but if it can't do that, then we
are sunk anyway.

The sections:

  Discovering changes in Real PMTU

  An alternative to the RPD2 approach of splitting the traffic
  packet

discuss the various approaches, with and without Ivip's "outer
source = SH" approach, including some promising possibilities of
ITRs only caching some packets, and alternatives to RPD2's approach
of splitting the traffic packet.


> Question #2: nearly every ITR->ETR map will trigger the use of RPD2 as
> two associated end sites begin transmitting data. 

This is quite different from the debate about "pure pull" (LISP-ALT
and TRRP, though I now think neither is quite so pure) ITRs
frequently delaying initial packets.

Firstly, RPD2 is only used for packets longer than 1200 bytes.  This
means that almost all session establishments will not be encumbered
by RPD2, since I figure very few protocols start up with such long
initial packets.  Many kinds of traffic will never require packets
longer than 1200 or whatever bytes, including DNS and almost all
HTTP traffic in the client -> server direction.  I figure SMTP and
many other protocols only have big packets going in one direction
for each session.

Secondly, the burden of RPD2 is primarily due to involving the ITR's
and the ETR's central CPU.  There is also the burden of sending
extra packets, but the probe Packet B is the same length as an
ordinarily encapsulated packet, and the 2 or maybe 3 short Packet
A's are likely to be 100 bytes or less each.

There is no significant extra delay.  Assuming the Packet B and at
least one of the first two Packet A's get to the ETR, the traffic
packet is delivered.  This need not take more than a fraction of a
millisecond longer on high-speed links, unless the central CPU does
not have the capacity to attend to this promptly.  These delays
would be far shorter than the delay of looking up mapping in the ALT
or TRRP global query server system, or using their initial packet
delivery systems to get the packet to the ETR before the ITR has the
mapping.

Also, these RPD2 packets do not involve data loss to the
application.  Sometimes, they require a resend with a smaller packet
- but that is when the only way of delivering the original packet
would be via some fragmentation or other splitting mechanism, since
the packet, once encapsulated, was in fact too big for the tunnel PMTU.


> Given the complexity, you're looking at a general-purpose CPU on 
> both ends to handle this. What sort of impact does that have
> on the system capacity?

I can't say for sure.  I can't think of a simpler approach, and this
PMTUD stuff really does need to be solved.  There may well be some
gotchas, but the way it looks now is far better and cleaner than I
thought would be possible a few days ago.  Since October I have
assumed we would need synthetic probe packets and that it would be
necessary to break up some packets into smaller chunks to deliver
them in spite of PMTU limitations.

In this scheme, no traffic-carrying probe packet goes to waste.  It
is either delivered and the ITR learns about the Real PMTU, or it is
not delivered, and the ITR also learns - with no application data
loss.  Then the RFC 1191 SH automatically cooks up a shorter packet,
which is just what is needed for the ITR to find out more about the
Real PMTU.


> Note #1: in your document, you describe the ETR returning multiple
> packets to the ITR for each received RPD2 packet, until the ITR
> acknowledges receipt. This potentially resurrects our old friend, the
> smurf amplifier.

This is definitely a gotcha.  This IPTM - RRG stuff didn't exist two
days ago, so it is amenable to change.  Maybe limit the retries to a
single retry, or at most to two.  That only gives an amplification
factor of two or three.

The report packets would be pretty short, and if generated by an ETR
in response to bogus Packet As would be ignored by most devices,
including any ITR.

Perhaps a way to discourage attackers from using this aspect of the
ETR's functionality would be to ensure that the Packet As need to
be at least as long as the total length of the two or three ETR -> ITR
report packets.  But that just adds overhead to the entire protocol.
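
The amplification arithmetic can be written out as a small sketch.  The
function names and the byte figures in the example are illustrative
assumptions, not values from the proposal.

```python
def packet_amplification(retries):
    """Report packets the ETR sends per bogus Packet A:
    one initial report plus `retries` repeats."""
    return 1 + retries


def byte_amplification(packet_a_len, report_len, retries):
    """Bytes sent by the ETR per byte the attacker sends."""
    return report_len * (1 + retries) / packet_a_len
```

With two retries the packet count is amplified by a factor of three.  If
each report is 100 bytes and the Packet A is padded to 300 bytes, the
byte amplification is 1.0, so the ETR gives the attacker no gain in
volume, at the cost of the extra padding on every Packet A.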

  Cheers

    - Robin


--
to unsubscribe send a message to rrg-request@psg.com with the
word 'unsubscribe' in a single line as the message text body.
archive: <http://psg.com/lists/rrg/> & ftp://psg.com/pub/lists/rrg