
Re: Comments on draft-ietf-shim6-failure-detection



Hi, Bernard,

Thank you for providing these comments (and thanks, Jari, for forwarding them).

A couple of thoughts inline,

Spencer



Here are some comments from Bernard Aboba. These relate to
the failure and reachability detection and perhaps also more
generally the division of work between the shim and
other protocol layers. I wasn't quite sure what to say
in response, so with Bernard's permission, I'm posting
the comments here:

----

I read the SHIM6 failure detection documents, and had some comments
relating to the time scale of failure detection.  I think there are some
issues with respect to "conservation of packets" that are worth exploring.


Section 5

   Also, it
   would be unfortunate if both the IP layer and transport/application
   layer took action for the same problem, for instance by switching to
   a new address at the IP layer and throttling back due to "congestion"
   at the transport layer.

This is not necessarily undesirable.  If the path over which a TCP connection
travels changes, the transport parameters may have become invalid.  In such
a situation, studies have shown that re-estimation actually may improve
performance, as compared to continuing to operate with potentially invalid
values.

Two points here. First, I am interested in pointers to these studies, because I'd like to understand the issues a lot better than I do. Second, even if this is true, it's starting to feel like we're moving at least some transport functionality (liveness detection, for instance) into the network layer, which (as you point out) lives on a different timescale: theoretically we give link layers time to react, then we give routing time to react, and THEN TCP starts thinking about retransmission timeouts, ignoring fast retransmit for now.

This concerns me, because the TRIGTRAN discussions proved fairly intractable given a first-hop path change, and TCP is still TCP, so I'm not sure why SHIM6 will end up at a different place if we try to pass a lot of clues back and forth between TCP and IP. The only signal both ends of a TCP connection share is loss, which both ends have to track anyway, so we got a lot of non-interest in adding more stuff that TCP implementations had to track in addition to loss.

I would therefore argue that the important issue is not action
in multiple layers, but rather the avoidance of race conditions;  a
well-defined communication mechanism between the IP and transport/application
layer can help with this.

I agree here, with the caveat that it's challenging to know that we've avoided race conditions when they involve at least one protocol with adaptive timers.

   But it is less clear which protocol(s) should discover end-to-end
   connectivity problems or recover from them.  One answer is that this
   is clearly within the domain of multihoming protocol.  By performing
   testing and failure detection of the used path and switching to a new
   path if necessary, the transport and application protocols can work
   unchanged.

I am not clear that the "multi-homing protocol" necessarily has the right
information to do testing and failure detection correctly.

For example, it does not make sense to diagnose a "connectivity problem"
on a time scale less than RTO.  Yet only the transport layer typically
possesses the RTO estimate.

Yes, exactly.

I can only add that I'm not sure where we are on one-way data streams (where the receiver does not send ACKs, or sends them very infrequently). It's hard for the IP layer to know whether "silence is OK", and if the transport/application layer has to provide this information, I'm not sure what value SHIM6 is adding.

We can explicitly say that we don't believe such things exist in the real world, but it would be good if SHIM6 did not prevent these applications from working if it DOES encounter them on live networks.
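
On the RTO point above, here is a minimal sketch of what the transport layer knows and the shim does not. It follows the smoothed RTT and RTT variance calculation in RFC 6298 (alpha = 1/8, beta = 1/4, K = 4, 1 second minimum). The class name RtoEstimator and the helper silence_exceeds_rto are hypothetical names made up for illustration; nothing like them appears in the SHIM6 drafts.

```python
class RtoEstimator:
    """Illustrative RTO estimator following the RFC 6298 update rules."""

    ALPHA = 1 / 8   # gain for the smoothed RTT
    BETA = 1 / 4    # gain for the RTT variance
    K = 4           # variance multiplier
    MIN_RTO = 1.0   # seconds, lower bound from RFC 6298

    def __init__(self, clock_granularity=0.001):
        self.g = clock_granularity
        self.srtt = None      # no RTT measurement yet
        self.rttvar = None
        self.rto = 1.0        # initial RTO before any measurement

    def sample(self, rtt):
        """Feed one RTT measurement (seconds) and return the new RTO."""
        if self.srtt is None:
            # First measurement: SRTT = R, RTTVAR = R / 2
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the previous SRTT
            self.rttvar = ((1 - self.BETA) * self.rttvar
                           + self.BETA * abs(self.srtt - rtt))
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        self.rto = max(self.MIN_RTO,
                       self.srtt + max(self.g, self.K * self.rttvar))
        return self.rto


def silence_exceeds_rto(silence_seconds, estimator):
    """Hypothetical check: has the path been quiet for at least one RTO?"""
    return silence_seconds >= estimator.rto
```

The state here (srtt, rttvar) is per connection and adapts with every sample, so a layer below the transport would either have to duplicate it or be told about it.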

Similarly, if the cause of the connectivity loss is a route flap, then
only the routing layer might have knowledge of the loss of the route, and
only if it is participating in the routing mesh.  For example, in ad hoc
networks, missing routes are a frequent contributor to packet loss, so
that integration of the routing and transport layers is required to be
able to respond appropriately.

On a global scale, BGP route flaps can last for a few seconds (though
rarely longer than 30 seconds), suggesting a minimum time scale on which
"connectivity loss" can be detected (this is why RFC 3539 timers are set
at a minimum of 6 seconds, but at a default value of 30 seconds).
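
To make the RFC 3539 numbers concrete, here is a small sketch of a watchdog interval with the 30 second default and the 6 second floor mentioned above. The jitter of plus or minus 2 seconds is how I read RFC 3539, and should be checked against the RFC. The function name watchdog_interval and the constant names are hypothetical, and this is not a proposal for how SHIM6 should set its own timers.

```python
import random

TW_DEFAULT = 30.0  # seconds, default watchdog interval
TW_MIN = 6.0       # seconds, lowest permitted configured value
TW_JITTER = 2.0    # seconds, assumed +/- jitter per my reading of RFC 3539


def watchdog_interval(tw_init=TW_DEFAULT, rng=random):
    """Return one watchdog interval: configured value plus uniform jitter.

    Rejects configured values below the 6 second floor, so that a
    failure is never declared on a shorter time scale than that.
    """
    if tw_init < TW_MIN:
        raise ValueError("watchdog interval must be at least 6 seconds")
    return tw_init + rng.uniform(-TW_JITTER, TW_JITTER)
```

With the default, each interval falls between 28 and 32 seconds, which is comfortably longer than the route flaps described above.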