
FW: Comments on draft-ietf-shim6-failure-detection




Here are some comments from Bernard Aboba. These relate to
the failure and reachability detection and perhaps also more
generally the division of work between the shim and
other protocol layers. I wasn't quite sure what to say
in response, so with Bernard's permission, I'm posting
the comments here:

----

I read the SHIM6 failure detection documents, and had some comments
relating to the time scale of failure detection.  I think there are some
issues with respect to "conservation of packets" that are worth exploring.


Section 5

   Also, it
   would be unfortunate if both the IP layer and transport/application
   layer took action for the same problem, for instance by switching to
   a new address at the IP layer and throttling back due to "congestion"
   at the transport layer.

This is not necessarily undesirable.  If the path over which a TCP
connection travels changes, the transport parameters may have become
invalid.  In such a situation, studies have shown that re-estimation
may actually improve performance, as compared to continuing to operate
with potentially invalid values.

I would therefore argue that the important issue is not action in
multiple layers, but rather the avoidance of race conditions; a
well-defined communication mechanism between the IP and
transport/application layers can help with this.
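
One way to picture such a mechanism is a serialized notification from
the shim to the transport, so that the transport re-estimates once per
path switch instead of racing with it.  The sketch below is purely
illustrative: ShimNotifier, TransportState, PathChange and the initial
values are hypothetical names and numbers, not anything defined by the
draft or by an existing stack.

```python
import threading
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class PathChange:
    """Hypothetical event the shim hands to upper layers."""
    old_locator: str
    new_locator: str


class ShimNotifier:
    """Hypothetical IP-layer hook that announces locator switches."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._listeners: List[Callable[[PathChange], None]] = []

    def subscribe(self, callback: Callable[[PathChange], None]) -> None:
        with self._lock:
            self._listeners.append(callback)

    def switch_path(self, old_locator: str, new_locator: str) -> PathChange:
        event = PathChange(old_locator, new_locator)
        # Listeners run under the lock, so each one sees path changes
        # one at a time and in order.
        with self._lock:
            for callback in self._listeners:
                callback(event)
        return event


class TransportState:
    """Toy per-connection state; the initial values are assumptions."""

    def __init__(self, initial_cwnd: int = 2, initial_rto: float = 3.0) -> None:
        self._initial_cwnd = initial_cwnd
        self._initial_rto = initial_rto
        self.cwnd = initial_cwnd
        self.srtt: Optional[float] = None
        self.rto = initial_rto

    def on_path_change(self, event: PathChange) -> None:
        # Estimates learned on the old path may be invalid on the new
        # one, so fall back to initial values and re-estimate.
        self.cwnd = self._initial_cwnd
        self.srtt = None
        self.rto = self._initial_rto


shim = ShimNotifier()
conn = TransportState()
shim.subscribe(conn.on_path_change)
conn.cwnd, conn.srtt, conn.rto = 40, 0.2, 0.6   # state from the old path
shim.switch_path("2001:db8:1::1", "2001:db8:2::1")
```

After the call to switch_path, the connection is back at its initial
window and has no RTT sample, which is the re-estimation behaviour
argued for above.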

   But it is less clear which protocol(s) should discover end-to-end
   connectivity problems or recover from them.  One answer is that this
   is clearly within the domain of multihoming protocol.  By performing
   testing and failure detection of the used path and switching to a new
   path if necessary, the transport and application protocols can work
   unchanged.

It is not clear to me that the "multi-homing protocol" necessarily has
the right information to do testing and failure detection correctly.

For example, it does not make sense to diagnose a "connectivity problem"
on a time scale shorter than the RTO.  Yet typically only the transport
layer possesses the RTO estimate.
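
To make the dependency concrete, here is a minimal sketch of the
standard TCP RTO computation (as specified in RFC 6298, which kept the
RFC 2988 algorithm) followed by the check a failure detector would need.
The class and function names, and the choice to pass the estimator into
the detector, are illustrative assumptions; the point is only that the
check cannot be made without state held by the transport.

```python
from typing import Optional


class RtoEstimator:
    """Smoothed RTT / RTT variance estimator in the style of RFC 6298."""

    ALPHA = 1 / 8   # gain for SRTT
    BETA = 1 / 4    # gain for RTTVAR
    K = 4           # variance multiplier

    def __init__(self, granularity: float = 0.001,
                 min_rto: float = 1.0, initial_rto: float = 1.0) -> None:
        self.srtt: Optional[float] = None
        self.rttvar: Optional[float] = None
        self.granularity = granularity
        self.min_rto = min_rto
        self.initial_rto = initial_rto

    def update(self, rtt: float) -> float:
        """Feed one RTT sample (seconds) and return the new RTO."""
        if self.srtt is None:
            # First measurement.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated using the old SRTT, then SRTT is updated.
            self.rttvar = ((1 - self.BETA) * self.rttvar
                           + self.BETA * abs(self.srtt - rtt))
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        return self.rto

    @property
    def rto(self) -> float:
        if self.srtt is None:
            return self.initial_rto
        raw = self.srtt + max(self.granularity, self.K * self.rttvar)
        return max(self.min_rto, raw)


def may_declare_failure(silence: float, estimator: RtoEstimator) -> bool:
    """True only once the path has been silent for at least one RTO."""
    return silence >= estimator.rto


est = RtoEstimator()
for sample in (0.10, 0.20, 0.15):
    est.update(sample)
print(may_declare_failure(0.5, est), may_declare_failure(1.5, est))
```

With the default one-second floor, half a second of silence is not yet
evidence of failure in this sketch, while a second and a half is.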

Similarly, if the cause of the connectivity loss is a route flap, then
only the routing layer might have knowledge of the loss of the route, and
only if it is participating in the routing mesh.  For example, in ad hoc
networks, missing routes are a frequent contributor to packet loss, so
integration of the routing and transport layers is required in order to
respond appropriately.

On a global scale, BGP route flaps can last for a few seconds (though
rarely longer than 30 seconds), suggesting a minimum time scale on which
"connectivity loss" can be detected (this is why RFC 3539 timers are set
at a minimum of 6 seconds, but at a default value of 30 seconds).
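
Read as configuration, those two figures are a floor and a default on
the detection interval.  A minimal sketch, assuming a hypothetical
helper named watchdog_interval (the name and the clamping policy are
illustrative assumptions, not text from RFC 3539):

```python
from typing import Optional

MIN_WATCHDOG_SECONDS = 6.0       # floor quoted above
DEFAULT_WATCHDOG_SECONDS = 30.0  # default quoted above


def watchdog_interval(configured: Optional[float] = None) -> float:
    """Return the interval to use, never going below the floor."""
    if configured is None:
        return DEFAULT_WATCHDOG_SECONDS
    return max(configured, MIN_WATCHDOG_SECONDS)


print(watchdog_interval(), watchdog_interval(2.0), watchdog_interval(45.0))
```

A detector that asks for a 2-second interval is held to 6 seconds here,
so that a short route flap is not mistaken for loss of connectivity.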

   One can also envision that applications would be able to tell the IP
   or transport layer that the current connection is unsatisfactory and
   an exploration for a better one would be desirable.  This would
   require an API to be developed, however.

The application layer does have the ability to diagnose connectivity
problems on the order of seconds, through keep-alives.  The IP layer
generally does not have the ability to detect whether a connection is
"satisfactory", since it does not have access to the transport layer's
Transmission Control Block (TCB).  It has only knowledge of potential
causes of connectivity problems (such as path changes or missing
routes), which it can provide to the transport layer or to applications.