
Re: Comments on draft-ietf-shim6-failure-detection

To: "Bernard Aboba" <aboba@internaut.com>
Subject: Re: Comments on draft-ietf-shim6-failure-detection
From: "Spencer Dawkins" <spencer@mcsr-labs.org>
Date: Mon, 31 Oct 2005 10:03:49 -0600
Cc: "shim6" <shim6@psg.com>
References: <4365B3D9.50803@piuha.net>

Hi, Bernard,

Thank you for providing these comments (and thanks, Jari, for forwarding
them).

A couple of thoughts inline,

Spencer

Here are some comments from Bernard Aboba. These relate to
the failure and reachability detection and perhaps also more
generally the division of work between the shim and
other protocol layers. I wasn't quite sure what to say
in response, so with Bernard's permission, I'm posting
the comments here:

----

I read the SHIM6 failure detection documents, and had some comments
relating to the time scale of failure detection.  I think there are some
issues with respect to "conservation of packets" that are worth exploring.


Section 5

   Also, it
   would be unfortunate if both the IP layer and transport/application
   layer took action for the same problem, for instance by switching to
   a new address at the IP layer and throttling back due to "congestion"
   at the transport layer.

This is not necessarily undesirable.  If the path over which a TCP
connection travels changes, the transport parameters may have become
invalid.  In such a situation, studies have shown that re-estimation may
actually improve performance, as compared to continuing to operate with
potentially invalid values.

Two points here - first, I am interested in pointers to these studies,
because I'd like to understand the issues a lot better than I do, and,
second, even if this is true, it's starting to feel like we're moving at
least some transport functionality (liveness detection, for instance) into
the network layer, which (as you point out) lives in a different timescale
(theoretically we give link layers time to react, and then we give routing
time to react, and THEN TCP starts thinking about retransmission timeouts,
ignoring fast retransmit for now).

This concerns me, because the TRIGTRAN discussions proved fairly intractable
given a first-hop path change, and TCP is still TCP, so I'm not sure why
SHIM6 will end up at a different place if we try to pass a lot of clues back
and forth between TCP and IP. The only signal both ends of a TCP connection
share is loss, which both ends have to track anyway, so we got a lot of
non-interest in adding more stuff that TCP implementations had to track in
addition to loss.

I would therefore argue that the important issue is not action
in multiple layers, but rather the avoidance of race conditions; a
well-defined communication mechanism between the IP and
transport/application layers can help with this.

I agree here, with the caveat that it's challenging to know that we've
avoided race conditions when they involve at least one protocol with
adaptive timers.

   But it is less clear which protocol(s) should discover end-to-end
   connectivity problems or recover from them.  One answer is that this
   is clearly within the domain of multihoming protocol.  By performing
   testing and failure detection of the used path and switching to a new
   path if necessary, the transport and application protocols can work
   unchanged.

I am not clear that the "multi-homing protocol" necessarily has the right
information to do testing and failure detection correctly.

For example, it does not make sense to diagnose a "connectivity problem"
on a time scale less than RTO.  Yet only the transport layer typically
possesses the RTO estimate.
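
To make the RTO point concrete, here is a minimal sketch of the standard
TCP retransmission-timeout estimator (the algorithm in RFC 2988, later
republished as RFC 6298). The class and function names and the clock
granularity value are assumptions made for illustration; nothing here is
taken from the SHIM6 drafts. It shows that the estimate is built from
per-connection RTT samples, which is state a layer below the transport
would not normally have.

```python
from dataclasses import dataclass
from typing import Optional

ALPHA = 1 / 8            # SRTT smoothing gain (RFC 6298)
BETA = 1 / 4             # RTTVAR smoothing gain (RFC 6298)
K = 4                    # variance multiplier (RFC 6298)
MIN_RTO = 1.0            # seconds, lower bound recommended by RFC 6298
CLOCK_GRANULARITY = 0.1  # seconds; assumed value for this sketch


@dataclass
class RtoEstimator:
    """Hypothetical holder for the per-connection RTO state."""
    srtt: Optional[float] = None
    rttvar: Optional[float] = None
    rto: float = 1.0  # initial RTO per RFC 6298 (RFC 2988 used 3 seconds)

    def on_rtt_sample(self, r: float) -> float:
        """Fold one RTT measurement (seconds) into the estimate."""
        if self.srtt is None:
            # First measurement initializes both state variables.
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR is updated before SRTT, using the old SRTT.
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - r)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * r
        self.rto = max(MIN_RTO,
                       self.srtt + max(CLOCK_GRANULARITY, K * self.rttvar))
        return self.rto


def silence_is_suspicious(silence_s: float, est: RtoEstimator) -> bool:
    """Treat silence as a possible failure only once it exceeds RTO."""
    return silence_s > est.rto
```

A detector below the transport layer would need either its own RTT
samples or an interface to read this value from the transport.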

Yes, exactly.

I can only add that I'm not sure where we are on one-way data streams (where
the receiver does not send ACKs, or sends them very infrequently). It's hard
for the IP layer to know whether "silence is OK", and if the
transport/application layer has to provide this information, I'm not sure
what value SHIM6 is adding.

We can explicitly say that we don't believe such things exist in the real
world, but it would be good if SHIM6 did not prevent these applications from
working if it DOES encounter them on live networks.

Similarly, if the cause of the connectivity loss is a route flap, then
only the routing layer might have knowledge of the loss of the route, and
only if it is participating in the routing mesh.  For example, in ad hoc
networks, missing routes are a frequent contributor to packet loss, so
that integration of the routing and transport layers is required to be
able to respond appropriately.

On a global scale, BGP route flaps can last for a few seconds (though
rarely longer than 30 seconds), suggesting a minimum time scale on which
"connectivity loss" can be detected (this is why RFC 3539 timers are set
at a minimum of 6 seconds, but at a default value of 30 seconds).
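
As a rough illustration of the time-scale argument, the sketch below
clamps a failure-detection timeout so that it never drops below either
the 6-second minimum or the transport's RTO, and falls back to the
30-second default when nothing is configured. The function and parameter
names are hypothetical, and combining the floor with RTO in this way is
an assumption of this sketch rather than something RFC 3539 or the SHIM6
drafts specify.

```python
from typing import Optional

MIN_DETECTION_S = 6.0       # minimum cited above for RFC 3539 timers
DEFAULT_DETECTION_S = 30.0  # default cited above for RFC 3539 timers


def detection_timeout(configured: Optional[float] = None,
                      rto: Optional[float] = None) -> float:
    """Return a failure-detection timeout in seconds.

    configured: operator-chosen value, or None to use the default.
    rto: current transport RTO estimate, if the transport exposes one.
    """
    base = DEFAULT_DETECTION_S if configured is None else configured
    # Never diagnose a failure faster than the minimum or the RTO.
    floor = MIN_DETECTION_S if rto is None else max(MIN_DETECTION_S, rto)
    return max(base, floor)
```

For example, a configured value of 2 seconds would be raised to 6, and a
configured value of 10 seconds would be raised to 12 if the transport
reported an RTO of 12 seconds.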
