
Re: failure detection

To: marcelo bagnulo braun <marcelo@it.uc3m.es>
Subject: Re: failure detection
From: Paul Jakma <paul@clubi.ie>
Date: Wed, 17 Aug 2005 20:38:18 +0100 (IST)
Cc: shim6 <shim6@psg.com>
In-reply-to: <efa6464a563345cc24542d6ab48f3538@it.uc3m.es>
Mail-copies-to: paul@hibernia.jakma.org
Mail-followup-to: paul@hibernia.jakma.org
References: <8622E6A4-B0D7-4C9B-B184-8EB2A7C2738E@muada.com> <Pine.LNX.4.63.0508141523170.7023@sheen.jakma.org> <efebcb5728efd81901d5357b3993b6db@it.uc3m.es> <Pine.LNX.4.63.0508171556080.5353@sheen.jakma.org> <efa6464a563345cc24542d6ab48f3538@it.uc3m.es>

On Wed, 17 Aug 2005, marcelo bagnulo braun wrote:

On 17/08/2005, at 18:08, Paul Jakma wrote:

I guess i agree: since we cannot be sure that all ULPs will provide such feedback, we cannot base the shim failure detection on the existence of such mechanisms, as that would limit the shim's applicability to only those ULPs


Indeed.

The path-probing is, imho, mostly a complete waste of time. But an implementation can go wild if it wants.

i am not sure what you mean by this... i mean, if you don't have a ULP that provides feedback, how can you be sure that the other end is reachable?


IP is a best-effort protocol.

Reliability, etc. is a concern of the upper-layers (eg TCP).

Please explain why shim (which will, one hopes, look fairly much like existing IP layers) needs to reinvent functionality traditionally not provided by IP?

"Ah, but shim can make use of the fact that multiple locators could be published for an endpoint!"

is the likely answer. If so, explain:

1. Why this is a compelling argument given that it's been possible to publish multiple addresses in DNS for a long long time, yet there has been 0 demand for either applications to implement n^2 path-probing of each local address to every remote address, or for OSes to implement some kind of 'path-probe' shim to provide such functionality for all applications?

2. How this path-probing will interact with routing policy?

The local administrator may have local links of differing cost. In order to express policy he may do something like set the default route to go via the low-cost link. (the default gets changed by some mechanism unknown to us: routing policy, script, whatever)

Along comes shim6, sending packets with every possible source it can find on the machine, as a consequence sending packets out of expensive links (eg dial-on-demand links, or the $LOTS/Mbyte link..)

Ie: The best way to honour local policy is to use INADDR_ANY and let the OS decide the source address by consulting local routing policy - alternatively, an administratively specified address. Why exactly is shim6 so different from everything else on the internet and special that this would not work for it?

3. The traditional way on the internet to guard against path failures is to get a routing feed (and no, that does *not* imply you advertise anything), why is shim6 so special that it can't defer to existing practice?

And reread 2 again :).

4. How will you decide which path is best?

Some apps may prefer high-bandwidth/slightly lossy links over low-bandwidth/no-loss links. Other apps completely the opposite. Once you start picking paths, how do you know what kind of path the application would prefer?

If you simply guess, how will your guess be any better than a very simple mechanism, eg using INADDR_ANY as source and just picking the first locator that replies?

(And again, the traditional way for administrators to set policy on what source address is the best to use is via routing policy. See 2 yet again).

5. Given n^2 path-probing does not scale, could be /very/ expensive in some situations, and generally introduces complexity, do you have statistics on the general reliability of paths in the internet to justify this expense and complexity?

Are there any statistics as to how many path failures are due to /local/ link failures (which do not require n^2 path-probing to detect)?

6. If path-probing really is desired, explain why this is shim6 specific? Why could this not be done as part of a separate programme or protocol?

Some possibilities:

- software that does local path-probing to determine reachability of
  locally attached gateways (eg IPMP in Solaris for one)
- BFD and some other protocols in development
- software that monitors some well-known paths and adjusts local
  routing to suit based on administratively defined metrics
- software that monitors the system's route-cache and does
  path-probing for destinations that currently see flows

I could go on and on :).
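
To illustrate point 2 above, here is a minimal Python sketch of letting the OS pick the source address instead of iterating over local addresses. The helper name `os_chosen_source` and the default port are hypothetical, chosen only for illustration; the socket is deliberately left unbound (the INADDR_ANY case) so the kernel consults local routing policy.

```python
import socket


def os_chosen_source(dest, port=9, family=socket.AF_INET6):
    """Return the source address the OS routing policy selects for `dest`.

    Hypothetical helper: name and default port are assumptions.
    """
    with socket.socket(family, socket.SOCK_DGRAM) as s:
        # No bind(): the socket stays on the wildcard address, so the
        # kernel chooses the source according to local routing policy.
        # connect() on a UDP socket sends no packet; it only fixes the
        # route and the local address for this destination.
        s.connect((dest, port))
        return s.getsockname()[0]
```

Calling it with a remote locator shows which local address the administrator's routing policy would actually use, with no traffic sent down the expensive links.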

Note that it's the probing using every single local address which bothers me the most. Simple heartbeats and monitoring which set of locators are reachable and just picking one and sticking to it until you need to switch, I could agree with.
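
As a rough illustration of the difference in cost, the following Python sketch counts probes for the two approaches. The function name `probe_counts` is made up for this example; it assumes one probe per (local address, remote locator) pair in the full-mesh case and one probe per remote locator when the OS picks the source.

```python
def probe_counts(local_addrs, remote_locators):
    """Return (full_mesh, remote_only) probe counts for one peer."""
    # Full mesh: every local address is tried against every remote locator.
    full_mesh = len(local_addrs) * len(remote_locators)
    # Remote only: one probe per remote locator, source left to the OS.
    remote_only = len(remote_locators)
    return full_mesh, remote_only
```

With 3 local addresses and 4 remote locators that is 12 probes against 4, per peer, and the gap grows with every address added on either side.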

i mean, i see probing as a last resort to confirm that an outage has occurred (and then a tool to explore alternative paths before diverting the actual data packets)

Why exactly must this be considered as a part of shim6? This does not seem to be a shim6 specific thing at all, for a start off.

As per above: picking the right /remote/ locator *is* a shim6 job - i agree on that. It's the n^2 probing I want to ensure is /not/ considered for inclusion in shim6 RFCs, other than as something mentioned as a possible implementation detail.

It's the complexity of what's being proposed which I find wrong.

so, while on many occasions it may not be needed and can be skipped, i see probing as a fundamental part of the shim

If you mean probing every combination of local and remote addresses for reachability, I really don't see how you could come to that conclusion.

There are many, many years of existing deployment of IP-using systems and applications that simply don't consider such complex probing worth it.

may agree with this, but imho it needs to be taken into account when discussing the present topics


Sure.

Well, I'd love to see discussion of the signalling formats for shim6 btw, rather than less immediately important talk of "how could we modify OS network stacks?" and "we could detect path failures and work around them in ways nothing ever before has considered worth doing" :) - and more importantly, I'd hate to see base shim6 specifications cluttered up with this kind of stuff (which likely won't be implemented, or won't be implemented soon in the case of network stack signalling additions).

I know several of you (marcelo, iljitsch, at least) have been thinking about how to solve v6 multihoming for a /long/ time. I *know* you know how to do it.

The problem is, now that the end /is/ actually in sight (an actual IETF WG chartered to work on a /specific/ solution!), you've moved on to considering problems /past/ shim6.

So here's your endpoint locator algorithm:

for every potential locator address for a ULID

	send a control message to probe the locator, including
	  sufficient information for the other side to set up the shim
	  on their side

wait for the minimum of PROBE_TIMEOUT seconds or until
  you get a reply

If you got a reply, it should have enough information to set up the shim; set it up and finish.

Otherwise signal failure to the ULP (eg the system's equivalent of POSIX ENETUNREACH)

That's it, very simple and implementable.
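
For the curious, the algorithm above might be sketched in Python roughly as follows. This is only an illustration under assumptions: `send_probe` is a hypothetical transport hook (no real shim6 API is implied), the `PROBE_TIMEOUT` value is arbitrary, and `fake_send_probe` merely simulates peers so the sketch can be run.

```python
import errno
import queue
import threading

PROBE_TIMEOUT = 2.0  # seconds; assumed value, not specified in the post


def establish_shim(locators, send_probe, timeout=PROBE_TIMEOUT):
    """Probe every candidate locator and use the first one that replies.

    send_probe(locator, on_reply) is a hypothetical hook: it sends a
    control message carrying the shim setup information and later calls
    on_reply(locator, reply) if the peer answers.
    """
    replies = queue.Queue()
    for locator in locators:
        send_probe(locator, lambda loc, reply: replies.put((loc, reply)))
    try:
        # Wait for the first reply, but never longer than the timeout.
        return replies.get(timeout=timeout)
    except queue.Empty:
        # Nobody answered: signal failure to the upper-layer protocol.
        raise OSError(errno.ENETUNREACH, "no locator replied") from None


def fake_send_probe(reachable, delay=0.05):
    """Simulated transport in which only `reachable` locators answer."""
    def send_probe(locator, on_reply):
        if locator in reachable:
            timer = threading.Timer(delay, on_reply, args=(locator, {"ok": True}))
            timer.start()
    return send_probe
```

Usage: `establish_shim(["2001:db8::1", "2001:db8::2"], fake_send_probe({"2001:db8::2"}))` returns the answering locator together with its reply, and raises `OSError` with `ENETUNREACH` if nothing answers in time.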

Additionally, if you define shim6 to include a regular heartbeat, you can monitor reachability. Include the locator's idea of its addresses too, and two cookie fields (one for each side).

You can then detect:

- which locator addresses work; if one doesn't, mark it as unreachable.
  - if it's the current locator, just pick the next one which is not
    known to be unreachable (and so on).

- changes of locator on the /remote/ side
  - further, you can detect changes:
    - in advance (eg the locator can remove an address
      in advance of it ceasing to accept packets on that
      address, eg because of maintenance)
    - faster, eg the remote side may be monitoring its
      local status; if it detects a change it can just
      send a heartbeat immediately with the updated
      locator addresses to use

etc..
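
A rough Python sketch of the heartbeat bookkeeping described above, under stated assumptions: the class name `LocatorSet`, its methods and the `HEARTBEAT_TIMEOUT` value are all invented for illustration, heartbeats are assumed to arrive via the locator currently in use, and the two cookie fields are left out for brevity.

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before giving up on a locator (assumed)


class LocatorSet:
    """Tracks which of a peer's locators is in use and which have failed."""

    def __init__(self, locators, now=time.monotonic):
        self._now = now
        self.locators = list(locators)
        self.unreachable = set()
        self.current = self.locators[0]
        self.last_heard = now()

    def on_heartbeat(self, advertised):
        """Heartbeat received: adopt the peer's own idea of its addresses."""
        self.last_heard = self._now()
        self.locators = list(advertised)
        self.unreachable &= set(advertised)  # forget withdrawn addresses
        self.unreachable.discard(self.current)
        self._reselect()

    def check(self):
        """Call periodically: give up on the current locator if it is silent."""
        silent_for = self._now() - self.last_heard
        if self.current is not None and silent_for > HEARTBEAT_TIMEOUT:
            self.unreachable.add(self.current)
            self._reselect()

    def _reselect(self):
        # Stick with the current locator while it is still valid.
        if self.current in self.locators and self.current not in self.unreachable:
            return
        # Otherwise pick the next one not known to be unreachable.
        candidates = (loc for loc in self.locators if loc not in self.unreachable)
        self.current = next(candidates, None)
        self.last_heard = self._now()  # give the new locator a full window
```

The point of the sketch is that only the /remote/ locator list is tracked: there is no per-local-address state and no probing of address pairs.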

<damn, i feel a draft coming on - are there better tools than opening a text editor?>

That's the kind of talk i want to see, about the actual nuts and bolts of what is needed for shim6 to work - less "pie in the sky" stuff. :)

regards, marcelo


regards,
--
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
Our POP server was kidnapped by a weasel.

