
Re: failure detection




On 17/08/2005, at 21:38, Paul Jakma wrote:


IP is a best-effort protocol.

Reliability, etc. is a concern of the upper-layers (eg TCP).

Please explain why shim (which will, one hopes, look fairly much like existing IP layers) needs to reinvent functionality traditionally not provided by IP?


The SHIM wg is about providing multihoming support for IPv6. In particular, a solution for IPv6 multihoming must be able to preserve communications through outages in the communication path. Such functionality is provided in IPv4 through BGP features, but it requires injecting site routes into interdomain routing. In IPv6, PA addressing is used, so we need additional mechanisms (the shim) to provide equivalent functionality, in particular to preserve established communications through outages.


"Ah, but shim can make use of the fact that multiple locators could be published for an endpoint!"

is the likely answer. If so, explain:

1. Why this is a compelling argument given that it's been possible to publish multiple addresses in DNS for a long long time, yet there has been 0 demand for either applications to implement n^2 path-probing of each local address to every remote address, or for OSes to implement some kind of 'path-probe' shim to provide such functionality for all applications?


I am afraid you are missing our goal here. This is not a matter of opportunity but of how we can preserve established communications through outages. See above.


2. How this path-probing will interact with routing policy?

The local administrator may have local links of differing cost. In order to express policy he may do something like set the default route to go via the low-cost link. (The default gets changed by some mechanism unknown to us: routing policy, script, whatever.)

Along comes shim6, sending packets with every possible source it can find on the machine, as a consequence sending packets out of expensive links (eg dial-on-demand links, or the $LOTS/Mbyte link..)


With the shim, paths are closely tied to the addresses used; in particular, the exit paths of the multihomed site are determined by the source addresses used. So in order to provide this type of feature, source address selection has to be influenced, for instance using the RFC 3484 policy table.


Ie: The best way to honour local policy is to use INADDR_ANY and let the OS decide the source address by consulting local routing policy - alternatively, an administratively specified address. Why exactly is shim6 so different from everything else on the internet and special that this would not work for it?


This is exactly how the shim would support policy; see above.

3. The traditional way on the internet to guard against path failures is to get a routing feed (and no, that does *not* imply you advertise anything), why is shim6 so special that it can't defer to existing practice?


Scalability. Traditional IPv4 routing-based multihoming lacks it.

And reread 2 again :).

4. How will you decide which path is best?

Some apps may prefer high-bandwidth/slightly lossy links over low-bandwidth/no-loss links. Other apps completely the opposite. Once you start picking paths, how do you know what kind of path the application would prefer?


Local policy can be expressed to some degree with the policy table defined in RFC 3484. If finer-grained expression is needed (e.g. per application), additional parameters need to be included in the policy table.
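
As a rough illustration of the policy-table idea, here is a minimal Python sketch. The five default entries are the ones listed in RFC 3484; everything else is an assumption made for the example: the helper names `lookup` and `order_sources`, the documentation prefixes `2001:db8:a::/48` and `2001:db8:b::/48` standing in for a cheap and an expensive ISP, and the extra label 5. Only the "prefer matching label" rule is modelled, not the full rule list.

```python
import ipaddress

# Default policy table from RFC 3484: (prefix, precedence, label).
DEFAULT_POLICY_TABLE = [
    ("::1/128", 50, 0),
    ("::/0", 40, 1),
    ("2002::/16", 30, 2),
    ("::/96", 20, 3),
    ("::ffff:0:0/96", 10, 4),
]


def lookup(table, address):
    """Return (precedence, label) of the longest-prefix match for address."""
    addr = ipaddress.IPv6Address(address)
    best = None
    for prefix, precedence, label in table:
        net = ipaddress.IPv6Network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, precedence, label)
    if best is None:
        raise LookupError("no policy entry matches " + address)
    return best[1], best[2]


def order_sources(table, candidates, destination):
    """Sort candidate sources so those whose label matches the
    destination's label come first (a single rule, not the full list)."""
    _, dst_label = lookup(table, destination)
    return sorted(candidates, key=lambda src: lookup(table, src)[1] != dst_label)


# Hypothetical local policy: give the expensive ISP's prefix its own label,
# so it no longer matches the label of ordinary destinations.
table = DEFAULT_POLICY_TABLE + [("2001:db8:b::/48", 40, 5)]
sources = ["2001:db8:b::10", "2001:db8:a::10"]
print(order_sources(table, sources, "2001:db8:ffff::1"))
```

With this table the address from the cheap prefix is ordered first for ordinary destinations, which is the kind of preference the administrator in point 2 wants to express.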


If you simply guess, how will your guess be any better than a very simple mechanism, eg using INADDR_ANY as source and just picking the first locator that replies?

(And again, the traditional way for administrators to set policy on what source address is the best to use is via routing policy. See 2 yet again).

5. Given n^2 path-probing does not scale, and could be /very/ expensive in some situations (and generally introducing complexity), do you have statistics on the general frequency of path failures in the internet to justify this expense and complexity?


we seem to be assuming that multihoming support is something useful and that it will be needed in IPv6. This multihoming support seems to require communications to be preserved through outages.


Are there any statistics as to how many path failures are due to /local/ link failures? (which does not require n^2 path-probing to detect).

6. If path-probing really is desired, explain why this is shim6 specific?

Path exploration is a fundamental part of the shim protocol. Maybe it is not shim specific, and ideas from other similar protocols can be used, but it is imho a key part of the shim protocol and needs to be part of it.


Why could this not be done as part of a separate programme or protocol?

Some possibilities:

- software that does local path-probing to determine reachability of
  locally attached gateways (eg IPMP in Solaris for one)

this seems to be local, while shim is defined e2e

- BFD and some other protocols in development

B means bidirectional, and we are not assuming bidirectional paths here

- software that monitors some well-known paths and adjusts local
  routing to suit based on administratively defined metrics
- software that monitors the systems route-cache and does
  path-probing for destinations that currently see flows



Not sure what you mean by those, but in any case I am sure we can benefit from these designs, as well as from the others you mentioned, when designing the shim path exploration. If you are familiar with those, I am sure that your knowledge would be very useful in helping with the design of the path exploration protocol of the shim.



I could go on and on :).

Note that it's the probing using every single local address which bothers the most. Simple heartbeats and monitoring which set of locators are reachable and just picking one and sticking to it till you needed to switch, I could agree with.

AFAICT this is the approach being considered here or at least one of them. I mean, imho, we would only need to perform path exploration after an outage



I mean, I see probing as a last resort to confirm that an outage has occurred (and then as a tool to explore alternative paths before diverting the actual data packets).

Why exactly must this be considered as a part of shim6? This does not seem to be a shim6 specific thing at all, for a start off.



I fail to understand what you are missing. Failure detection and Path exploration are key components of the shim, and they are needed to preserve established communications through outages.


As per above: picking the right /remote/ locator *is* a shim6 job - i agree on that. It's the n^2 probing I want to ensure is /not/ considered for inclusion in shim6 RFCs, other than as something mentioned as a possible implementation detail.


Ok, i think i see now.
Your problem is with probing with different source locators, right?
Well, this is needed because the source address determines the exit path from the multihomed site. I mean, because we are assuming PA addressing, changing the source address results in using a different ISP in the multihomed site. That is why different source addresses need to be explored.
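
To make the cost being argued over concrete, here is a small Python sketch that enumerates the (source, destination) locator pairs a full exploration would have to consider. The function name `probe_pairs`, the `preferred_source` parameter and the addresses are all invented for the example; nothing here is taken from the shim6 drafts.

```python
from itertools import product


def probe_pairs(local_locators, remote_locators, preferred_source=None):
    """List every (source, destination) pair, preferred source first."""
    pairs = list(product(local_locators, remote_locators))
    # Stable sort: pairs using the preferred source move to the front,
    # everything else keeps its original order.
    pairs.sort(key=lambda pair: pair[0] != preferred_source)
    return pairs


local = ["2001:db8:a::10", "2001:db8:b::10"]  # one address per ISP
remote = ["2001:db8:c::1", "2001:db8:d::1", "2001:db8:e::1"]
pairs = probe_pairs(local, remote, preferred_source="2001:db8:b::10")
print(len(pairs))  # 6
print(pairs[0])
```

With n local and m remote locators there are n*m pairs. Keeping the source fixed, as with INADDR_ANY, reduces that to m, but then only one exit ISP is ever tried.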


It's the complexity of what's being proposed which I find wrong.

So, while on many occasions it may not be needed and can be skipped, I see probing as a fundamental part of the shim.

If you mean probing every combination of local and remote addresses for reachability, I really don't see how you could come to that conclusion.


There's many many years of existing deployment of IP using systems and applications that simply don't consider such complex probing worth it.


right, because they are not assuming the usage of multiple PA addresses in a single host


When you include multiple PA addresses in hosts within a multihomed site, then you find out that you need to try with different source addresses.


may agree with this, but imho it needs to be taken into account when discussing the present topics

Sure.

Well, I'd love to see discussion of the signalling formats for shim6 btw, rather than less immediately important talk of "how could we modify OS network stacks?" and "we could detect path failures and work around them in ways nothing ever before has considered worth doing" :) - and more importantly, I'd hate to see base shim6 specifications cluttered up with this kind of stuff (which likely won't be implemented, or won't be implemented soon in case of network stack signalling additions).

I know several of you (marcelo, iljitsch, at least) have been thinking about how to solve v6 multihoming for a /long/ time. I *know* you know how to do it.

The problem is, now that the end /is/ actually in sight (an actual IETF WG chartered to work on a /specific/ solution!), you've moved on to considering problems /past/ shim6.

So here's your endpoint locator algorithm:

for every potential locator address for a ULID

	send a control message to probe the locator including
	 sufficient information for the other side to setup the shim
	 on their side

wait for the minimum of PROBE_TIMEOUT seconds or until
  you get a reply

If you got a reply, it should have enough information to setup the shim, set it up and finish.


Sort of... you are still lacking DoS protection and locator security, but this is roughly what is being considered (with a couple of additional messages).


Otherwise signal failure to the ULP (eg the system's equivalent of POSIX ENETUNREACH)

That's it, very simple and implementable.
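
The algorithm above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `establish`, `send_probe` and `make_fake_network` are hypothetical names, the network is simulated with an in-memory queue, the value of PROBE_TIMEOUT is arbitrary, and the DoS protection and locator security raised in the reply are left out.

```python
import errno
import queue

PROBE_TIMEOUT = 0.2  # seconds; illustrative value only


def establish(locators, send_probe, replies, timeout=PROBE_TIMEOUT):
    """Probe every locator and return the first one that answers.

    Raises OSError(ENETUNREACH) if nothing answers within the timeout.
    """
    for locator in locators:
        send_probe(locator)
    try:
        # Wait for the first reply, or give up after the timeout.
        return replies.get(timeout=timeout)
    except queue.Empty:
        raise OSError(errno.ENETUNREACH, "no locator replied") from None


def make_fake_network(reachable, replies):
    """Stand-in for the real network: reachable locators answer at once."""
    def send_probe(locator):
        if locator in reachable:
            replies.put(locator)
    return send_probe


replies = queue.Queue()
send_probe = make_fake_network({"2001:db8:b::1"}, replies)
print(establish(["2001:db8:a::1", "2001:db8:b::1"], send_probe, replies))
```

A real implementation would carry the shim setup information in the probe and its reply; here the reply is reduced to the locator that answered.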

Additionally, if you define shim6 to include a regular heartbeat, you can monitor reachability. Include the locator's idea of its addresses too, and two cookie fields (one for each side).


Well, yes, but we are also considering quite a few optimizations for this, like ULP feedback and traffic monitoring.
But yes, this is along the lines of the failure detection mechanism being considered.


You can then detect:

- which locator addresses work; if one doesn't, mark it as unreachable.
	- if it's the current locator, just pick the next one which
	  is not known to be unreachable (and so on).


Path exploration is more complex than that, because you need to change the source address to change the exit ISP. Remember that we are assuming PA addresses, and they are only routed through one of the ISPs of the multihomed site.


- changes of locator on the /remote/ side
	- further, you can detect changes:
		- in advance (eg the locator can remove an address
	          in advance of it ceasing to accept packets on that
	          address, eg because of maintenance)
		- faster, eg the remote side may be monitoring its
	 	  local status, if it detects a change it can just
		  send a heartbeat immediately with the updated
		  locator addresses to use

etc..
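
A minimal Python sketch of the bookkeeping described in this list, tracking remote locators only. The class and method names are invented for the example, the cookie fields are omitted, and the source-address side of path exploration raised in the reply is deliberately not modelled.

```python
class PeerLocators:
    """Tracks which of a peer's locators are usable and which is current."""

    def __init__(self, locators):
        # Assumes at least one locator is known for the peer.
        self.locators = list(locators)
        self.unreachable = set()
        self.current = self.locators[0]

    def _pick(self):
        """Return the first locator not known to be unreachable, or None."""
        for locator in self.locators:
            if locator not in self.unreachable:
                return locator
        return None

    def mark_unreachable(self, locator):
        """Record a failed locator; switch away from it if it was in use."""
        self.unreachable.add(locator)
        if locator == self.current:
            self.current = self._pick()

    def on_heartbeat(self, advertised):
        """Adopt the locator list the peer advertised in its heartbeat."""
        self.locators = list(advertised)
        # Forget failure state for locators the peer no longer advertises.
        self.unreachable &= set(self.locators)
        if self.current not in self.locators or self.current in self.unreachable:
            self.current = self._pick()


peer = PeerLocators(["2001:db8:a::1", "2001:db8:b::1", "2001:db8:c::1"])
peer.mark_unreachable("2001:db8:a::1")  # current locator fails
print(peer.current)
peer.on_heartbeat(["2001:db8:c::1"])    # peer withdraws the other locators
print(peer.current)
```

Sticking with the current locator until it fails or is withdrawn matches the "pick one and stick to it till you need to switch" behaviour described earlier in the thread.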

<damn, i feel a draft coming on - are there better tools than opening a text editor?>

That's the kind of talk i want to see, about the actual nuts and bolts of what is needed for shim6 to work - less "pie in the sky" stuff. :)


I think that most of this stuff is included in the current drafts; it is just that additional complexity is considered, for instance security, unidirectional path support, and considerations about the constraints imposed by the use of multiple PA addresses in the multihomed site and by ingress filtering.


regards, marcelo



regards, -- Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A Fortune: Our POP server was kidnapped by a weasel.