Re: failure detection
On Wed, 17 Aug 2005, marcelo bagnulo braun wrote:
El 17/08/2005, a las 18:08, Paul Jakma escribió:
I guess i agree, since we cannot be sure that all ULPs will provide
such feedback, we cannot base the shim failure detection on the
existence of such mechanisms, since it would result in limiting the
shim applicability to only those ULPs
Indeed.
The path-probing is, imho, mostly a complete waste of time. But an
implementation can go wild if it wants.
i am not sure what do you mean by this... i mean, if you don't have
a ULP that provides feedback, how can you be sure that the other
end is reachable?
IP is a best-effort protocol.
Reliability, etc. is a concern of the upper-layers (eg TCP).
Please explain why shim (which will, one hopes, look fairly much like
existing IP layers) needs to reinvent functionality traditionally
not provided by IP?
"Ah, but shim can make use of the fact that multiple locators could
be published for an endpoint!"
is the likely answer. If so, explain:
1. Why this is a compelling argument given that it's been possible to
publish multiple addresses in DNS for a long long time, yet there has
been 0 demand for either applications to implement n^2 path-probing
of each local address to every remote address, or for OSes to
implement some kind of 'path-probe' shim to provide such
functionality for all applications?
2. How will this path-probing interact with routing policy?
The local administrator may have local links of differing cost. In order
to express policy he may do something like set the default route to
go via the low-cost link. (the default gets changed by some mechanism
unknown to us: routing policy, script, whatever)
Along comes shim6, sending packets with every possible source it can
find on the machine, as a consequence sending packets out of
expensive links (eg dial-on-demand links, or the $LOTS/Mbyte link..)
Ie: The best way to honour local policy is to use INADDR_ANY and let
the OS decide the source address by consulting local routing policy -
alternatively, an administratively specified address. Why exactly is
shim6 so different from everything else on the internet and special
that this would not work for it?
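
A minimal sketch of "let the OS decide", assuming a POSIX-style sockets API as exposed by Python. The helper name os_chosen_source is made up for illustration. The socket is never bound, so it stays on the wildcard address and the kernel picks the source from its routing table when connect() is called:

```python
import socket

def os_chosen_source(dest_host, dest_port, family=socket.AF_INET):
    """Return the source address the local routing policy selects.

    Hypothetical helper: no bind() is issued, so the socket stays on the
    wildcard address (INADDR_ANY, or the IPv6 equivalent) until connect().
    """
    with socket.socket(family, socket.SOCK_DGRAM) as s:
        # connect() on a UDP socket sends no packet; it only makes the
        # kernel resolve the route and fix the local address.
        s.connect((dest_host, dest_port))
        return s.getsockname()[0]
```

Passing family=socket.AF_INET6 gives the same behaviour for v6 destinations; the point is only that source selection is delegated to whatever policy the administrator has already configured.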
3. The traditional way on the internet to guard against path failures
is to get a routing feed (and no, that does *not* imply you advertise
anything), why is shim6 so special that it can't defer to existing
practice?
And reread 2 again :).
4. How will you decide which path is best?
Some apps may prefer high-bandwidth/slightly lossy links over
low-bandwidth/no-loss links. Other apps completely the opposite. Once
you start picking paths, how do you know what kind of path the
application would prefer?
If you simply guess, how will your guess be any better than a very
simple mechanism, eg using INADDR_ANY as source and just picking the
first locator that replies?
(And again, the traditional way for administrators to set policy on
what source address is the best to use is via routing policy. See 2
yet again).
5. Given n^2 path-probing does not scale, and could be /very/
expensive in some situations (and generally introduces complexity),
do you have statistics on how often path failures actually occur in
the internet to justify this expense and complexity?
Are there any statistics as to how many path failures are due to
/local/ link failures? (which does not require n^2 path-probing to
detect).
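
To put a number on the scaling concern, here is a small sketch that enumerates the (local, remote) pairs a full probe would have to cover. The function name and the documentation-prefix addresses in the usage note are illustrative assumptions only:

```python
from itertools import product

def probe_pairs(local_addrs, remote_addrs):
    """List every (source, destination) pair that full path-probing covers."""
    # One probe per combination: len(local_addrs) * len(remote_addrs) in total.
    return list(product(local_addrs, remote_addrs))
```

With 2 local and 3 remote addresses (say 2001:db8:a::1 and 2001:db8:b::1 against three remote locators) that is already 6 probes per peer, per probing round, some of them possibly leaving via links the administrator never wanted used.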
6. If path-probing really is desired, explain why this is shim6
specific? Why could this not be done as part of a separate programme
or protocol?
Some possibilities:
- software that does local path-probing to determine reachability of
locally attached gateways (eg IPMP in Solaris for one)
- BFD and some other protocols in development
- software that monitors some well-known paths and adjusts local
routing to suit based on administratively defined metrics
- software that monitors the systems route-cache and does
path-probing for destinations that currently see flows
I could go on and on :).
Note that it's the probing using every single local address which
bothers me the most. Simple heartbeats and monitoring which set of
locators are reachable and just picking one and sticking to it till
you needed to switch, I could agree with.
i mean, i see probing as a last resort to confirm that an outage
has occurred (and then a tool to explore alternative paths before
diverting the actual data packets)
Why exactly must this be considered as a part of shim6? This does not
seem to be a shim6 specific thing at all, for a start off.
As per above: picking the right /remote/ locator *is* a shim6 job - i
agree on that. It's the n^2 probing I want to ensure is /not/
considered for inclusion in shim6 RFCs, other than as something
mentioned as a possible implementation detail.
It's the complexity of what's being proposed which I find wrong.
so, while in multiple occasions it may not be needed and can be
skipped, i see probing as a fundamental part of the shim
If you mean probing every combination of local and remote addresses
for reachability, I really don't see how you could come to that
conclusion.
There are many, many years of existing deployment of IP-using systems
and applications that simply don't consider such complex probing
worth it.
may agree with this, but imho it needs to be taken into account when
discussing the present topics
Sure.
Well, I'd love to see discussion of the signalling formats for shim6
btw, rather than less immediately important talk of "how could we
modify OS network stacks?" and "we could detect path failures and
work around them in ways nothing ever before has considered worth
doing" :) - and more importantly, I'd hate to see base shim6
specifications cluttered up with this kind of stuff (which likely
won't be implemented, or won't be implemented soon in case of network
stack signalling additions).
I know several of you (marcelo, iljitsch, at least) have been
thinking about how to solve v6 multihoming for a /long/ time. I
*know* you know how to do it.
The problem is, now that the end /is/ actually in sight (an actual
IETF WG chartered to work on a /specific/ solution!), you've moved on
to considering problems /past/ shim6.
So here's your endpoint locator algorithm:

  for every potential locator address for a ULID
    send a control message to probe the locator including
      sufficient information for the other side to set up the shim
      on their side
    wait for the minimum of PROBE_TIMEOUT seconds or until
      you get a reply
    If you got a reply, it should have enough information to set up the
      shim, set it up and finish.
  Otherwise signal failure to the ULP (eg the system's equivalent of
    POSIX ENETUNREACH)

That's it, very simple and implementable.
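
The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: send_probe is a hypothetical callable standing in for "send the control message and wait", the PROBE_TIMEOUT value is a placeholder, and nothing here reflects an actual shim6 message format:

```python
import errno

PROBE_TIMEOUT = 2.0  # seconds; placeholder value, not taken from any spec

def establish_shim(ulid, locators, send_probe, timeout=PROBE_TIMEOUT):
    """Probe candidate locators in order; return (locator, reply) on success.

    send_probe(locator, ulid, timeout) is a hypothetical callable that sends
    the control message and returns the peer's reply, or None if nothing
    arrived within `timeout` seconds.
    """
    for locator in locators:
        reply = send_probe(locator, ulid, timeout)
        if reply is not None:
            # The reply is assumed to carry what is needed to set up the shim.
            return locator, reply
    # No locator answered: signal failure to the ULP.
    raise OSError(errno.ENETUNREACH, "no reachable locator for ULID %s" % ulid)
```

Note that only the remote locators are iterated; the source address is left to the OS, so the cost is linear in the number of remote locators rather than n^2.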
Additionally, if you define shim6 to include a regular heartbeat, you
can monitor reachability. Include the locator's idea of its addresses
too, and two cookie fields (one for each side).
You can then detect:
- which locator addresses work; if one doesn't, mark it as
  unreachable.
  - if it's the current locator, just pick the next one which
    is not known to be unreachable (and so on).
- changes of locator on the /remote/ side
  - further, you can detect changes:
    - in advance (eg the locator can remove an address
      in advance of it ceasing to accept packets on that
      address, eg because of maintenance)
    - faster, eg the remote side may be monitoring its
      local status, if it detects a change it can just
      send a heartbeat immediately with the updated
      locator addresses to use
etc..
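
A rough sketch of the bookkeeping this implies, assuming heartbeats deliver the peer's current list of addresses. The class and method names are invented for illustration, and the cookie fields are left out entirely:

```python
class LocatorSet:
    """Tracks which remote locators are believed usable (illustrative only)."""

    def __init__(self, locators):
        self.locators = list(locators)   # kept in preference order
        self.unreachable = set()
        self.current = self.locators[0] if self.locators else None

    def mark_unreachable(self, locator):
        """Record a failed locator; switch away if it was the one in use."""
        self.unreachable.add(locator)
        if locator == self.current:
            self.current = self._next_usable()

    def on_heartbeat(self, advertised):
        """Adopt the peer's own view of its addresses from a heartbeat."""
        self.locators = list(advertised)
        self.unreachable &= set(self.locators)  # forget withdrawn addresses
        if self.current not in self.locators or self.current in self.unreachable:
            self.current = self._next_usable()

    def _next_usable(self):
        """First locator not known to be unreachable, or None."""
        for loc in self.locators:
            if loc not in self.unreachable:
                return loc
        return None
```

A peer withdrawing an address ahead of maintenance is then just a heartbeat whose list no longer contains it; the receiver moves to the next usable locator without ever having seen a failure.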
<damn, i feel a draft coming on - are there better tools than opening
a text editor?>
That's the kind of talk i want to see, about the actual nuts and
bolts of what is needed for shim6 to work - less "pie in the sky"
stuff. :)
regards, marcelo
regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
Fortune:
Our POP server was kidnapped by a weasel.