Re: failure detection
On Fri, 19 Aug 2005, marcelo bagnulo braun wrote:
Why must host1 detect this? Host2 could also ;).
not in a unidirectional connectivity scenario
consider the case where the failure implies that:
PrefA:Host1 -> Host2 is not working
PrefB:Host1 -> Host2 is working
Host2 -> PrefA:Host1 is working
Host2 -> PrefB:Host1 is not working
How would you cope with this case?
How important is this case?
Further, in your scenario, this was due to a local-failure near
Host1. A failure which can easily be detected locally without any
need for n^2 probing.
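
For reference, the scenario above can be written down as a small reachability table. This is only an illustrative sketch: the table layout and the function names are made up here, and nothing in it is shim6 machinery.

```python
# Hypothetical model of the scenario: True means packets get through.
# Keys are (direction, Host1 locator involved); all names are illustrative.
REACHABLE = {
    ("host1->host2", "PrefA"): False,  # Host1 sending from its PrefA address
    ("host1->host2", "PrefB"): True,   # Host1 sending from its PrefB address
    ("host2->host1", "PrefA"): True,   # Host2 sending to Host1's PrefA address
    ("host2->host1", "PrefB"): False,  # Host2 sending to Host1's PrefB address
}


def working_locators(direction):
    """Return the Host1 locators that work in the given direction, sorted."""
    return sorted(
        loc for (d, loc), ok in REACHABLE.items() if d == direction and ok
    )


def bidirectional_locators():
    """Return the Host1 locators that work in both directions at once."""
    forward = set(working_locators("host1->host2"))
    reverse = set(working_locators("host2->host1"))
    return sorted(forward & reverse)


print(working_locators("host1->host2"))  # ['PrefB']
print(working_locators("host2->host1"))  # ['PrefA']
print(bidirectional_locators())          # []
```

No single Host1 locator works in both directions here, which is what makes this case awkward for any scheme that treats a path as simply up or down.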
What's needed is:
- Host1 to detect the local failure and update the exit path to use
(and hence the source to use)
- this is achievable in multiple ways
- none of which need be in shim6
- none of which require shim6 to be aware of SAS or egress
issues
- Host2 shim6 to detect host1's valid locators have changed
- Maybe because it receives a packet from Host1 with a new
source
- Maybe because Host2's reachability probes detect PrefB
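
A rough sketch of the Host2 side of the list above, assuming Host2 keeps a set of locators it already knows for Host1. The class and method names are hypothetical, and the sketch follows the assumption made in this mail that failures are typically bi-directional, so a packet arriving from a known locator is taken as a hint that the locator is usable.

```python
class PeerLocators:
    """Hypothetical per-peer state kept by Host2 (not taken from any draft)."""

    def __init__(self, known, current):
        self.known = set(known)            # locators the peer is assumed to own
        self.current = current             # locator Host2 currently sends to
        self.alive = {loc: True for loc in self.known}

    def on_packet(self, source):
        """Switch to the source of an incoming packet if it is a known locator."""
        if source not in self.known:
            return False                   # ignore sources the peer never listed
        self.alive[source] = True
        if source != self.current:
            self.current = source
            return True
        return False

    def on_probe(self, locator, reachable):
        """Record a probe result; leave the current locator only if it is dead."""
        if locator not in self.known:
            return False
        self.alive[locator] = reachable
        if not self.alive[self.current]:
            candidates = sorted(l for l in self.known if self.alive[l])
            if candidates:
                self.current = candidates[0]
                return True
        return False


peer = PeerLocators(["PrefA", "PrefB"], current="PrefA")
peer.on_packet("PrefB")   # Host1 started sending from PrefB
print(peer.current)       # PrefB
```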
How common is this failure mode?
You want to specify that shim6 be able to work around /any/ kind of
routing failure, anywhere on any part of the internet affecting any
path between Host1 and Host2.
My gut feelings though are:
- Failures typically are near the edges
- Failures are typically bi-directional for a given path
- Uni-directional failures tend to be due to /congestion/, not
actual failures - again, typically at the edges. Congestion related
"failures" tend to be very transient/sporadic.
- Failures in the 'middle' are uncommon, and tend to affect /huge/
numbers of paths (ie there's a decent chance it will take out /all/
your paths)
- The problem of uni-directional failure on two /unrelated/ paths at
the same time is *tiny*
Hence (as a gut feeling):
- n^2 probing in shim6 is simply introducing huge expense in order to
solve a very uncommon problem
You think the tradeoff in order to achieve perfection is worth it.
I don't, I think the above is a general quality-of-internet-routing
problem. I think it's something that should and will be tackled
within the routing area, where people have been and are continuing
to work on optimising routing protocols (from OSPF to BGP) to cope
gracefully with failures and restarts in order to eliminate some
common scenarios where routing-loops can occur in today's routing
protocols.
I don't see a compelling reason to consider problems in internet
routing to be something shim6 needs to introduce great complexity for
in order to work around, when a simple approach (let underlying OS
routing pick the local prefix) will likely allow 99% of failures to
be detectable and worked around.
I'd like to see more information on path failure modes seen on the
internet, what is common, what is not, before I would change my
position.
this last point is influencing SAS and trying with alternative
source locators, which is basically what we are considering here
and that you are opposing right?
Correct.
As I stated later in that mail, you can achieve nearly the same thing
by simply relying on external means to shim6 to update which egress
interface/address the OS uses.
My position is that "nearly the same thing" is more than good enough,
particularly when it could remove a great amount of complexity
from shim6.
Note that such external mechanisms would *not* be precluded from
doing n^2 probing, or any other fancy scheme they want to implement.
(I wouldn't care to implement it, but..). Ie I'm not arguing shim6
should preclude such probing, only that the base spec should assume
simple mechanisms and provide the protocol tools to allow more
complex probing (eg a 'PROBE' message or somesuch).
So weirder, more complex (and unneeded imho) probing and getting
involved in SAS would be an implementation detail ;).
we have already discussed this point (in multi6) and imho it is not
such a good idea to have the hosts receive a full BGP feed.. i
think there was some kind of consensus on this point, but maybe
that has changed now...
BGP feed (not advertising anything) is simply /one/ option - I gave a
list of possibilities, using a routing protocol was just one of them.
Using BGP would only be suitable for enterprise sites (presuming
shim6 allows for a split/proxy mode).
Eg, DSL connected hosts: My experience, with my ISP, is that the only
failures I actually notice are telco/local-loop related, or related
to the DSL cable run in my house - and my PPP stack (which implements
a type of keepalive) is the first to notice.
Maybe I just have a good ISP.
Oh, there's yet another option, in the long-run: If your ISP(s) suck
so much that odd uni-directional path failures are regular enough in
occurrence -> drop that ISP and go to another.
The idea is that shim performs e2e failure detection (as i already
mentioned earlier) so that the fate of the communicating parties
i.e. the apps is shared with the fault tolerance mechanism and that
the shim can detect all the potential outages and recover from
them. Having different mechanisms deal with different types of
failures would result in a reduced protection i guess
I can see why the goal is attractive. I wonder though whether the
required complexity is worth it.
but you are assuming that host1 detects local failures through
other means
Correct.
and these other means are likely to be injecting a full
BGP feed into end hosts, right?
No, that's not likely at all.
Your math is correct
Phew, cause it's been a while ;).
now let me ask: how many unidirectional address pairs are available
between two hosts having y and n addresses each one?
i guess that we agree that there are 2*y*n different
unidirectional address pairs, right?
Right.
So the point is: if you want to provide full fault tolerance you
need to explore them all, if you don't, there may exist available
paths that you are discarding, hence there are communications that
could be preserved but you are not finding the available path to
use.
Right.
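
To get a feel for how the probing space grows, here is a small sketch that enumerates every ordered (source, destination) locator pair in both directions. The locator names are placeholders. With the same number n of locators on both hosts, each direction has n^2 pairs.

```python
from itertools import product


def unidirectional_pairs(locs_a, locs_b):
    """List every (source, destination) locator pair, in both directions."""
    a_to_b = list(product(locs_a, locs_b))  # host A sending to host B
    b_to_a = list(product(locs_b, locs_a))  # host B sending to host A
    return a_to_b + b_to_a


host1 = ["A1", "A2"]          # placeholder locators for Host1
host2 = ["B1", "B2", "B3"]    # placeholder locators for Host2
pairs = unidirectional_pairs(host1, host2)
print(len(pairs))             # 12: 2 * 3 pairs in each direction
```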
Is it worth it though?
You're coming from a position where you do not wish to have to rely
on the stability of internet routing. In your world the "internet
cloud" is likely to be swiss-cheese (full of "black holes" ;) ), and
can't be relied upon.
That isn't in line with my experience of the internet. IMHO, it's
reliable "enough". Further, if there are reliability problems in
internet routing, then surely the best way to deal with them is in
the /routing area/ working groups? ;)
Ie, is it worth it?
Unreliable ISPs will (eventually) be taken care of by market
pressures. They'll either fix their problems, maybe take advantage of
graceful-failure mechanisms which are becoming more prevalent in
routing protocols and implementations, or they will slowly die as
their customers move elsewhere.
Another factor to consider in routing reliability, that (IMHO)
obviates the need to worry so much about it in shim6: VoIP. As more
and more telcos switch over to a unified IP core for both their voice
and data services, they're putting ever more resources into researchers,
implementors and the IETF to make routing 'perfect', at least for
intra-AS - they can't tolerate packet loss or odd latencies because
customers will *hear* it. In time I suspect even inter-AS routing
will be optimised to be much more stable than it is today, as
eventually (i suspect) VoIP inter-telco peering will replace SS7 (and
they'll get researchers/implementors and IETF to optimise that case
too).
Anyway, I think that's a fairly exhaustive explanation of why I think
perfection in using available paths should /not/ be a core shim6
objective.
I'll stop harping on on that topic. I would like to see justification
as to why it should be though.
Of course, you may have optimizations, like testing two
unidirectional paths (one in each directions) with a single packet
exchange (2 packets instead of 4), using local information for
discarding some addresses and so on, but again these are only
optimizations.
You could I guess. I wouldn't, but you could. I suspect most people
would be happy with just assuming the internet "cloud" is mostly
reliable.
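
As a back-of-the-envelope illustration of the "2 packets instead of 4" optimisation, the sketch below counts packets for one full sweep of all pairs. The cost model is an assumption: testing one unidirectional pair in isolation is taken to cost a probe plus a reply, and a combined exchange is taken to test one forward and one reverse pair with two packets.

```python
def probe_packets(y, n, combined=False):
    """Packets needed to test every unidirectional pair once.

    Assumed cost model (not from any draft): an isolated test of one pair
    costs 2 packets; a combined exchange covers one forward and one reverse
    pair with 2 packets.
    """
    pairs = 2 * y * n                 # y*n pairs in each direction
    if combined:
        return (pairs // 2) * 2       # one 2-packet exchange per two pairs
    return pairs * 2                  # one 2-packet test per pair


print(probe_packets(2, 2))                 # 16
print(probe_packets(2, 2, combined=True))  # 8
```

Either way the count still grows with the product of the two locator counts, so the optimisation halves the constant, not the order.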
Particularly: If it means that shim6 drafts are simplified, easier to
write, easier to get reviewed and approved, easier to implement the
basic functionality, etc.. then that means shim6 gets "to market"
(ick) quicker.
If I can get to use shim6 in N years because it's simple, rather than
N+2 because it tries to cope with every possible path-failure..
how? i guess that you are assuming a BGP feed on hosts right?
No, I was assuming a split-mode of operation, with
shimmed-address/ULID using hosts not doing the shimming, but edge
'shim routers' doing it (hence BGP would be confined on those edges).
The drafts consistently refer to shim sitting on each host. So my
assumption appears to be wrong. Though, I don't see why split-mode
would not be possible (particularly if ULIDs are IPv6 addresses and
composed of a prefix and host identifier..).
I mean, a host with a single default route wouldn't really know
which of the n addresses available on its single interface to use
for a given destination address... i mean DAS would not result in
any particular address and the source address would be selected
randomly...
There are many other possible mechanisms.
A host could have the following default route:

    default via ISP1-gateway
            via ISP2-gateway
    ISP1-gateway device X src ISP1-PA-address
    ISP2-gateway device X src ISP2-PA-address

Some external mechanism could update this route as required. Be it
gateway-probing, probing "well known hosts on the internet", a RIP
default announced from border routers, or even an application which
monitors route-lookups or the OS route-cache and probes those and
updates routes accordingly.
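
As one illustration of such an external mechanism, here is a minimal gateway-probing sketch. It only decides which (gateway, source) pair should be the default; actually installing the route is OS-specific and left out. The function name and the liveness callback are hypothetical, and the gateway and address names are the placeholders used in the route example.

```python
def select_default(routes, is_alive):
    """Return the first (gateway, source) whose gateway passes the check.

    `routes` is an ordered list of (gateway, source_address) pairs and
    `is_alive` is any callable that probes a gateway and returns a bool.
    Returns None if no gateway responds.
    """
    for gateway, source in routes:
        if is_alive(gateway):
            return gateway, source
    return None


ROUTES = [
    ("ISP1-gateway", "ISP1-PA-address"),
    ("ISP2-gateway", "ISP2-PA-address"),
]

down = {"ISP1-gateway"}   # pretend the probe to ISP1's gateway failed
print(select_default(ROUTES, lambda gw: gw not in down))
# ('ISP2-gateway', 'ISP2-PA-address')
```

The source address follows from the chosen gateway, so nothing above the routing table needs to know why the source changed.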
The wealth of possibilities if you at least /allow/ the
source-selector mechanism to be external to shim seems a compelling
reason for shim6 to /not/ (by default) get involved.
So far, we are assuming that the shim is a host based approach and
that each host performs all the functions of the shim
Yes.
the case of the proxy that you are mentioning has been considered
and it is attractive but it presents some difficulties, especially
w.r.t security... perhaps you could try to consider the security
implications of that split that you are considering...
There is one case where 'split' or 'proxy' mode shimming would be
possible without security ramifications, I think. The case where the
ULIDs in use are IPv6 addresses, composed of a network prefix and host identifier.
Then a simple stateless static mapping on the "shimmers" (which would
be gateways into/out of the "shimmed" ULID network) will do. Security
then is simply not a concern, no more than it is for a normal router
with a static forwarding table.
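
A minimal sketch of what such a stateless mapping could look like, assuming a /64 network prefix and a 64-bit host identifier. The function name is made up and the addresses come from the 2001:db8::/32 documentation range; this is not taken from any shim6 draft.

```python
import ipaddress

HOST_BITS = 64  # assumption: /64 prefix plus 64-bit host identifier


def rewrite_prefix(address, new_prefix):
    """Swap the network prefix of an IPv6 address, keeping the host identifier."""
    addr = int(ipaddress.IPv6Address(address))
    net = ipaddress.IPv6Network(new_prefix)
    if net.prefixlen != 128 - HOST_BITS:
        raise ValueError("this sketch only handles /64 prefixes")
    host_id = addr & ((1 << HOST_BITS) - 1)          # low 64 bits
    return str(ipaddress.IPv6Address(int(net.network_address) | host_id))


# ULID prefix -> locator prefix, same host identifier on both sides
print(rewrite_prefix("2001:db8:aaaa:1::42", "2001:db8:bbbb:2::/64"))
# 2001:db8:bbbb:2::42
```

The mapping keeps no per-flow state, which is the property the argument above relies on.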
not sure what you mean here... ULID is upper layer identifier,
right? so the ULID belongs to the shim host, so i guess they are in
the same machine
Right.
As above, it seems possible to me that shim6 could allow for at least
one usage that would not require shim and "ULP" to be on the same machine.
It could be very useful for small/medium size multihomed sites (ie
not large enough to get a global IPv6 prefix).
regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
Fortune:
Don't let your status become too quo!