
Re: failure detection

To: Paul Jakma <paul@clubi.ie>
Subject: Re: failure detection
From: marcelo bagnulo braun <marcelo@it.uc3m.es>
Date: Sat, 20 Aug 2005 07:34:43 +0200
Cc: shim6 <shim6@psg.com>
In-reply-to: <Pine.LNX.4.63.0508191456120.5291@sheen.jakma.org>
References: <8622E6A4-B0D7-4C9B-B184-8EB2A7C2738E@muada.com> <Pine.LNX.4.63.0508141523170.7023@sheen.jakma.org> <efebcb5728efd81901d5357b3993b6db@it.uc3m.es> <Pine.LNX.4.63.0508171556080.5353@sheen.jakma.org> <efa6464a563345cc24542d6ab48f3538@it.uc3m.es> <Pine.LNX.4.63.0508171932550.5353@sheen.jakma.org> <0f13bcc353755a4b9b965267a6a7ffb1@it.uc3m.es> <Pine.LNX.4.63.0508181034240.5291@sheen.jakma.org> <d1bbabb2d2a04821223d24f940796d23@it.uc3m.es> <Pine.LNX.4.63.0508181513480.5291@sheen.jakma.org> <4eb5dc3a95d2217a22ab1d81e23fd10d@it.uc3m.es> <Pine.LNX.4.63.0508191456120.5291@sheen.jakma.org>


On 19/08/2005, at 17:11, Paul Jakma wrote:

On Fri, 19 Aug 2005, marcelo bagnulo braun wrote:
Why must host1 detect this? Host2 could also ;).
not in a unidirectional connectivity scenario
consider the case where the failure implies that:
PrefA:Host1 -> Host2 is not working
PrefB:Host1 -> Host2 is working
Host2 -> PrefA:Host1 is working
Host2 -> PrefB:Host1 is not working
How would you cope with this case?
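
A minimal sketch of the scenario above, written as a table of directed paths. The locator labels ("A:H1", "B:H1", "H2") and the helper names are made up for illustration; nothing here is taken from a shim6 draft. It shows that no single address pair works in both directions, even though one working path exists in each direction:

```python
# Directed reachability from the scenario: (source, destination) -> works?
# "A:H1" and "B:H1" stand for Host1's PrefA and PrefB locators, "H2" for Host2.
reachable = {
    ("A:H1", "H2"): False,  # PrefA:Host1 -> Host2 is not working
    ("B:H1", "H2"): True,   # PrefB:Host1 -> Host2 is working
    ("H2", "A:H1"): True,   # Host2 -> PrefA:Host1 is working
    ("H2", "B:H1"): False,  # Host2 -> PrefB:Host1 is not working
}

host1_locators = ["A:H1", "B:H1"]
host2_locators = ["H2"]


def bidirectional_pairs(table, locs1, locs2):
    """Address pairs that work in both directions with the same two locators."""
    return [
        (a, b)
        for a in locs1
        for b in locs2
        if table.get((a, b), False) and table.get((b, a), False)
    ]


def working_directed_paths(table, sources, destinations):
    """Directed (source, destination) paths that currently work."""
    return [
        (s, d)
        for s in sources
        for d in destinations
        if table.get((s, d), False)
    ]


both_ways = bidirectional_pairs(reachable, host1_locators, host2_locators)
h1_to_h2 = working_directed_paths(reachable, host1_locators, host2_locators)
h2_to_h1 = working_directed_paths(reachable, host2_locators, host1_locators)

print(both_ways)  # []
print(h1_to_h2)   # [('B:H1', 'H2')]
print(h2_to_h1)   # [('H2', 'A:H1')]
```

Communication can only continue here if Host1 sends from PrefB while Host2 sends to PrefA, which is why the two directions have to be tested separately.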
How important is this case?
Further, in your scenario, this was due to a local-failure near Host1. A failure which can easily be detected locally without any need for n^2 probing.
What's needed is:
- Host1 to detect the local failure and update the exit path to use
  (and hence the source to use)
	- this is achievable in multiple ways
	- none of which need be in shim6
	- none of which require shim6 to be aware of SAS or egress
	  issues
- Host2 shim6 to detect host1's valid locators have changed
	- Maybe because it receives a packet from Host1 with a new
	  source
	- Maybe because Host2's reachability probes detect PrefB
How common is this failure mode?
You want to specify that shim6 be able to work around /any/ kind of routing failure, anywhere on any part of the internet affecting any path between Host1 and Host2.
My gut feelings though are:
- Failures typically are near the edges
- Failures are typically bi-directional for a given path
- Uni-directional failures tend to be due to /congestion/, not
  actual failures - again, typically at the edges. Congestion related
  "failures" tend to be very transient/sporadic.
- Failures in the 'middle' are uncommon, and tend to affect /huge/
  numbers of paths (ie there's a decent chance it will take out /all/
  your paths)
- The problem of uni-directional failure on two /unrelated/ paths at
  the same time is *tiny*
Hence (as a gut feeling):
- n^2 probing in shim6 is simply introducing huge expense in order to
  solve a very uncommon problem
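
To put a rough number on the expense, the sketch below counts the address pairs involved. It assumes three locators per host (PrefC and PrefX/Y/Z are hypothetical, added only to make the count visible), and it models the simple alternative as "the OS has already fixed the source, so only remote locators are tried", which is one reading of the argument rather than anything specified:

```python
from itertools import product


def full_mesh_pairs(local_locators, remote_locators):
    """Every (source, destination) combination a full-mesh prober would test."""
    return list(product(local_locators, remote_locators))


def os_selected_source_pairs(chosen_source, remote_locators):
    """Pairs tested when the OS routing table has already fixed the source."""
    return [(chosen_source, remote) for remote in remote_locators]


# Hypothetical locator sets, three per host.
local = ["PrefA:Host1", "PrefB:Host1", "PrefC:Host1"]
remote = ["PrefX:Host2", "PrefY:Host2", "PrefZ:Host2"]

mesh = full_mesh_pairs(local, remote)
simple = os_selected_source_pairs(local[0], remote)

print(len(mesh))    # 9
print(len(simple))  # 3
```

The full mesh grows as the product of the two locator counts, while the fixed-source variant grows only with the number of remote locators, at the price of missing failures that a different source would have worked around.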
You think the tradeoff in order to achieve perfection is worth it.
I don't. I think the above is a general quality-of-internet-routing problem. I think it's something that should and will be tackled within the routing area, where people have been and are continuing to work on optimising routing protocols (from OSPF to BGP) to cope gracefully with failures and restarts, in order to eliminate some common scenarios where routing loops can occur in today's routing protocols.

I don't see a compelling reason to consider problems in internet routing to be something shim6 needs to introduce great complexity for in order to work around, when a simple approach (let underlying OS routing pick the local prefix) will likely allow 99% of failures to be detectable and worked around.


ok, i guess we have come to the key point here.

We agree that the mechanism proposed for the shim is what is needed to deal with all failure modes and to identify whether there is at least one working path, right?

We seem to disagree about whether the cost that this implies is worth it, right?

you seem to consider that there are simpler methods that would deal with a significant amount of the common failure modes, in particular the one you detail above.

I guess RFC3178 probably already provides a reasonable solution with the protection level that you ask for. I mean, RFC3178 protects against failures at the edges in a transparent fashion

...


There are many other possible mechanisms.

A host could have the following default route:

default via ISP1-gateway
	via ISP2-gateway

what if there is a single router on a link of the multihomed site? i mean, you cannot assume that on every link of the multihomed site there will be as many routers as ISPs the site is multihomed to, right?

At this point, i guess you end up requiring source-address-based routing in the multihomed site, so that the end host can force routing through the selected exit ISP. The source address is then what actually selects the exit ISP, and hence it is the shim that selects the source address, i guess

ISP1-gateway device X src ISP1-PA-address
ISP2-gateway device X src ISP2-PA-address
Some external mechanism could update this route as required. Be it gateway-probing, probing "well known hosts on the internet", a RIP default announced from border routes, or even an application which monitors route-lookups or the OS route-cache and probes those and updates routes accordingly.

The wealth of possibilities if you at least /allow/ the source-selector mechanism to be external to shim seems a compelling reason for shim6 to /not/ (by default) get involved.
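
A minimal sketch of what source-address-based exit selection amounts to, assuming one PA prefix per ISP. The prefixes are taken from the 2001:db8::/32 documentation range, and the table and function names are hypothetical; the gateway labels are the ones used in the routes above:

```python
import ipaddress

# Hypothetical PA prefixes (documentation range), one per upstream ISP.
EXIT_BY_SOURCE_PREFIX = {
    ipaddress.ip_network("2001:db8:1::/48"): "ISP1-gateway",
    ipaddress.ip_network("2001:db8:2::/48"): "ISP2-gateway",
}


def select_exit(source_address):
    """Return the exit gateway whose PA prefix contains the source address."""
    addr = ipaddress.ip_address(source_address)
    for prefix, gateway in EXIT_BY_SOURCE_PREFIX.items():
        if addr in prefix:
            return gateway
    return None  # source does not belong to any known PA prefix


print(select_exit("2001:db8:1::10"))  # ISP1-gateway
print(select_exit("2001:db8:2::10"))  # ISP2-gateway
```

Whoever picks the source address picks the exit, so the open question in this exchange is only which component (the shim or something external to it) makes that choice.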

So far, we are assuming that the shim is a host based approach and that each host performs all the functions of the shim
Yes.
the case of the proxy that you are mentioning has been considered, and it is attractive, but it presents some difficulties, especially w.r.t. security... perhaps you could try to consider the security implications of the split that you are proposing...
There is one case where 'split' or 'proxy' mode shimming would be possible without security ramifications, I think: the case where the ULIDs in use are IPv6 addresses, made up of a network prefix and a host identifier.

Then a simple stateless static mapping on the "shimmers" (which would be gateways into/out of the "shimmed" ULID network) will do. Security then is simply not a concern, no more than it is for a normal router with a static forwarding table.
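
One possible reading of such a stateless static mapping is a plain prefix rewrite that leaves the host identifier untouched. The sketch below assumes a 64/64 split between prefix and host identifier and uses documentation-range prefixes; the function name is hypothetical and this is not taken from any shim6 draft:

```python
import ipaddress

HOST_ID_MASK = (1 << 64) - 1  # low 64 bits: the host identifier (assumed split)


def rewrite_prefix(address, new_prefix):
    """Swap the upper 64 bits of an IPv6 address, keeping the host identifier."""
    addr = ipaddress.IPv6Address(address)
    net = ipaddress.IPv6Network(new_prefix)
    if net.prefixlen != 64:
        raise ValueError("this sketch only handles /64 prefixes")
    host_id = int(addr) & HOST_ID_MASK
    return ipaddress.IPv6Address(int(net.network_address) | host_id)


# ULID inside the site -> locator under a provider prefix (both hypothetical).
locator = rewrite_prefix("2001:db8:a::1234", "2001:db8:b::/64")
print(locator)  # 2001:db8:b::1234
```

The mapping needs no per-flow state, which is what would make it comparable to a static forwarding table.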


not sure what you mean.. are you thinking of something like GSE here?

not sure what you mean here... ULID is upper layer identifier, right? so the ULID belongs to the shim host, so i guess they are in the same machine
Right.
As above, it seems possible to me that shim6 could allow for at least one usage that would not require shim and "ULP" to be on the same machine. It could be very useful for small/medium size multihomed sites (ie not large enough to get a global IPv6 prefix).

i agree it would be useful, but i am still not sure how you would deal with the security stuff in this case...

regards, marcelo

regards,
--
Paul Jakma	paul@clubi.ie	paul@jakma.org	Key ID: 64A2FF6A
Fortune:
Don't let your status become too quo!
