
Re: Failure Detection (was Re: soft state (was Re: shim6 and bit errors in data packet headers



On 4-jun-2005, at 18:11, marcelo bagnulo braun wrote:

On the other hand, I wouldn't necessarily put too much trust in what weird ULPs have to say. But as long as what they have to say can only help or hurt themselves I don't really care, of course.

That brings up an interesting issue: what if we have multiple ULPs using the same session and they provide different feedback?

For instance, a simple case would be that some apps are more sensitive than others, so they will complain sooner. A more complex case could be that one app complains while another provides positive feedback (suppose, for instance, that the failure is at the app level and not in the path). How do we deal with this?

I think we have to be prepared for the situation where a ULP provides incorrect feedback.


IMHO, ULP feedback should result in an explicit reachability test on the current locator pair, i.e. ULP feedback does not directly imply rehoming, but rather a verification of the current locator pair through a reachability test exchange.

Yes. But we also rate limit these tests to avoid excessive probing. For instance, use exponential backoff for negative feedback. So if a ULP complains, the first time we do a reachability test immediately. If nothing turns out to be wrong that time and it then complains again within X seconds (60 or so), we schedule a reachability test for 10 seconds from now. If it then complains again within 60 seconds after that test (i.e. within 70 seconds of the second complaint), the next one is scheduled in 20 seconds. And so on, until we pretty much don't listen to that ULP any more.
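
A rough sketch of that backoff in Python, just to make the timing concrete. The class and parameter names are made up for illustration, and one thing the description above leaves open is what to do with a complaint that arrives while a test is still pending; here it is assumed to be folded into the pending test.

```python
class UlpFeedbackLimiter:
    """Rate limits reachability tests triggered by one ULP's complaints.

    Hypothetical sketch: window (X) and base_delay follow the 60 s / 10 s
    figures from the mail; everything else is an assumption.
    """

    def __init__(self, window=60.0, base_delay=10.0):
        self.window = window          # X: how long after a test a complaint counts as "again"
        self.base_delay = base_delay  # delay used for the first repeated complaint
        self.last_test_time = None    # time the latest test ran or is scheduled to run
        self.next_delay = base_delay

    def on_negative_feedback(self, now):
        """Return the time at which the reachability test should run."""
        if self.last_test_time is not None and now < self.last_test_time:
            # Assumption: a test is already pending, so do not schedule another.
            return self.last_test_time
        if self.last_test_time is None or now - self.last_test_time > self.window:
            # First complaint, or the ULP has been quiet long enough: test now.
            delay = 0.0
            self.next_delay = self.base_delay
        else:
            # Repeated complaint shortly after a test: back off exponentially.
            delay = self.next_delay
            self.next_delay *= 2
        self.last_test_time = now + delay
        return self.last_test_time


limiter = UlpFeedbackLimiter()
print(limiter.on_negative_feedback(0.0))   # immediate test
print(limiter.on_negative_feedback(30.0))  # scheduled 10 s later
print(limiter.on_negative_feedback(70.0))  # scheduled 20 s later
```

Positive feedback is left out on purpose, since its role is still an open question.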


I'm not sure how positive feedback enters the equation, except maybe to cancel pending scheduled reachability tests.

Failure detection hints including:
- ULP negative feedback
- Tx>0 and Rx=0
- Receive a reachability test exchange from the peer (do we still need this?)

Yes, if the ULP doesn't provide hints we can't discover a unidirectional reachability failure from here to the other end, so we need to depend on the other end to detect those.


And if the other end started the exchange already with one extra packet we know what's going on too. Even if we don't really need to know what's up with the connectivity it might still be good to measure the RTT.

- ICMP error
- SHIM error (?)

As the result of any of these hints, a reachability test exchange is performed using the current locator pair

Yes. Only exception to this could be when the other side starts a reachability test because they don't hear from us anymore and we know we sent them stuff recently. In that case it's almost certain that the current pair in that direction doesn't work anymore.
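
To illustrate the rule and its one exception, here is a small Python sketch. The hint and action names are hypothetical, and the 10 second cutoff for "recently" is only an assumed placeholder, not something that has been agreed on.

```python
from enum import Enum, auto


class Hint(Enum):
    ULP_NEGATIVE_FEEDBACK = auto()
    TX_WITHOUT_RX = auto()   # Tx>0 and Rx=0
    PEER_STARTED_TEST = auto()
    ICMP_ERROR = auto()


class Action(Enum):
    TEST_CURRENT_PAIR = auto()
    EXPLORE_ALTERNATIVES = auto()


def action_for_hint(hint, seconds_since_last_send, recent=10.0):
    """Pick what to do with a failure detection hint.

    seconds_since_last_send is None if we have not sent anything yet.
    recent is an assumed threshold for "we sent them stuff recently".
    """
    sent_recently = (seconds_since_last_send is not None
                     and seconds_since_last_send <= recent)
    if hint is Hint.PEER_STARTED_TEST and sent_recently:
        # The peer is not hearing us although we are sending: the current
        # pair almost certainly fails in our direction, so skip testing it.
        return Action.EXPLORE_ALTERNATIVES
    # Every other hint only triggers a test of the current locator pair.
    return Action.TEST_CURRENT_PAIR


print(action_for_hint(Hint.ULP_NEGATIVE_FEEDBACK, 1.0))
print(action_for_hint(Hint.PEER_STARTED_TEST, 1.0))
```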


If the reachability test succeeds, then keep on using the current locator pair.
If the reachability test fails, then start the alternative locator pair exploration process.

When a working alternative locator pair is found, then rehome the communication.

Did I miss something?

We need to avoid bad paths. So just selecting the first one that works could be dangerous. Obviously it helps to be smart about what to try first. For instance, if host A notices that for 10 sessions to different destinations with source or dest A1 there are (potential) problems while A2 is doing fine, try with A2 first of course. (And include information in the probe that tells the other side to try A2 first as well.)
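
As a sketch of "being smart about what to try first", the following Python orders candidate pairs so that local locators involved in fewer suspect sessions come first. The function name and the input shapes are assumptions made for the example; telling the other side which locator to prefer is not covered.

```python
from collections import Counter


def order_candidate_pairs(pairs, suspect_locators):
    """Order (local, remote) locator pairs for probing.

    pairs: candidate (local_locator, remote_locator) tuples.
    suspect_locators: one entry per session with (potential) problems,
        naming the local locator that session was using (assumed input).
    """
    problems = Counter(suspect_locators)
    # sorted() is stable, so pairs with the same problem count keep
    # their original relative order.
    return sorted(pairs, key=lambda pair: problems[pair[0]])


pairs = [("A1", "B1"), ("A2", "B1"), ("A1", "B2"), ("A2", "B2")]
# Ten sessions using A1 look troubled, none using A2.
print(order_candidate_pairs(pairs, ["A1"] * 10))
```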


In addition, we could send out probes at an interval that's only slightly higher than the RTT. This means that if the RTT for the address pair we're testing is lower than the current one, we have an answer before we send out the second probe, and it's likely we found a good path. If the answer takes longer than the RTT that (say) 80% of packets on the old path stay under, considering the stddev, it's likely we're trying a bad path, so sending out a second probe is good. Even if we get an answer for the first one after sending the second one, if the second one uses a better path we'll get that answer shortly after that and we can immediately rehome to that path if it looks better.
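
One possible way to turn the "80% considering the stddev" idea into a number, as a Python sketch. It assumes RTT samples on the old path are roughly normally distributed, which is a simplification and not something established here; the function names are made up.

```python
from statistics import NormalDist


def rtt_threshold(mean_rtt, stddev_rtt, probability=0.8):
    """RTT value that `probability` of old-path samples stay under.

    Assumes a normal distribution of RTTs (a simplifying assumption).
    """
    z = NormalDist().inv_cdf(probability)  # quantile of the standard normal
    return mean_rtt + z * stddev_rtt


def should_send_second_probe(elapsed, mean_rtt, stddev_rtt):
    """True once the first probe has been outstanding suspiciously long."""
    return elapsed > rtt_threshold(mean_rtt, stddev_rtt)


# Old path: mean RTT 100 ms, stddev 20 ms (made-up figures).
print(rtt_threshold(0.100, 0.020))
print(should_send_second_probe(0.090, 0.100, 0.020))
print(should_send_second_probe(0.150, 0.100, 0.020))
```

With these made-up figures the threshold comes out a little under 117 ms, so the second probe goes out only slightly later than one old-path RTT.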

Iljitsch