
Comments on REAP draft




The protocol as described by the state machine doesn't seem to work for ULPs
that, even when they are "unhappy", space their retransmissions more than
3 seconds apart. This is because, starting in "idle", when the ULP sends one
packet we move to "suspected peer problem", but after 3 seconds ("outgoing
timeout") we move back to "idle".
I don't think the protocol should make any assumptions about how frequently
an unhappy ULP might retransmit.
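To make the concern concrete, here is a toy model containing only the transitions described above: a ULP packet sent from "idle" moves us to "suspected peer problem", and 3 seconds after the last outgoing packet we fall back to "idle". The detection timer (DETECTION_TIMEOUT, 10 seconds here) and the function name are hypothetical assumptions standing in for whatever would eventually start failure detection; they are not taken from the REAP draft. The peer is assumed unreachable, so nothing ever arrives.

```python
OUTGOING_TIMEOUT = 3.0    # from the draft's state machine: "suspected" -> "idle"
DETECTION_TIMEOUT = 10.0  # hypothetical: time in "suspected" before detection starts


def detects_failure(send_times, outgoing_timeout=OUTGOING_TIMEOUT,
                    detection_timeout=DETECTION_TIMEOUT):
    """Return True if the toy model ever starts failure detection.

    Assumes no incoming packets arrive, i.e. the ULP is "unhappy".
    """
    entered = None    # time we entered "suspected peer problem"; None means "idle"
    last_send = None  # time of the most recent outgoing ULP packet
    for t in sorted(send_times) + [float("inf")]:
        if entered is not None:
            back_to_idle = last_send + outgoing_timeout
            detect_at = entered + detection_timeout
            if detect_at <= min(back_to_idle, t):
                return True   # still "suspected" when the detection timer fired
            if back_to_idle <= t:
                entered = None  # outgoing timeout fired first: back to "idle"
        if t == float("inf"):
            break
        if entered is None:
            entered = t       # ULP packet sent while "idle"
        last_send = t
    return False
```

With a retransmission every second the model reaches detection at t=10, but with one every 4 seconds each retransmission starts over from "idle" and detection never happens, no matter how long the ULP keeps trying.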

When I asked Jari about this before, I think he said it was an attempt to
avoid an FBD keepalive message after the ULPs go silent (and are happy).
But I think this is fundamentally unavoidable. If the ULPs finish exchanging
packets, then I don't see how we can avoid having one end (the one that last
received a ULP packet) send a single FBD keepalive to the peer.
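A minimal sketch of that rule, assuming each end tracks only the times of its last sent and last received ULP packet; the Endpoint class, its methods and the keepalive_timeout parameter are hypothetical names for illustration. Only the end whose most recent ULP event was a reception owes the peer a keepalive, and it sends just one per quiet period.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """One end of a context, tracking only ULP payload timestamps."""
    last_sent: float = float("-inf")
    last_received: float = float("-inf")
    keepalive_sent: bool = False

    def on_send(self, t: float) -> None:
        self.last_sent = t
        self.keepalive_sent = False   # a ULP packet is itself feedback to the peer

    def on_receive(self, t: float) -> None:
        self.last_received = t
        self.keepalive_sent = False   # new data from the peer: a report may be owed

    def poll(self, now: float, keepalive_timeout: float) -> bool:
        """Return True if a single keepalive should be sent at time `now`."""
        due = (not self.keepalive_sent
               and self.last_received > self.last_sent
               and now - self.last_received >= keepalive_timeout)
        if due:
            self.keepalive_sent = True
        return due
```

If A sends the last ULP packet at t=2, then polling both ends at t=10 with a 3-second timeout yields a keepalive from B only, and B's next poll returns False until more ULP traffic occurs.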

I don't understand the comment text that says:
   The REAP design allows performing both failure detection and address
   pair exploration in the same sequence of messages, without a need to
   designate a specific point when the current address pair is declared
   inoperational and the search for a new pair begins.
It seems like the timeout event that triggers a message with Y=no is quite
different from the one that triggers a message with Y=yes. And the one that
causes a message with Y=no is the point at which the current address pair
has been declared inoperational. So I'm not convinced it is helpful to
be able to have the "FBD keepalive" (the payload reception report) and
the event reception report in the same shim6 message.

Is there an example of a case where both a payload reception report and an
event reception report are included in the same packet?
I can't find one. Should this happen, it would be an indication that
payloads continue to be sent as alternate locators are used. In that case the
payload reception report might not contain enough information to tell which
locator pair was used; the locator pair might have switched due to the
exploration. So *if* there is a case when payload reception reports and event
reception reports are used at the same time, then the payload reception
report needs to include the locator pair(s) on which the payload(s) were
received.
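One way to picture that extra information is a report that records the locator pairs on which payloads actually arrived. This is a hypothetical sketch: the class names, fields and the working_pairs helper are assumptions for illustration, not message formats from the REAP draft.

```python
from dataclasses import dataclass, field

Locator = str  # hypothetical: a locator written as an address string


@dataclass(frozen=True)
class LocatorPair:
    src: Locator  # peer's locator (source of the received payload)
    dst: Locator  # our locator (destination of the received payload)


@dataclass
class PayloadReceptionReport:
    # Hypothetical extension: which locator pair(s) carried payloads since
    # the last report, so the sender can tell which pairs still work even
    # if it switched pairs during exploration.
    pairs_seen: set = field(default_factory=set)

    def record(self, src: Locator, dst: Locator) -> None:
        self.pairs_seen.add(LocatorPair(src, dst))


def working_pairs(report: PayloadReceptionReport, pairs_tried: list) -> list:
    """Pairs the sender tried that the peer reports having received on."""
    return [p for p in pairs_tried if p in report.pairs_seen]
```

If A sends payloads on a1->b1 and then, during exploration, on a2->b1, a report listing only a2->b1 tells A which of the two pairs worked; a bare "payloads received" report could not.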


Looking at the packet diagram on page 16, I was also confused about who
actually detects the need to send a Y=no event.
I think there are three cases to look at in more detail:
	Locator pair stops working from A to B
	Locator pair stops working from B to A
	Locator pair stops working in both directions

The state machine looks very complex (perhaps because I don't understand it?).
For instance, do we really need that many different timers?
The state names are hard for me to understand. For instance, it isn't really
the peer that has a problem - the problem lies in the usage of a locator
pair in a particular direction. And one failure mode is when the current
locator pair stops working in both directions, which doesn't seem to be
captured in a separate state.

I think the protocol description can be simpler than the state machine description.
I'll try to outline an event-based description in a separate email.

   Erik