
Coordination in FNP [was RE: draft-rabbat-fault-notification-protocol-04.txt]



Hi Deborah,

Thanks for your interest in FNP. My responses to your comments are inline.

> -----Original Message-----

[parts deleted]

> 
> The fatal negative with the FNP approach is that the use of the protection
> path is not coordinated - no handshake between the two ends (and
> intermediate nodes) for use of the protection path. "All nodes notified of
> the failure will activate the recovery path by performing the required
> hardware reconfiguration". And the ingress node starts sending user
> traffic after an elapsed time window. This uncoordinated use of the
> protection path guarantees user traffic will be misconnected -
> unacceptable for an operator.
> 
[Richard] I want to reemphasize that FNP is not a "free for all" kind of
notification. Rather, it relies on a careful selection of notification paths
and times that ensures the notification information reaches the right nodes
within the required time.  The key to coordination in FNP is the precise
selection of restoration paths for any single or multiple faults.

The extent to which one meets the timing bounds is determined by the
criteria used to pick the restoration paths. For example, if a carrier plans
for all single faults, then that carrier will recover from any single fault
within the prescribed time bounds. Please see the end of section 6 for how
the scheme works.

> The key requirement in the P&R DT work was that misconnections are not
> allowed, and is why the DT's approach uses coordinated signaling to notify
> all nodes along the path. The DT's approach is referred in this draft as
> incurring "lengthy delay" vs. FNP.

[Richard] The draft is not referring to the DT's approach, which is not
concerned with the problem of time-bounded notification.

You can go to version -00 of the draft, dating back to June 2002, and read
our comparison of two solutions for time-bounded notification: a theoretical
signaling-based solution vs. FNP.  The signaling approach referred to in this
draft is that theoretical solution for the problem we describe, not the DT's
approach.

In our I-D, "lengthy delay" in section 1, para 4, refers to path-based vs.
link-based restoration; sorry for the confusion. It's a general observation
about different protection techniques.

> Another draft for your attention is draft-rabbat-optical-recovery-reqs.
> Requirement 8 states "A recovery scheme SHOULD make sure that recovery
> actions correctly move traffic from failed paths to their respective
> recovery paths, such that the recovery actions do not result in long-term
> misconnections". This requirement needs to be reworded to "SHALL" and
> "long-term misconnections" to "any misconnections".
> 
[Richard] Misconnections are a fact of life in any carrier network.  Every
carrier requires the ability to detect and remediate misconnections.
Misconnections can happen for a variety of reasons. For example, the
following issues, unrelated to P&R in general, may lead to misconnections:

1- Bugs in the control plane which result in the control plane not
initiating the right cross-connect in the data plane: this could occur at
path setup phase as well as during restoration, no matter what technique one
uses for notification and coordination.
2- Bugs in the data plane such as a corrupted cross-connect table, notably
in the case of swapped entries.
3- Incorrect or failed operation of the squelching function.

That is why the requirement mentions long-term misconnections. We've been
trying to find the appropriate wording. Would you agree with replacing
"long-term" with "persistent" misconnections?

As far as FNP is concerned, the only time at which misconnections are even
likely is when you have at least two faults which are very close together in
time, for example, less than 100-200 ms apart. Even in that case, one can
only get misconnections if the backup paths of the faults share the same
resources, for example a common link, *and* the following situation occurs:
a node that receives the 1st notification message initiates a
cross-connection, then gets a new message that asks it to cross-connect
differently, but doesn't have enough time to squelch the traffic.  If that
happens, the trace function, which is available to detect all kinds of other
misconnections, will also be used here, and traffic will not be delivered to
unintended recipients.
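
As a rough illustration only, the three conditions above can be written as a
simple check. This is a sketch under stated assumptions, not anything taken
from the draft: the Fault type, its field names, the link labels, and the
200 ms default window are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    """Hypothetical record of one fault (not a structure from the draft)."""
    time_ms: float           # time at which the fault occurred
    backup_links: frozenset  # links used by this fault's backup path


def misconnection_possible(first, second, window_ms=200.0,
                           squelched_in_time=False):
    """Return True only if all three conditions in the text hold.

    window_ms is an assumed value taken from the "100-200 ms" example.
    """
    # Condition 1: the two faults are very close together in time.
    close_in_time = abs(second.time_ms - first.time_ms) < window_ms
    # Condition 2: the backup paths share a resource, e.g. a common link.
    shared_resource = bool(first.backup_links & second.backup_links)
    # Condition 3: the node re-cross-connects without squelching in time.
    return close_in_time and shared_resource and not squelched_in_time


# Two faults 150 ms apart whose backup paths both use link "B-C".
f1 = Fault(time_ms=0.0, backup_links=frozenset({"A-B", "B-C"}))
f2 = Fault(time_ms=150.0, backup_links=frozenset({"B-C", "C-D"}))
print(misconnection_possible(f1, f2))
```

If any one of the three conditions fails (the faults are further apart, the
backup paths are disjoint, or the node squelches in time), the check returns
False.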

If you have thought of other scenarios, perhaps we can discuss them with a
specific example/figure.

> Deborah
> 
Thanks,
Richard

[parts deleted]