RE: draft-rabbat-fault-notification-protocol-04.txt
George,
Good to see your attention is activated;-)
I've been discussing privately with the authors my concerns over the last several meetings.
The fatal flaw in the FNP approach is that use of the protection path is not coordinated: there is no handshake between the two ends (and intermediate nodes) before the protection path is used. "All nodes notified of the failure will activate the recovery path by performing the required hardware reconfiguration". The ingress node then starts sending user traffic after a time window has elapsed. This uncoordinated use of the protection path guarantees user traffic will be misconnected, which is unacceptable for an operator.
The key requirement in the P&R DT work was that misconnections are not allowed, which is why the DT's approach uses coordinated signaling to notify all nodes along the path. This draft refers to the DT's approach as incurring "lengthy delay" compared with FNP.
Another draft for your attention is draft-rabbat-optical-recovery-reqs. Requirement 8 states "A recovery scheme SHOULD make sure that recovery actions correctly move traffic from failed paths to their respective recovery paths, such that the recovery actions do not result in long-term misconnections". This requirement needs to be reworded: "SHOULD" to "SHALL", and "long-term misconnections" to "any misconnections".
Deborah
-----Original Message-----
From: owner-ccamp@ops.ietf.org [mailto:owner-ccamp@ops.ietf.org] On Behalf Of George Newsome
Sent: Tuesday, February 24, 2004 8:41 PM
To: ccamp@ops.ietf.org
Subject: Re: draft-rabbat-fault-notification-protocol-04.txt
All,
My attention was drawn to
draft-rabbat-fault-notification-protocol-04.txt, which provokes the
following comments.
1) There seems to be some notion that the time taken to restore is a
crucial element of high availability, yet overall availability is
controlled by the failure rate of unprotected elements and by mean time
to repair, rather than by switching time. (A 1 second switch is less
than 1/10000 of the generally accepted MTTR of 4 hrs.)
2) This draft seems to address the relatively simple problem of setting
up the restoration path. It seems to completely ignore the much harder
problem of allocating resources to the shared restoration path, and of
actually locating the fault in an optical network to a single span in a
time that is useful to restoration. It makes no mention of the
inaccuracies in network planning databases, which make one wonder
whether precomputation of restoration paths will actually lead to faster
restoration times. Finally, it seems to presuppose that a network
operator would make such a facilities database available to route
computation at all. The suggestion in sect 6.2 that the physical length
of the fibers be available for route computation is very unlikely in any
network I have ever worked on.
3) One must wonder whether a flooding approach is actually best anyway.
The assumption seems to be that a flooding protocol PDU can be forced
onto the front of the send queue, thereby incurring minimum delay. An
additional assumption seems to be that there is only one fault in the
network, and all bets are off if that is not true. There seem to be
problems with both these assumptions. It seems to me that there are no
mechanisms for truncating the PDU that is being sent, so there is a
finite chance that a significant extra delay is incurred. Perhaps more
serious is the assumption that all bets are off if there are multiple
faults in the network. In general, multiple faults are those that lead
to service outage. Two faults that do not interact, in that they do not
contend for the same network resources, will be coupled by the flooding.
In addition, unsuppressed restoration requests, which occur when the
fault cannot be rapidly located to a single span, will also generate
restoration messages. It also seems to me that routing changes may well
start to be flooded at the same time scale as restoration activity is
taking place. There is no mention of possible interactions with this.
4) Assuming that this problem is worth solving, and that a flooding
protocol is the best solution, is it a good idea to generate yet
another protocol that floods, and is LMP the vehicle of choice to embed
yet another protocol? It seems to me that restoration has a strong
interaction with routing change announcement, so it would make
more sense to use those mechanisms rather than invent new ones.
5) Until the effect of network database inaccuracies on the
effectiveness of precomputed restoration is better understood, the
problem of allocating resources in shared mesh networks is solved, and
it is certain that all faults will be located to the correct span in a
time useful to restoration, it seems to be premature to be proposing a
solution to the final piece of the problem.
Regards
George
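
The arithmetic behind point 1 of the message above can be checked with a short sketch. The 1 second switch time and the 4 hour MTTR come from the message; the yearly failure counts are hypothetical values chosen only for illustration, and the function names are not from either draft.

```python
# Sketch of the point 1 arithmetic: switching time vs. mean time to repair.
SWITCH_TIME_S = 1.0        # 1 second switch, as quoted in the message
MTTR_S = 4 * 3600.0        # "generally accepted MTTR of 4 hrs", in seconds


def switch_to_mttr_ratio(switch_s, mttr_s):
    """Fraction of one repair interval taken up by one protection switch."""
    return switch_s / mttr_s


def annual_downtime_s(protected_failures, unprotected_failures, switch_s, mttr_s):
    """Yearly downtime in seconds: protected failures cost one switch each,
    unprotected failures cost a full repair each."""
    return protected_failures * switch_s + unprotected_failures * mttr_s


ratio = switch_to_mttr_ratio(SWITCH_TIME_S, MTTR_S)
print(f"switch/MTTR = 1/{MTTR_S / SWITCH_TIME_S:.0f}")

# Hypothetical failure counts (assumptions, not taken from the thread):
# 10 failures per year on protected elements, 1 on an unprotected element.
downtime = annual_downtime_s(10, 1, SWITCH_TIME_S, MTTR_S)
print(f"annual downtime = {downtime:.0f} s")
```

Under these assumed counts the single unprotected failure contributes 14400 s of the total, while the ten protection switches contribute 10 s, which is the sense in which switching time is a minor term.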