
RE: draft-rabbat-fault-notification-protocol-04.txt



Hi George,

You've probably had time to review Vishal's explanations by now. My comments
on the items you raised are inline.

> -----Original Message-----
> From: owner-ccamp@ops.ietf.org [mailto:owner-ccamp@ops.ietf.org] On Behalf
> Of George Newsome
> Sent: Tuesday, February 24, 2004 5:41 PM
> To: ccamp@ops.ietf.org
> Subject: Re: draft-rabbat-fault-notification-protocol-04.txt
> 
> All,
> 
> My attention was drawn to
> draft-rabbat-fault-notification-protocol-04.txt, which provokes the
> following comments.
> 
> 1) There seems to be some notion that the time taken to restore is a
> crucial element of high availability, yet overall availability is
> controlled by the failure rate of unprotected elements and by mean time to
> repair, rather than by switching time. (A 1 second switch is less than
> 1/10000 of the generally accepted MTTR of 4 hrs)
> 
[Richard] In the draft, high availability refers to service availability.
In that respect, restoration is critical to service recovery.
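
For reference, the arithmetic behind the 1 second versus 4 hour comparison
quoted above can be checked with a short sketch. The function name is
hypothetical and the only figures used are the two from the quoted comment;
nothing here comes from the draft.

```python
def switch_to_repair_ratio(switch_seconds: float, mttr_hours: float) -> float:
    """Return the switch time as a fraction of the mean time to repair."""
    return switch_seconds / (mttr_hours * 3600.0)


# Figures from the quoted comment: 1 s switch time, 4 h MTTR.
ratio = switch_to_repair_ratio(1.0, 4.0)
print(f"1 s is 1/{round(1 / ratio)} of a 4 h MTTR")
print(ratio < 1 / 10000)
```

This only compares the two durations; it says nothing about how either one
contributes to service availability as the draft defines it.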

> 2) This draft seems to address the relatively simple problem of setting
> up the restoration path. It seems to completely ignore the much harder
> problem of allocating resources to the shared restoration path, and of
> actually locating the fault in an optical network to a single span in a
> time that is useful to restoration. 

[Richard] If I understand the comment correctly, you are referring to the
problem of path computation, which is a solved problem with many proposals
in the literature. It is also orthogonal to the notification problem.

Fault localization is also outside the objective of this draft. The fault
has to be localized and the fault information passed to a notification
mechanism. Localization itself takes a certain amount of time, as you
mentioned. Our hardware experts tell us it is doable within a few
milliseconds.

> It makes no mention of the
> inaccuracies in network planning databases, which make one wonder
> whether precomputation of restoration paths will actually lead to faster
> restoration times. 

[Richard] Restoration path computation relies on some amount of accuracy no
matter when it is done, whether before or after the fault. Since one is
using the same database in both cases, precomputation will lead to faster
restoration time.

> Finally, it seems to presuppose that a network
> operator would make such a facilities database available to route
> computation at all. The suggestion in sect 6.2 that the physical length
> of the fibers be available for route computation is very unlikely in any
> network I have ever worked on.

[Richard] In the past, with no need for such information it may have been
irrelevant to provide it. For time-bounded shared-mesh recovery, this
information will be needed. It will afford the operator the sophistication
and bandwidth savings that shared-mesh provides.

> 
> 3) One must wonder whether a flooding approach is actually best anyway.
> The assumption seems to be that a flooding protocol PDU can be forced
> onto the front of the send queue, thereby incurring minimum delay. 

[Richard] We know from implementations of our own and competing boxes that
this can be done. It is not central to the proposal, but it speeds up
restoration. Please refer to Appendix A.2 for a computation of the queuing
delays.
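
To illustrate what placing a notification PDU at the front of the send queue
means, here is a minimal sketch, assuming a simple two-level priority queue.
The class and PDU names are hypothetical and are not taken from the draft or
from any implementation. The sketch does not preempt a PDU that is already
being transmitted, which is the residual delay raised in the next comment.

```python
import heapq
import itertools


class SendQueue:
    """Hypothetical two-level send queue: notifications go ahead of other PDUs."""

    NOTIFICATION, NORMAL = 0, 1

    def __init__(self):
        self._heap = []
        # Sequence counter keeps FIFO order among PDUs of equal priority.
        self._seq = itertools.count()

    def enqueue(self, pdu, priority=NORMAL):
        heapq.heappush(self._heap, (priority, next(self._seq), pdu))

    def dequeue(self):
        return heapq.heappop(self._heap)[2]


q = SendQueue()
q.enqueue("routing-update-1")
q.enqueue("routing-update-2")
q.enqueue("fault-notification", SendQueue.NOTIFICATION)
print([q.dequeue() for _ in range(3)])
```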

> An
> additional assumption seems to be that there is only one fault in the
> network, and all bets are off if that is not true. There seem to be
> problems with both these assumptions. It seems to me that there are no
> mechanisms for truncating the PDU that is being sent, so there is a
> finite chance that a significant extra delay is incurred. Perhaps more
> serious is the assumption that all bets are off if there are multiple
> faults in the network. In general, multiple faults are those that lead
> to service outage. Two faults that do not interact, in that they do not
> contend for the same network resources, will be coupled by the flooding.

[Richard] Multiple faults that do not interact could be coupled only if they
occur within a time interval smaller than the delay of the flooding message
across the network diameter. Even in a large network, this implies that the
faults must occur less than a few hundred milliseconds apart.
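
The coupling condition above can be sketched as a small calculation: compute
when a flooded message reaches every node, and compare the gap between two
faults with the time the flood takes to cover the network. The topology, the
per-hop delay, and the function names below are assumptions chosen for
illustration only; they are not figures from the draft.

```python
import heapq


def flood_arrival_times(links, source):
    """Earliest arrival time (ms) of a flooded message at each node (Dijkstra)."""
    adjacency = {}
    for a, b, delay_ms in links:
        adjacency.setdefault(a, []).append((b, delay_ms))
        adjacency.setdefault(b, []).append((a, delay_ms))
    arrival = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        t, node = heapq.heappop(heap)
        if t > arrival[node]:
            continue  # stale heap entry, a faster copy already arrived
        for neighbour, delay_ms in adjacency[node]:
            candidate = t + delay_ms
            if candidate < arrival.get(neighbour, float("inf")):
                arrival[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return arrival


def faults_overlap(links, first_source, gap_ms):
    """True if a second fault gap_ms later occurs before the first flood completes."""
    return gap_ms < max(flood_arrival_times(links, first_source).values())


# Assumed example: 8-node ring, 5 ms per hop (propagation plus processing).
ring = [(i, (i + 1) % 8, 5.0) for i in range(8)]
print(max(flood_arrival_times(ring, 0).values()))  # 20.0
print(faults_overlap(ring, 0, gap_ms=10.0))        # True
print(faults_overlap(ring, 0, gap_ms=50.0))        # False
```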

In any case, please note that all bets are not off when it comes to FNP
conducting the notification. FNP will perform the notification irrespective
of the number of faults. With multiple faults, the timing bound may not be
guaranteed if the network was designed around the common single-fault case.
Nothing in the protocol itself prevents it from working with multiple
faults. Moreover, according to a major US carrier we talked to, multiple
faults may occur in less than 1% of fault cases.

SONET and other transport technologies only guarantee hard timing bounds in
the case of single failures. With proper planning, our approach affords a
better recovery procedure.

> In addition, unsuppressed restoration requests, which occur when the
> fault cannot be rapidly located to a single span, will also generate
> restoration messages. 

[Richard] Please refer to the earlier answer about localization.

> It also seems to me that routing changes may well
> start to be flooded at the same time scale as restoration activity is
> taking place. There is no mention of possible interactions with this.

[Richard] This is because the draft is implementation-agnostic. Not having
chosen the mode of flooding, we do not discuss interactions between routing
and this flooding mechanism. If routing is used for flooding, the
interaction is a non-issue.

> 4) Assuming that this problem is worth solving, and that a flooding
> protocol is the best solution, is it a good idea to generate yet
> another protocol that floods, and is LMP the vehicle of choice to embed
> yet another protocol? It seems to me that restoration has a strong
> interaction with routing change announcement, so it seems to me to make
> more sense to use those mechanisms rather than invent new ones.

[Richard] LMP was a proof-of-concept experiment as Vishal has mentioned.
Please refer to
draft-many-perf-flooding-based-fault-notification-experimental-00.  We've
been considering implementing FNP at the transport layer using a routing
protocol.

> 5) Until the effect of network database inaccuracies on the
> effectiveness of precomputed restoration is better understood, the
> problem of allocating resources in shared mesh networks is solved, and
> it is certain that all faults will be located to the correct span in a
> time useful to restoration, it seems to be premature to be proposing a
> solution to the final piece of the problem.
> 
[Richard] I believe we've answered each individual point in that sentence in
the previous paragraphs. Given those answers, none of them is a stumbling
block to the solution.
> 
> Regards
> 
> 	George
> 

Thanks a lot for the comments. 
Richard.