
Re: draft-rabbat-fault-notification-protocol-04.txt



Hi Richard,

Thanks for your reply - a couple of points inline (with some snipping...)


Richard Rabbat wrote:


Hi George,

You've probably had time to review Vishal's explanations by now. Comments to
the items you raised inline.



-----Original Message-----
From: owner-ccamp@ops.ietf.org [mailto:owner-ccamp@ops.ietf.org] On Behalf
Of George Newsome
Sent: Tuesday, February 24, 2004 5:41 PM
To: ccamp@ops.ietf.org
Subject: Re: draft-rabbat-fault-notification-protocol-04.txt
..........


2) This draft seems to address the relatively simple problem of setting
up the restoration path. It seems to completely ignore the much harder
problem of allocating resources to the shared restoration path, and of
actually locating the fault in an optical network to a single span in a
time that is useful to restoration.


[Richard] If I understand the comment correctly, you are referring to the
problem of path computation, which is a solved problem with many proposals
in the literature. It is also orthogonal to the notification problem.

[George] Precomputation of shared restoration paths is a distributed computation problem. It's hard to see this being done without a new protocol or significant modifications to existing protocols. I am not aware of a plethora of solutions to this particular problem.

The fault localization problem is also different from the objective of this
draft. Localization of the fault has to occur and the fault information
transmitted to a notification mechanism. The localization problem itself
takes a certain amount of time as you mentioned.  Feedback from our hardware
experts says that it's doable in the range of a few milliseconds.

[George] I think you are confusing detecting a fault with locating the fault. The first is fast and easy, while the second is somewhat more interesting. To make the most efficient use of restoration resources, you really need to localize a fault to a single link. Whether this can be done quickly is very technology dependent.


It makes no mention of the
inaccuracies in network planning databases, which make one wonder
whether precomputation of restoration paths will actually lead to faster
restoration times.


[Richard] Restoration path computation relies on some amount of accuracy no
matter when it is done, whether before or after the fault. Since one is
using the same database in both cases, precomputation will lead to faster
restoration time.

[George] Actually, the database used for precomputation of restoration is the planning database, which knows about diversity and is frequently incorrect (at least according to the operators I spoke with quite a while ago, when I was a strong proponent of precomputed restoration). It is not the routing database, which has accurate topology but no knowledge of diversity.


Finally, it seems to presuppose that a network
operator would make such a facilities database available to route
computation at all. The suggestion in sect 6.2 that the physical length
of the fibers be available for route computation is very unlikely in any
network I have ever worked on.


[Richard] In the past, with no need for such information it may have been
irrelevant to provide it. For time-bounded shared-mesh recovery, this
information will be needed. It will afford the operator the sophistication
and bandwidth savings that shared-mesh provides.

[George] Restoration has been implemented in several technology generations without the need for this degree of detail. Operators tend to know how they run their networks and are very reluctant to make this sort of change. I think you are being overly optimistic.


3) .................


An
additional assumption seems to be that there is only one fault in the
network, and all bets are off if that is not true. There seem to be
problems with both these assumptions. It seems to me that there are no
mechanisms for truncating the PDU that is being sent, so there is a
finite chance that a significant extra delay is incurred. Perhaps more
serious is the assumption that all bets are off if there are multiple
faults in the network. In general, multiple faults are those that lead
to service outage. Two faults that do not interact, in that they do not
contend for the same network resources, will be coupled by the flooding.


[Richard] Multiple faults that do not interact could be coupled if they
occur in a time interval which is smaller than the delay of the flooding
message across the network diameter. Even in a large network, this implies
faults must occur closer than a few 100 ms apart.


In any case, please note that all bets are not off when it comes to FNP
conducting the notification.  FNP will achieve the notification irrespective
of the number of faults.  In the case of multiple faults, the timing bound
may not be guaranteed, if the common case one designs for by using FNP is
for a single fault. There is no restriction in the protocol itself not to
work with the assumption of multiple faults.  Moreover, multiple faults may
occur in less than 1% of the fault cases according to a major US carrier we
talked to.
SONET and other transport technologies only guarantee hard timing bounds in
the case of single failures. Our approach affords us a better recovery
procedure with proper planning.
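
As a rough illustration of the coupling condition described above: two otherwise independent faults can interact through flooding only if the second occurs before the notification for the first has crossed the network diameter. The following is a minimal sketch, assuming a uniform per-hop delay; the function names and the 20-hop, 10 ms figures are hypothetical and do not come from the draft.

```python
def flooding_delay_ms(diameter_hops: int, per_hop_delay_ms: float) -> float:
    """Worst-case time for a flooded notification to cross the network.

    Assumes (hypothetically) the same delay on every hop.
    """
    return diameter_hops * per_hop_delay_ms


def faults_coupled(t_first_ms: float, t_second_ms: float,
                   diameter_hops: int, per_hop_delay_ms: float) -> bool:
    """True if the two faults fall inside one flooding window."""
    gap_ms = abs(t_second_ms - t_first_ms)
    return gap_ms < flooding_delay_ms(diameter_hops, per_hop_delay_ms)


if __name__ == "__main__":
    # Hypothetical figures: 20-hop diameter, 10 ms per hop -> 200 ms window.
    print(faults_coupled(0.0, 150.0, 20, 10.0))  # True: inside the window
    print(faults_coupled(0.0, 500.0, 20, 10.0))  # False: outside the window
```

Under these assumed figures the window is 200 ms, which is in line with the "few 100 ms" estimate quoted above. It does not model a single cable cut that raises alarms on many systems at once.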


In addition, unsuppressed restoration requests, which occur when the
fault cannot be rapidly located to a single span, will also generate
restoration messages.


[Richard] Please refer to earlier answer about localization

[George] The "all bets are off" remark referred to the statement that restoration times are no longer ensured when there are multiple faults. This actually occurs when a large fiber cable is cut and every system on that cable issues alarms. As the cable usually supports independent line systems, correlation across these systems is rarely, if ever, done. The result is a storm of restoration requests. If your reference operator suggested otherwise, I suspect that a different question was being inadvertently answered.



5) Until the effect of network database inaccuracies on the
effectiveness of precomputed restoration is better understood, the
problem of allocating resources in shared mesh networks is solved, and
it is certain that all faults will be located to the correct span in a
time useful to restoration, it seems to be premature to be proposing a
solution to the final piece of the problem.


[Richard] I believe we've answered each individual point in that sentence in the previous paragraphs. Given that, none are a stumbling block to the solution.

[George] I suppose it depends on what you consider a stumbling block!

Thanks for your thoughts - I also look forward to your reply to Deborah.

Regards

George