RE: I-D ACTION:draft-soumiya-lmp-fault-notification-ext-01.txt
Hi Adrian,
Thanks a lot for your careful reading of the draft, and for
your comments.
Please see explanations in-line, and comments welcome!
-Vishal
> -----Original Message-----
> From: Adrian Farrel [mailto:adrian@olddog.co.uk]
> Sent: Wednesday, July 02, 2003 7:57 AM
> To: soumiya.toshio@jp.fujitsu.com; rabbat@alum.mit.edu
> Cc: ccamp@ops.ietf.org; thamada@fla.fujitsu.com; kanoh@jp.fujitsu.com;
> Vishal Sharma (E-mail 2)
> Subject: Re: I-D ACTION:draft-soumiya-lmp-fault-notification-ext-01.txt
>
>
> Hi,
>
> Two brief questions about this draft...
>
> 1. In section 3 you say
>
> [Optional]: If the receiving node has activated one or more recovery
> paths, it sends a RecoveryCompleteNotify message to either the egress
> nodes of the recovery LSPs or to the NMS. It continues sending
> RecoveryCompleteNotify messages periodically until it receives a
> RecoveryCompleteNotifyAck message or a timer to retry sending
> expires.
>
> ...Isn't this a change to the way LMP operates? That is, before this
> message, LMP is a neighbor-to-neighbor protocol.
I think the thought here was merely that it would be nice to have some way
of informing the NMS that a recovery action in response to a fault has
been completed. For that reason, it's just an optional action.
The draft currently does it using LMP extensions, merely for convenience
(which, as you rightly observe, requires an extension to LMP). However,
the recovering entity-to-NMS communication could, in fact, be implemented
using any of several communication mechanisms that the system designer
and/or service provider deems appropriate.
We thought this was a useful function, but we're open to hearing
others' opinions about it.
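To make the retry behavior quoted from section 3 concrete, here is a minimal sketch of "send periodically until an Ack arrives or the retry timer expires". The function name, the callback parameters, the default interval and retry window, and the FakeClock helper are all hypothetical illustrations, not anything defined in the draft.

```python
def notify_until_acked(send, ack_received, now, sleep,
                       interval=1.0, retry_window=5.0):
    """Resend a notification every `interval` seconds.

    Returns True if an Ack is seen, False if the retry window expires first.
    All parameter names and defaults are assumptions made for this sketch.
    """
    deadline = now() + retry_window
    while True:
        send()              # (re)transmit RecoveryCompleteNotify
        sleep(interval)     # wait one period for RecoveryCompleteNotifyAck
        if ack_received():
            return True
        if now() >= deadline:
            return False    # retry timer expired, give up


class FakeClock:
    """Deterministic clock so the sketch can be exercised without waiting."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, dt):
        self.t += dt
```

Passing the clock and the send/ack callbacks in as parameters keeps the loop independent of whether the peer is an egress node or the NMS.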
> 2. Fault repair
>
> I don't see anything in your draft that discusses fault repair. How
> does the reporting node revoke its fault report?
The reporting node will invoke "fault repair" using normal channels
(possibly just the routing protocol in use in the network). This is
because the "fault repaired" information isn't really time critical. Its
only purpose is to make sure (as you observe below) that repaired
resources don't remain stranded indefinitely.
I think the existing P&R drafts don't make explicit mention of
fault repair either, probably for this very reason. (They also assume,
like we do, that the "fault repair" indication is a background activity
that happens via normal channels.)
> This would appear to be a requirement otherwise repaired
> resources will never be made available again. I think that when
> you add this function and add the necessary controls to ensure
> that each node has the right state (fault or no fault) you will have
> invented a link state protocol.
Based on the explanation above, we don't really see the need to add
this functionality to a fault notification protocol of the type proposed in
draft-rabbat-fault-notification-protocol-03.txt.
So fault notification retains its lightweight character, as desired.
> At this point I don't see what LMP gives you that an existing
> link-state protocol doesn't already deliver. Certainly the
> speed of reporting of faults to every node in the network
> will be lost once you have to prevent "thrash" of fault and
> fault clear notifications.
Given that the "no fault" indication isn't time critical, as discussed
above, one still derives the benefit of lightweight fault notification
using LMP.
Some of the drawbacks of adapting a link-state routing protocol to
achieve such notification were discussed in an earlier thread,
in response to Roberto Albenese's email (which also pointed out
the difficulty of scaling a signaling-based notification scheme
when there are multiple LSPs/lambdas on a link that don't
share the same ingress/egress pairs, thus reducing the effectiveness
of bundling Notify messages).
See thread here:
http://ops.ietf.org/lists/ccamp/ccamp.2003/msg00682.html
http://ops.ietf.org/lists/ccamp/ccamp.2003/msg00683.html
http://ops.ietf.org/lists/ccamp/ccamp.2003/msg00684.html
Finally, the "thrashing" you refer to would be an issue even in
signaling-based notification. If you send a Notify message to a
repair point and then learn that the fault has been repaired,
what does a detecting entity do? I think it pretty much allows
the recovery action to proceed.
The assumption here is that there is a fault correlation phase
preceding the notification phase. If the fault correlation function
determines that a fault has occurred that requires the initiation of
a recovery action, I would think the action wouldn't really be
turned off midway. Rather, the recovery action would proceed, and,
upon learning of the fault repair (via normal channels), policy
would have to decide how to use the repaired resources.
> In any case, it is not clear to me that
> every node in the network needs rapid notification of faults -
> only those nodes that constitute repair points for the LSPs
> that used the failed resource need to know quickly and they
> hear about the problem through a directed Notify message
> that must propagate faster than any hop-by-hop protocol
> can.
Actually, that's an interesting point, probably worth discussing.
The flooding-based notification ensures that the fault notification
gets to the repair point in the minimum number of hops. (The signaling-based
notification allows the message to follow the "shortest path" to the
repair point, where the shortest path is based on metrics for plain IP
routing, which don't necessarily ensure that it gets there in the least
number of hops; it only ensures that it gets there following the "shortest
path" per the applicable metrics.)
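As a small illustration of the hop-count point, the sketch below compares the two on a made-up four-node topology (D is the detecting node, R the repair point; the node names and metrics are purely hypothetical). Flooding is modeled as a breadth-first search, assuming equal per-hop delay, and the signaled Notify as following the least-metric path.

```python
import heapq
from collections import deque

# Hypothetical topology: (node, node) -> IP routing metric.
LINKS = {("D", "A"): 1, ("A", "B"): 1, ("B", "R"): 1, ("D", "R"): 10}


def build_adjacency(links):
    """Turn the undirected link table into an adjacency list."""
    adj = {}
    for (u, v), metric in links.items():
        adj.setdefault(u, []).append((v, metric))
        adj.setdefault(v, []).append((u, metric))
    return adj


def flooding_hops(adj, src, dst):
    """Hops taken by the first flooded copy to reach dst (breadth-first search)."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for nbr, _metric in adj[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, hops + 1))
    return None


def routed_cost_and_hops(adj, src, dst):
    """(metric cost, hop count) of the least-metric path (Dijkstra)."""
    heap = [(0, 0, src)]
    done = set()
    while heap:
        cost, hops, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == dst:
            return cost, hops
        for nbr, metric in adj[node]:
            if nbr not in done:
                heapq.heappush(heap, (cost + metric, hops + 1, nbr))
    return None
```

In this topology the flooded copy reaches R over the direct D-R link in one hop, while the routed Notify takes the three-hop D-A-B-R path because its total metric (3) is lower than the direct link's (10).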
The second observation is that in shared mesh recovery, not only repair
points but all intermediate nodes along the recovery path need to learn
of the fault (so that they may reconfigure their cross-connects to carry
the working traffic coming down the recovery path).
Using signaling-based notification, this requires an additional 2- or 3-phase
handshake between the LSP end-points, which lengthens recovery time.
Additionally, there is the issue of scaling signaling when a large number
of LSPs are affected by a given fault (e.g. a fiber cut).
However, using controlled flooding-based notification, as proposed
in draft-rabbat, it is shown (in the draft) that by an adequate choice
of recovery paths, it's possible to ensure that all nodes along the
recovery path learn of the fault (on the working path) by the time the
end-nodes learn about it. This has two benefits: multiple nodes along
the recovery path that learn of the fault may reconfigure themselves in
parallel (saving time), and no additional handshaking is needed once the
repair point learns of the fault.
And, of course, there is the potential advantage that flooding allows for
(in the worst case) no more than one message per link per fault, whereas
signaling will require (in the worst case) a number of messages equal to
the number of LSPs to be sent by the detecting entity to the LSP repair
points.
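A back-of-the-envelope sketch of that worst-case comparison follows. The function names and the sample numbers are hypothetical, and bundling is assumed here to mean one Notify per distinct repair point, which is a simplification for illustration only.

```python
def flooding_messages_upper_bound(num_links):
    """Flooding: at most one message per link per fault."""
    return num_links


def signaling_notifies(lsp_repair_points, bundle=True):
    """Notify messages sent by the detecting entity for one fault.

    lsp_repair_points: one repair-point identifier per affected LSP.
    With bundling (assumed: one Notify per distinct repair point) the count
    drops only when LSPs share a repair point; otherwise it is one per LSP.
    """
    if bundle:
        return len(set(lsp_repair_points))
    return len(lsp_repair_points)


# Hypothetical fiber cut: 40 affected LSPs in a 12-link network.
all_distinct = [f"rp{i}" for i in range(40)]   # no shared repair points
mostly_shared = ["rp0"] * 30 + ["rp1"] * 10    # two repair points in total
```

With no shared repair points, bundling does not help and the detecting entity sends 40 Notify messages, against an upper bound of 12 flooded messages; when the 40 LSPs share just two repair points, bundled signaling needs only two.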