
Re: Control plane resiliency



Wow, Adrian, do you believe you were sufficiently rude to protect against
any future DOS attacks on the IETF process? I guess this is your idea of a
firewall.



See in-line.

 Igor



----- Original Message ----- 

From: "Adrian Farrel" <adrian@olddog.co.uk>

To: <ccamp@ops.ietf.org>

Sent: Monday, October 31, 2005 5:37 PM

Subject: Control plane resiliency

>
> Hi,
>
> It is nice to see some traffic on the CCAMP list.
>
> Some observations based on your discussions (in no particular order).
>
> 1. There is a distinction to be made between control channel resiliency
> and controller resiliency.
>
> Since the control channel uses IP, it will heal so long as there is IP
> connectivity (failing that you're dead anyway); the issue is the speed of
> convergence. Other solutions exist to provide control channel redundancy,
> but I believe that these are either implementation specific or are
> hardware specific. In either case, they are out of scope and (as Zafar
> says) the IP connectivity would remain.



IB>> Agree


>
> Controller resiliency is actually a debate about how to manage the data
> plane in the absence of a controller. Note that it is perfectly possible
> to lose your signaling controller without losing your routing controller,
> etc.



IB>> Here I'd like to summarize what we cannot do if one, say, LSP transit
controller goes out of service while the LSP's data plane is preserved
intact - a very possible situation in a GMPLS CP controlled transport
network:



  1. tear the LSP in the upstream or downstream direction;
  2. use PathErr notifications;
  3. modify any LSP attribute (some of these modifications could be useful
     even in this case), for example:
     a) ADMIN_STATUS (e.g. enable/disable data plane alarms);
     b) ASSOCIATION object (e.g. move VCAT group constituents from one
        group to another);
     c) SESSION_ATTRIBUTE (e.g. setup/holding priorities);
     d) POLICY_DATA object (e.g. signal a policy change);
     e) NOTIFY_REQUEST object (e.g. change Notify targets);
     f) etc.;
  4. gather alarm information;
  5. use restoration schemes that depend on hop-by-hop
     notification/switchover signaling;
  6. in the case of a P2MP LSP, graft or prune branches when the failed
     controller is located close to the root;
  7. in the case of a P2P or P2MP segment-protected LSP, add or remove
     local protection.


In summary - pretty much anything. One solution is make-before-break (mb4b)
rerouting. However, it may not be feasible: think of a ring topology, where
there could simply be no resources in the opposite direction of the ring.
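
IB>> To make this concrete, here is a rough sketch (Python; purely
illustrative, all names below are mine and not from any spec) of how little
in-band control of an LSP remains once a single transit controller is
unreachable:

    # Illustrative only: operations on an LSP that are blocked when one
    # transit controller along its path is down but the data plane is up.
    BLOCKED = {
        "teardown":     ["PathTear (downstream)", "ResvTear (upstream)"],
        "errors":       ["PathErr", "ResvErr"],
        "modification": ["ADMIN_STATUS", "ASSOCIATION",
                         "SESSION_ATTRIBUTE", "POLICY_DATA",
                         "NOTIFY_REQUEST"],
        "recovery":     ["hop-by-hop Notify", "switchover signaling"],
        "P2MP":         ["grafting", "pruning", "segment protection"],
    }

    def blocked_operations(lsp_hops, dead_controllers):
        # Any hop-by-hop signaled operation fails if any transit hop's
        # controller is down, even though traffic keeps flowing.
        if any(hop in dead_controllers for hop in lsp_hops[1:-1]):
            return BLOCKED
        return {}

    # blocked_operations(["A", "B", "C", "D"], {"B"}) -> everything above

The point, of course, is not the code but that every hop-by-hop signaled
operation shares the same single point of failure.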



Hence our choices are:

1) To say we cannot do anything because the GMPLS control plane emerged
from MPLS, where this problem does not exist - just wait for the controller
to be repaired and in the meantime manage the LSP via the management plane.

2) To do something about this: there might be solutions that address this
problem relatively easily. (I have at least two of them.)




>
> 2. There is no need to make a distinction between data plane technologies.
> That is, for GMPLS this applies equally to PSC and non-PSC.



IB>> Well, in IP, where the control plane is congruent with the data plane,
the situation I described does not exist, so I disagree.


>
> 3. Dimitri's observation that it may be worthwhile documenting the
> concerns and explaining why they are not issues (i.e. how they solve
> themselves) may be very valuable. It should suit all people because if the
> concerns are real then the attempt at documentation will expose them,
> while if the concerns are not real we will have an advisory document for
> all time.



IB>> Good advice.


>
> 4. *The* controller resiliency issue appears to be: How to manage and
> teardown an LSP downstream of a broken controller. I do not understand why
> the answer is not one of:
> a. Wait for the controller to be repaired
> b. Use the management plane
> It seems that in-place modification of GMPLS LSPs is not a common
> operational feature for an active LSP. This leaves us with teardown. For
> teardown, option a does not seem to be very painful, but if it does cause
> problems (and alarms *will* be raised) option b is available.



IB>> Not just that, see my comment above


>
> 5. The ability to manage the data plane resources at the node where the
> control plane has failed may still exist in the management plane (in
> which case use it!) or may also be gone (in which case, who cares that you
> can't free up the resources for another user since another user cannot
> make use of the resources). You *might* manage to claim that there is a
> safety issue here - anyone know what OSMINE says about loss of management
> plane connectivity to a device with lasers?
>
> 6. There may be a *separate* issue (as raised by Tom) about how to synch
> the control plane with the data plane. This might be particularly
> important after controller recovery. Such OAM features should be part of a
> separate effort (such as a GMPLS OAM requirements draft).
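
IB>> On the resynchronization point, for concreteness: a loose sketch
(Python; illustrative only, loosely modeled on the RFC 3473 restart idea,
all names are mine) of how a recovered controller might rebuild control
state from the surviving data plane when its upstream neighbor re-sends
Path carrying a Recovery_Label:

    # Surviving data plane state, re-read from hardware after restart:
    # (in_interface, in_label) -> (out_interface, out_label)
    xconnects = {("if0", 100): ("if1", 200)}

    control_state = {}  # session -> resynchronized LSP record

    def on_path_with_recovery_label(session, in_if, recovery_label):
        # The neighbor re-sends Path with, in a Recovery_Label object,
        # the label it last received from us in a Resv; we match it
        # against the cross-connects that survived the restart.
        key = (in_if, recovery_label)
        if key not in xconnects:
            return "PathErr"  # no matching cross-connect survived
        control_state[session] = {"in": key, "out": xconnects[key]}
        return "Resv"  # re-advertise the recovered label upstream
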
>
> 7. Lyndon points out that there is scope for a "standby signaling
> controller". This could mean two things.
> a. An implementation option (as John points out) including, but not
> limited to:
>    i. multiple instances of the software component
>    ii. multiple instances of a controller CPU
>    iii. control-plane message replication to distinct devices
> b. A protocol option where two entirely different "routers" manage the
> same data plane switch. This is pretty much outside our architecture
> (although we could handle a "replacement signaling controller" if it
> turned up with the same addresses). *If* someone wants to work on this
> idea, I suggest they need to develop the requirements and convince the
> community that what we have already doesn't provide everything we need
> anyway. (Not saying that the standby model is wrong, but am suggesting
> that we might not have a need for it at the moment.)
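
IB>> Just to illustrate option a.iii, since it is the one most often left
vague: a toy sketch (Python; entirely hypothetical, the address and message
format are mine) of a primary controller mirroring each state change to a
warm standby before acting on it locally:

    import json, socket

    STANDBY = ("192.0.2.2", 5000)  # hypothetical standby controller

    def replicate(event, session, state):
        # Mirror the change to the standby first, so that if the primary
        # dies mid-operation the standby never lags the data plane.
        msg = json.dumps({"event": event, "session": session,
                          "state": state}).encode()
        with socket.create_connection(STANDBY, timeout=1.0) as s:
            s.sendall(msg)

    # e.g., on installing a cross-connect for a new LSP:
    # replicate("xconnect-add", "lsp-42",
    #           {"in": ["if0", 100], "out": ["if1", 200]})
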
>
> 8. In a transport network, we must accept an operator's right to entirely
> eliminate soft state timeouts. That is, the failure of the control plane
> for any length of time must not impact the traffic (note that even though
> RSVP is a soft state protocol, the refresh period may be set so high - 32
> bits' worth of milliseconds - as to be practically infinite). Note
> specifically that the failure of Hellos is used in RFC 3473 to temporarily
> "turn off" the soft state nature of RSVP-TE.
>
>
> In summary:
>
> A. We have decided that observation 3 might be useful. Is anyone planning
> to work on this?
> B. We have not heard from Young since the start of this debate. I hope the
> hot words have not driven him away.
> C. Igor appears to be out on a limb here. While I appreciate him pushing a
> point that he believes in strongly, such determination in the face of WG
> consensus approaches a DOS attack (unintentional, I'm sure) on the IETF
> process. Please be careful.



IB>> Thanks for the advice, but being careful is not in my nature.

  1. About the DOS attack: there was a discussion about control plane
resiliency (which I didn't initiate); I mentioned that IMO the issue is
important, and when I was asked to give a specific example, I brought up the
problem of control-plane-partitioned LSPs. Since then I have sent only
responses to questions, some of them unrelated, such as "Can you really
keep state in the absence of refreshes?" and "Why can't your problem be
resolved via RSVP graceful restart?". Isn't this what the CCAMP mailing
list is for? Where do you see the DOS attack?
  2. About consensus: I was in constant discussion with three people: John,
Dimitri and Zafar. I definitely could not convince John; Dimitri said at
the end, "OK, it is time to write a draft, and we shall see"; and I seem to
have answered all of Zafar's questions, but am not sure whether I succeeded
in convincing him. Suppose none of them bought in; then, including you,
Adrian, there are four people who did not consider the problem significant
enough to work on a standardized solution. On the other hand, one person -
Neil - openly supported the work, which with me makes us two; plus, I got
some supportive private emails (including an invitation to submit a
contribution to Q14/15). So, Adrian, please define what you mean by
consensus. Maybe we should vote? I am pretty sure you would get the votes;
however, I don't think that as of yesterday one could call the situation a
consensus.


Igor


>
> Thanks,
> Adrian