
RE: Standards for IP stats collection? (corrected)



Hi Richard, thanks for your comments....my response below:

<snipped>
> I think (but am not quite sure from your wording) that you find the goal
> of separating failure analysis at the physical/link layers from
> availability & QoS at the IP layer unattainable.
NH=> Sorry, I was not clear, Richard.  One *can* separate availability and
QoS for CO entities.  How easy this is in practice depends on whether one
starts with a clear architectural understanding of the nature of the
problem, and thus ensures that functional measurements (of defects) are
based on pragmatic OAM tools.  Compare how difficult availability
measurement proved for ATM, where after 10 years of study it could not be
done: it rested on some fundamental architectural errors of judgement at
the outset (and, BTW, placed too much emphasis on the I.356 QoS metric
wish-list that would never be cost-effective to measure), which made the
problem too complex to solve, vs what is in our recent MPLS draft.
With a CNLS layer the problem is not clear-cut, due to the intrinsic
dependency of the user-plane on a stable/valid control-plane.....this is
*not* necessarily a weakness, nor should it be viewed as such (indeed it is
a strength for *certain* types of traffic).  The simple fact is that CO
networks and CNLS networks behave differently.  In the former we have the
opportunity to separate availability and QoS; in the latter we can't (at
least not within the network, via ad hoc partitions of some HRX -
hypothetical reference connection), except on an end-end basis.  The basic
behavioural difference is that on failure:
-	CO fabrics only affect the customers whose traffic is carried by the
failed trail(s), ie this traffic dies, and other non-affected customer
traffic sees no availability/QoS deterioration;
-	CNLS fabrics mutate failures into QoS hits, since the overall traffic
demand is not constrained post-failure (see the toy sketch below).
DiffServ then skews this relationship, noting that there is no relationship
between a traffic's survivability requirements and its up-state QoS
requirements: eg voice must go EF for its delay/jitter transfer
requirements, but whether a *given* voice traffic source is mission
critical (ie must survive on failure) or not cannot be inferred from the DS
codepoints alone.  The same applies to the AF classes, ie there is no
QoS/survivability relationship.
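
To make the behavioural split concrete, here is a minimal numerical sketch
in Python; the customer names, loads and capacities are invented purely for
illustration, not taken from any measurement:

# Toy contrast of failure behaviour in CO vs CNLS fabrics.
# All loads/capacities are invented purely for illustration.

flows = {"cust_A": 40, "cust_B": 30, "cust_C": 30}  # offered load, Mbit/s
capacity_after_failure = 70                         # Mbit/s left post-failure

# CO fabric: suppose the failed trail carried only cust_C's traffic.
# cust_C sees an availability event; cust_A/cust_B see no QoS change.
co_impact = {f: ("down" if f == "cust_C" else "no change") for f in flows}

# CNLS fabric: the demand is not constrained post-failure, so the whole
# (unchanged) demand is squeezed over the reduced resource and the failure
# mutates into a QoS hit shared by *all* customers.
total_demand = sum(flows.values())
loss = max(0.0, (total_demand - capacity_after_failure) / total_demand)
cnls_impact = {f: "~{:.0%} loss/queueing pressure".format(loss) for f in flows}

print("CO  :", co_impact)
print("CNLS:", cnls_impact)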


> My view is that these functions of network state are 
> inter-related and to seek
> to make them
> independent is quixotic.
> Here is a view:
> The various layers below IP will have outage phenomena. These 
> are measurable,
> albeit imperfectly.
> Both control-plane and data-plane will have outage phenomena. 
> These are
> measurable, I think.
NH=> Yes they are.  The key issue is whether one affects the other.  In
CNLS fabrics this is an intrinsic behaviour; in CO fabrics it does not have
to be.  For example, in plain IP the user-plane is effectively 'dead' for
as long as the IGP is incorrectly converged (eg post-failure), whereas in
ER-MPLS or GMPLS (for example) there can be disjointly routed
user/control-planes (indeed, in GMPLS the control-plane will use a
logically separate network from the user-plane, whose only survivability
design cues *must* come from the 'duct' layer), and in particular a key
operator requirement here is that failures of the control-plane *must not*
affect the user-plane.  A rough sense of the scale of the difference is
sketched below.
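
The following back-of-envelope sketch illustrates the point; every timing
figure in it is an assumption chosen for illustration, not a measured or
standardised value:

# Back-of-envelope user-plane outage for a link failure under three
# recovery models.  All timing figures are assumptions for illustration.

detect_ms       = 50     # failure detection (loss of signal / hello loss)
igp_converge_ms = 5000   # IGP flood + SPF + FIB update across the domain
local_switch_ms = 50     # switchover to a pre-established backup ER-LSP

# Plain IP: the user-plane is effectively 'dead' until the IGP has
# correctly re-converged.
plain_ip_outage_ms = detect_ms + igp_converge_ms

# ER-MPLS/GMPLS with a disjointly routed backup: recovery is local and
# does not wait on control-plane convergence.
er_lsp_outage_ms = detect_ms + local_switch_ms

# GMPLS with a logically separate control network: a pure control-plane
# failure must cause *no* user-plane outage at all.
cp_failure_outage_ms = 0

print("plain IP outage      : ~%d ms" % plain_ip_outage_ms)
print("backup ER-LSP outage : ~%d ms" % er_lsp_outage_ms)
print("GMPLS c-plane failure: %d ms (user-plane unaffected)" % cp_failure_outage_ms)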

> ("Outage" includes route stability loss during convergence, 
> for conventional IP
> routing. Ex.: Vern Paxson's 1996 paper on internet routing behavior.)
NH=> Yes, but that is not the whole story.  The point is that
post-failure/re-convergence there is no reduction in the traffic offered,
but rather a sharing of reduced network resource over the same traffic
demand.  It is this fact (coupled with the harder-to-measure initial outage
event) that makes it genuinely difficult to separate availability and QoS
specification.  Indeed, this looks quite problematic for VPNs *if* we do
not have the ability to decouple VPN topological/resource survival (of
LSPs) from basic forwarding treatment (of packet aggregates from multiple
different VPNs).
 
> MPLS will have separate, but commensurate outage behavior 
> (despite fast
> reroute).
NH=> True for LSPs based on an IGP, eg LDP.  Not necessarily true for
ER-LSPs based on RSVP or CR-LDP.  And certainly not true for GMPLS when we
consider SDH/OTN fabrics.

> All of this feeds into the analysis of QoS at the IP layer and above.
NH=> True....but I would put it a different way.  A client layer network
*inherits* the performance of *all* the server layers below it....and then
adds its own performance impairments, which are relevant only to the client
layer considered.  This relationship recurses upwards.  The inheritance is
not linear in some cases: eg a microsecond/millisecond error event in SDH
can create several seconds of outage in some client layers.  The actual
behaviour depends on the robustness of the client layer framing, and
especially on the server->client adaptation functions; eg with ATM as a
client of SDH, an error burst can create a loss of cell delineation event,
which can lead to a variable ATM layer outage depending on which OAM
functions are active (and, in some cases, on user traffic activity).  A toy
version of the recursion is sketched below.
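
Here is that recursion as a toy calculation; the amplification factor and
impairment figures are invented to show the shape of the non-linearity, not
to quantify any real SDH/ATM behaviour:

# Toy recursion of client-layer performance inheritance.
# All figures are invented; the point is the shape, not the numbers.

def client_outage_s(server_event_s, amplification, own_impairment_s):
    """Outage a client layer sees for one server-layer event: the
    inherited event (possibly amplified by fragile framing/adaptation)
    plus the client layer's own impairments."""
    return server_event_s * amplification + own_impairment_s

sdh_event_s = 0.001  # a ~1 ms SDH error burst

# If the burst breaks ATM cell delineation, recovery can take seconds:
# the inheritance is strongly non-linear (amplification >> 1).
atm_outage_s = client_outage_s(sdh_event_s, amplification=3000,
                               own_impairment_s=0.5)

# The relationship recurses upwards to ATM's own clients (eg IP over ATM).
ip_outage_s = client_outage_s(atm_outage_s, amplification=1.0,
                              own_impairment_s=0.2)

print("SDH event %.1f ms -> ATM outage %.1f s -> IP outage %.1f s"
      % (sdh_event_s * 1000, atm_outage_s, ip_outage_s))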

> QoS is both an IP connection availability (reachability) 
> issue, as well as a
> packet loss, delay etc. issue.
> Up/down state transitions can generally cause both (a) 
> reachability loss, and
> (b) data loss.
> So, *given adequate measurement and characterization* of the 
> phenomena,
> I don't see it as impossible. If this is naive, why?
NH=> Yes, you can measure both IP availability and QoS, but you can do so
reliably/consistently only at the end-points, ie the true source/sink of
the IP 'connection' one is considering.  We cannot partition that end-end
IP 'connection', since any arbitrary partitions we create cannot be
regarded as permanent, ie the routing can change post-failure (or indeed
whenever new routes are injected into the IGP/BGP).  This makes it
impossible to define a global HRX with end-end metrics/objectives
apportioned (on some basis, eg proportional to length) amongst constituent
subnetwork partitions....thus we cannot realistically relate network
availability/QoS to individual customers on a subnetwork/partitioned
basis.  The worked example below shows why.
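
A small worked example of the apportionment problem; the objective,
partition names and lengths are all invented for illustration:

# Why apportioning an end-end objective over subnetwork partitions breaks
# for CNLS.  Objective and lengths are invented for illustration.

e2e_unavailability = 0.0005  # end-end annual unavailability objective

def apportion(objective, partitions):
    """Split an end-end objective amongst partitions proportional to
    their lengths (one possible apportionment basis)."""
    total = float(sum(partitions.values()))
    return {p: objective * l / total for p, l in partitions.items()}

# The route as designed: two partitions, 400 km and 600 km.
print("before:", apportion(e2e_unavailability, {"P1": 400, "P2": 600}))

# Post-failure (or when new routes are injected into the IGP/BGP) the
# same 'connection' now crosses different partitions, so the original
# apportionment no longer describes what the customer's traffic actually
# experienced, and the per-partition objectives cannot be related back
# to individual customers.
print("after :", apportion(e2e_unavailability, {"P1": 400, "P3": 900}))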
> 
> >
> > 3       Now we can consider QoS.....but be careful.  It 
> costs money to
> > measure/collect/process (in OSS) these....every technology 
> I have been
> > involved with has started with a large metric wish-list 
> that gets whittled
> > down to something more pragmatic later.  My advice is that 
> these should be
> > of 2 types of QoS metric collection: (i) ad hoc 'sw-on' function for
> > trouble-shooting as needed by operational people, or the continuous
> > measurement of 'important' paths, and (ii) general network 
> population
> sampling (to get overall network trends and spot latent anomalous
> > behaviour).
> 
> Yes. The problem as I see it here is that many of the 
> measurement stds impose
> invasive problems on the network. RMON and general polling 
> via SNMP are
> examples.
NH=> The overriding concern I have as an operator is the measurement
cost/benefit ratio.  I have seen (too many times) over-enthusiastic and
unrealistic QoS measurement requirements stated (usually by those who do not
have to pay for it!) *and* a lack of attention to what is really important
to operators/customers, eg is it 'working'? can we detect/diagnose defects?
can we clearly articulate/measure availability?
> 
<snipped>
> > BTW - We (ie me/Shahram Davari/Ben Mack-Crane/Peter Willis) 
> have just posted
NH=> I should also have added Hiroshi Ohta (NTT) to the author
list....sorry, Hiroshi, for missing you off initially.
> > an ID for MPLS user-plane which deals with 1 and 2 above.
> 
> I would be interested in a link to your draft.
NH=> http://www.ietf.org/internet-drafts/draft-harrison-mpls-oam-00.txt
Note that some of the more complex diagrams are missing from this
version....if you want to see them, let me know off-list (you may need them
to properly understand the flow charts of near/far-end defect processing of
LSPs and how to distinguish a short break from an unavailability event; a
rough sketch of the latter distinction is below).  I have also noted
several typos that will need correcting when the ID gets up-issued.
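
For a rough sense of the short-break vs unavailability distinction, here is
a sketch assuming an ITU-style entry/exit rule (unavailable time begins
after N consecutive defective seconds and ends after N consecutive good
ones, with N=10 assumed); the draft's actual near/far-end defect processing
is more involved, so treat this purely as an illustration:

# Classify per-second defect states into available ('A') / unavailable
# ('U') time.  N and the entry/exit rule are assumptions, not the draft.

N = 10  # consecutive-seconds threshold (assumed)

def classify(bad_seconds):
    """Label each second 'A' (available) or 'U' (unavailable)."""
    labels, unavailable, run = [], False, 0
    for bad in bad_seconds:
        run = run + 1 if bad != unavailable else 0  # contradicts state?
        labels.append("U" if unavailable else "A")
        if run >= N:  # flip state and retroactively relabel the run
            unavailable = not unavailable
            labels[-N:] = ["U" if unavailable else "A"] * N
            run = 0
    return "".join(labels)

# A 3 s break is a short break: it stays inside available time.
print(classify([False]*5 + [True]*3 + [False]*12))
# 12 s of defect opens an unavailability event, closed again only after
# N consecutive good seconds.
print(classify([False]*5 + [True]*12 + [False]*12))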

<snipped to end>