
RE: [Fwd: Re: [RMONMIB] I-D ACTION:draft-ietf-rmonmib-raqmon-pdu-08.txt]



 

> -----Original Message-----
> <soapbox>
> Why does it seem like every couple years the RMON WG pushes
> the SNMP envelope, and keeps running into "CLR roadblocks"?
> The standards are supposed to serve users, not the other
> way around.  It seems to me that any effort spent
> devising detailed rules around SMI usage (to prevent users
> from "hurting themselves") is totally pointless, especially
> in the absence of any real evidence of a problem to solve.
> Here's a litmus test: What operational problems are being
> solved by preventing somebody from defining a table of
> accessible-for-notify objects?  Can't think of any?  Then 
> lose the CLR!
> </soapbox>

Overall, I really hate discussions about CLRs atop CLRs atop CLRs. I
feel like we're navel-gazing rather than doing something productive to
make SNMP a more usable protocol for operators.

However...

I have a concern that tables full of accessible-for-notify objects
obscure the fundamental trap-directed-polling philosophy of SNMP.
I think doing this is bad practice.

I do not like what the RAQMON MIB does; it should send a simple single
notification to the manager saying it has some information for the
manager, and then let the manager poll for the rest of the data. Dan's
argument is that the devices are very limited and sending the
notification is simple; Marshall's Simple Book seems to disagree that
an event-driven approach is simple. SNMP is used for RAQMON because
it is already on the device. Well, if it's on the device
already, it probably supports polling already, so using the polling
approach should not be detrimental. If the goal is real-time reporting
of events, I don't feel comfortable that using SNMP this way is a wise
choice.

In a RAQMON system of many IP phones, all sending large notifications
to a collector, will the collector be able to keep up? How many phones
can one collector handle before becoming swamped? If trap-directed
polling were used, the collector would only need to process a simple
trap and to queue up the request to poll for more information; it can
choose its timing rather than constantly being forced to stop
everything to handle the interrupt. With traps, the OS pre-emptively
takes control from the application; with trap-directed polling, the
application retains better control over the context switching. SNMP is
not really well designed for real-time event-driven management; a
stream-based, session-based protocol like LFAP (in the IPFIX WG) would
seem a much better approach.
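
As a rough illustration of the trap-directed polling described above,
here is a minimal Python sketch. It is not taken from any real manager:
the class name, the `poll_agent` callable and the `budget` parameter are
all hypothetical, and no actual SNMP library is used. The point is only
that the trap handler does almost nothing, while the application decides
when and how much to poll.

```python
from collections import deque


class TrapDirectedCollector:
    """Toy collector: a trap only marks an agent as needing a poll."""

    def __init__(self, poll_agent):
        # poll_agent is a hypothetical callable: agent address -> polled data.
        self.poll_agent = poll_agent
        self.pending = deque()   # agents waiting to be polled, in arrival order
        self.queued = set()      # agents already in the queue (de-duplication)
        self.results = {}        # last polled data per agent

    def on_trap(self, agent):
        # Cheap handler: no varbind processing, just remember whom to poll.
        if agent not in self.queued:
            self.queued.add(agent)
            self.pending.append(agent)

    def poll_some(self, budget):
        # Called when the application chooses; polls at most `budget` agents
        # and returns how many were actually polled.
        polled = 0
        while self.pending and polled < budget:
            agent = self.pending.popleft()
            self.queued.discard(agent)
            self.results[agent] = self.poll_agent(agent)
            polled += 1
        return polled
```

Repeated traps from the same phone collapse into a single pending poll,
so a burst of events costs the collector one queue entry per agent
rather than one large varbind list per event.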

Dan tells me that Bert, Steve Waldbusser and Andy all have accepted
this approach for RAQMON. So be it. I don't care enough about the
RAQMON case to go to the RMON WG and challenge it.

But until RAQMON becomes widely deployed in real-world networks, with
real-world applications handling this load, I would not like to change
the guidelines to recommend, or even imply a recommendation, for such
an approach. Real-world experience argues that this approach may not
be scalable.

Adding text to the guidelines saying "this is how to build tables of
accessible-for-notify objects" implies this is acceptable practice.
If this is ever published as a BCP, that implies it is a BEST current
practice. I really feel uncomfortable with anything that encourages
this practice. I would prefer to not make such a change at all, and to
generally discourage the practice.

Part of my reluctance comes from experience with Spectrum, a full-blown
platform normally capable of managing tens of thousands of SNMP
agents, where one customer decided to use SNMP traps for event-driven
management and totally overwhelmed the application with notifications.

Rate limiting a la RFC 3413 might have protected Spectrum if only one
device had needed to be rate limited. But the problem wasn't one agent
sending too many traps. Spectrum's customer configured the network to
send lots of traps to Spectrum from multiple agents, which they had
designed themselves, with large varbind lists.

Each notification interrupted Spectrum processing, as expected. Each
trap caused the creation of a thread to process the trap.
This worked fine in a normal SNMP environment with tens of thousands
of devices sending small notifications to direct polling activities.
The problem is that the agents, not being aware of the impact they
would have on the network and the application, together sent hundreds
of traps per second, trying to report events in real-time. Traps came
in so fast and each trap required so much processing time to handle
the large list of varbinds that the threads kept being interrupted;
the system was so bogged down responding to interrupts and creating
new threads for new traps it never had time to actually finish
processing the traps already received. Ultimately it ran out of thread
space and stopped creating new threads, but still could never get back
to processing the already-received traps because it was constantly
being interrupted. 
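
To make the failure mode concrete, here is a small Python sketch of one
way a receiver could bound its work instead of creating a thread per
trap. This is not how Spectrum worked or was fixed; the class name, the
`handle` callable and the `workers` and `backlog` parameters are
assumptions made up for illustration. It uses a fixed pool of worker
threads and a bounded queue, and counts traps that arrive when the
backlog is full rather than letting them pile up without limit.

```python
import queue
import threading


class BoundedTrapReceiver:
    """Toy receiver: fixed worker pool plus a bounded backlog of traps."""

    def __init__(self, handle, workers=4, backlog=1000):
        self.handle = handle                      # hypothetical per-trap handler
        self.q = queue.Queue(maxsize=backlog)     # bounded backlog
        self.dropped = 0                          # traps refused because backlog was full
        self.threads = [
            threading.Thread(target=self._work, daemon=True)
            for _ in range(workers)
        ]
        for t in self.threads:
            t.start()

    def receive(self, trap):
        # Assumes a single receiving thread calls this, so the counter
        # needs no lock. Never blocks and never spawns a thread.
        try:
            self.q.put_nowait(trap)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _work(self):
        # Each worker finishes one trap before taking the next.
        while True:
            trap = self.q.get()
            try:
                self.handle(trap)
            finally:
                self.q.task_done()

    def drain(self):
        # Block until every queued trap has been handled.
        self.q.join()
```

Bounding the backlog only limits the damage. It does not make large
real-time notifications cheap to process, which is the underlying
concern here.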

What was needed was to educate the customer that SNMP is not designed
to be used that way, and to have them use trap-directed polling
instead. This solved the problem.

Maybe this is not really a problem any longer, but the fact that
Marshall discusses this in his book makes me believe that SNMP was
designed to use trap-directed polling for a very good reason.

We should recognize that SNMP was designed to use trap-directed
polling, and changing SNMP to be event-driven could be a serious
design issue. If the majority of MIB Doctors, especially those with
manager-side experience and not just agent-side experience, believe
this is not a problem and event-driven management using tables of
notifications is scalable, then I'll shut up. But I think it is
important to discuss real world experience with this approach rather
than remaining quiet just so the RMON WG won't feel we're constructing
CLR roadblocks.

David Harrington
dbharrington@comcast.net