
Re: comments on draft-ietf-psamp-sample-tech-02.txt



Hi Maurizio,

Thanks for your comments.
See inline.
 Hi Benoit,
see my comments inline.
I erased the points where I don't have comments, meaning that I agree with your suggestions.
Regards,
Maurizio

Benoit Claise wrote:

Dear all,

Here is a list of comments on the sampling and filtering techniques version 2 draft.
As always, feel free to start a new thread on a specific topic discussed below, with a new email subject.


4.

Section: terminology
I'm wondering why some terms are not copied over from the draft-ietf-psamp-framework-03.txt.
For example, the observation point which is referenced more than once in the draft.
For example, the observed packet stream, which is quite essential but never referred to in this draft
   (see one of my remarks below about it)
etc...
So why not copy over the entire section?

MM: we preferred not to do it, but if a misalignment is found, as you did, it may be better to copy, as you suggest...
My experience with IPFIX is that keeping the terminology consistent across the different drafts is a painful and lengthy process...
I'm just trying to help by giving a simple procedure.



11.
Section: Scope and Deployment of Packet Selection Techniques
 Note that a common technique to select packets is to compute a hash  function on some bits of the packet header and/or content and to  select it if the result falls in a certain selection range. Since  hashing is a deterministic operation, it is a powerful mean to ensure  that the same packets are selected at multiple measurement points.  Depending on the chosen input bits, on the hash function and on the  selection range, this technique could also be used to emulate the  random selection of packets with a given probability p. Hashing is  then a particular type of filtering, but can also be used to emulate  random sampling.
I would rewrite this with the terminology section in mind: hash-based selection, hash domain, hash range, hash function, hash selection range

Something like
 Note that a common technique to select packets is to compute a Hash  Function on the Hash Domain (some bits of the packet header and/or content) and to  select it if the Hash Range

MM: No, if you look in the terminology Hash Range means Hash co-domain, i.e. the range of values the hash can take. So here you must say something like "hash value", or "hash result" as in the original text.
You are right! Remember I put "something like" ;)
But the initial point remains valid: please rewrite this with the terminology section in mind


falls in the Hash Selection Range. Since  hashing is a deterministic operation, it is a powerful mean to ensure  that the same packets are selected at multiple measurement points.  Depending on the chosen input bits of the Hash Domain, on the Hash Function and on the  Hash Selection Range, the Hash-based selection could also be used to emulate the  random selection of packets with a given probability p. Hashing is  then a particular type of filtering, but can also be used to emulate  random sampling.
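
The hash-based selection described in this point can be sketched as follows. This is only an illustration under stated assumptions: SHA-256 truncated to 32 bits is a stand-in for whatever Hash Function the draft would recommend, and the names hash_value, selection_bound and select_packet are hypothetical. The Hash Selection Range is assumed to be the interval [0, bound) inside the Hash Range.

```python
import hashlib

HASH_RANGE_BITS = 32  # assumed Hash Range: integers 0 .. 2**32 - 1


def hash_value(hash_domain: bytes) -> int:
    """Map the Hash Domain (chosen packet bytes) to a value in the Hash Range."""
    digest = hashlib.sha256(hash_domain).digest()  # stand-in Hash Function
    return int.from_bytes(digest[:4], "big")


def selection_bound(p: float) -> int:
    """Upper bound of a Hash Selection Range [0, bound) covering a fraction p of the Hash Range."""
    return int(p * 2**HASH_RANGE_BITS)


def select_packet(hash_domain: bytes, p: float) -> bool:
    """Select the packet if its hash value falls in the Hash Selection Range."""
    return hash_value(hash_domain) < selection_bound(p)
```

Because the decision depends only on the input bytes, two measurement points that apply the same function and the same selection range to the same invariant packet bytes make the same decision. With a well-mixing function, roughly a fraction p of distinct inputs is selected, which is the sense in which hashing can emulate random sampling.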

13.
Section: Scope and Deployment of Packet Selection Techniques
 We consider packet selectors as part of an IPFIX metering process  which also can use the IPFIX exporting process. This is expressed as  association to one or more IPFIX processes.
I think this notion above is essential but shouldn't it be part of the framework draft draft-ietf-psamp-framework-03.txt instead of this draft?

MM: I think this is right, but also a delicate point. I think we should avoid suggesting that packet sampling is ONLY a support for flow monitoring.
I agree, this should  be clarified in the FW draft.



18.
Section: 3.1.2.2.3      Non-Uniform flow State dependent sampling
 Another type of sampling that can be classified as Non-Uniform _(and,  possibly, probabilistic)_ is closely related to the flow concept as  defined in [QuZC02], ...

I don't understand "(and, possibly, probabilistic)"  because we are already under the probabilistic sampling chapter 3.1.2.2

MM: actually you're right, the "flow state dependent sampling" should always be "probabilistic", in the sense I explain below.
"Flow state dependent sampling" should help a system with scarce (fast) memory resources to "intelligently" select the
packets that can create/update a flow record, because there may not be enough memory to hold a new flow record.
A possible "rule" (or algorithm) for this packet selection can be:
- if a packet accounts for a flow record that already exists, select the packet (i.e. simply update the flow record)
- if a packet doesn't account for any existing flow record, select it with probability p and create a new flow record for it.
This is actually the algorithm called "sample and hold" in the cited reference [EsVa01], but other algorithms may be implemented.
But any algorithm, even one without an explicit "probabilistic" concept, can never guarantee deterministically (i.e. with p=1)
that the packet will be selected, because there may be no room to create a new flow record for it. This is how I would modify section 3.1.2.2.3.

 Another type of sampling that can be classified as probabilistic Non-Uniform  is closely related ...[same text as now].....Packets are selected dependent on a  selection state ...[same text as now]....  An example of such an algorithm is described in [EsVa01].
Ok. And why not add the example you wrote above.
A possible "rule" (or algorithm) for this packet selection  can be:
- if a packet accounts for a  flow record that already exists, select the packet (i.e. simply update the flow record)
- if a packet doesn't account for any existing flow record, select it with probability p and create a new flow record for it.
I find it clear, easy to understand, and it doesn't require reading the reference.
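
The two-rule selection quoted above can be sketched in a few lines. This is only an illustration under stated assumptions: the names sample_and_hold and max_flows and the (flow_key, size) packet representation are hypothetical and come neither from the draft nor from [EsVa01], and the flow table is a plain dictionary with a fixed capacity.

```python
import random


def sample_and_hold(packets, p, max_flows, rng=None):
    """Return (selected_packets, flow_table) for an iterable of (flow_key, size) tuples."""
    rng = rng or random.Random()
    flow_table = {}  # flow_key -> accumulated byte count
    selected = []
    for flow_key, size in packets:
        if flow_key in flow_table:
            # Rule 1: a flow record already exists, so select and update it.
            flow_table[flow_key] += size
            selected.append((flow_key, size))
        elif len(flow_table) < max_flows and rng.random() < p:
            # Rule 2: no record yet, so create one with probability p,
            # provided there is still room in the (assumed) fixed-size table.
            flow_table[flow_key] = size
            selected.append((flow_key, size))
    return selected, flow_table
```

The max_flows check models the limited-memory case raised in this thread: once the table is full, a packet of a new flow is not selected even with p = 1.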


The following paragraph is an attempt to clarify the probabilistic/deterministic issue.
It may be added at the end of 3.1.2.2.3 if deemed necessary, otherwise just drop it...
 We classify this sampling as "probabilistic" because any algorithm can increase or decrease the probability that a packet is selected, but if the
 memory for keeping the flow records is limited, it can never be guaranteed
 with probability equal to one (i.e. deterministically) that the packet can be
 selected (if the corresponding flow record cannot be created).
I don't think it is necessary to add this paragraph.
I personally find it confusing to mix probability with the limited flow cache size. There are flow expiration mechanisms in IPFIX that will prevent this mechanism.


19. We wrote in the terminology section:
selection based on packet content = filtering.

MM: please note that the terminology section says
a filter is a selection operation that selects a packet  deterministically based on the packet content, its treatment, and  functions of these occurring in the selection state

So the condition for defining a selection as "filtering" is "packet content" + "deterministic". Sampling on the other hand is defined as "NOT filtering", which means that the following combinations
- Probabilistic AND pk content
- Probabilistic AND (NOT pk content)

are considered sampling. Then I think the classification we made is coherent with the definition.
Ok. The "deterministic" term should be clarified in the draft in the new section "3.1.3 sampling and packet content" proposed below, exactly as you just did here!
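
The reading of the definition discussed in this point can be written as a one-line predicate. This is a sketch of the argument only; the function name classify and its two boolean parameters are hypothetical, and "content_based" is assumed to cover packet content, its treatment, and functions of these.

```python
def classify(deterministic: bool, content_based: bool) -> str:
    """A filter is deterministic AND content-based; anything else is sampling."""
    return "filtering" if deterministic and content_based else "sampling"


# Both probabilistic combinations fall under sampling.
print(classify(deterministic=False, content_based=True))
print(classify(deterministic=False, content_based=False))
```

Under this reading, a probabilistic selection that inspects packet content is still sampling, so content-dependent sampling is not an exception to the definition.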

However, I think that your proposal of making a small section further clarifying the issue makes sense,
and the table as well helps in making the classification more understandable. I just added a few comments
inline below on the parts of your suggestion I don't fully agree with.


But in section 3.1.2.2.3, we also wrote _  This type of sampling is also  content dependent because the identification of the flow the packet  belongs to requires analyzing part of the packet content_.

And _  n-out-of-N sampling and uniform probabilistic sampling are content-independent selection schemes. For non-uniform probabilistic sampling  the sampling probability can be based on packet content. _

I would create a new small section "3.1.3 sampling and packet content", that would explain something like this:
The terminology section defines:
 Filtering: a filter is a selection operation that selects a packet deterministically based on the packet content, its treatment, and functions of these occurring in the selection state. Examples include match/mask filtering, and hash-based selection.
 Sampling: a selection operation that is not a filter is called a sampling operation.
We can deduce that not a single sampling selection can be based on the packet content.

I don't clearly understand the line above. Is there a typo? If not, could you clarify what you meant?
You answered with your previous remark about "deterministic". Ok for me now.




Nevertheless, for the more advanced sampling selections, the distinction between sampling and filtering is becoming subtle.

And some selection operations classified as sampling could in reality be based on packet content.

I'd avoid saying the line above, because this is already allowed by the general definition


These should anyway be considered as exceptions.

see above.  According to our definition they're not exceptions


The table below summarizes the behavior of the different sampling operations
                                 |  content-independent  |  content-dependent
 Sampling Scheme                 |       sampling        |      sampling
---------------------------------+-----------------------+--------------------
 systematic sampling:            |                       |
   count-based                   |           X           |
---------------------------------+-----------------------+--------------------
 systematic sampling:            |                       |
   time-based                    |           X           |
---------------------------------+-----------------------+--------------------
 random sampling:                |                       |
   n-out-of-N                    |           X           |
---------------------------------+-----------------------+--------------------
 random, probabilistic sampling: |                       |
   uniform probabilistic         |           X           |
---------------------------------+-----------------------+--------------------
 random, probabilistic sampling: |                       |
   non-uniform probabilistic     |                       |          X
---------------------------------+-----------------------+--------------------
 random, probabilistic sampling: |                       |
   non-uniform flow-state        |                       |          X
---------------------------------+-----------------------+--------------------

Note: I'm almost sure that the table will not be formatted in the correct way, so I attached a version in word.
This word document contains 2 tables. The second one is the table of section 5 where the terminology has been slightly modified.

Also in the Section: Scope and Deployment of Packet Selection Techniques
 The selection technique used to select a subset of packets out of all those crossing an observation point depends on the purpose (application) for which measurement is performed. If the main purpose of an application is to infer some characteristic of the whole set of crossing packets without processing them all (thus reducing the computation load) then we call the used selection technique "sampling". _In principle, with sampling the content of the packet is not relevant for the packet selection_: what matters is only that the selected sample has a distribution of the characteristic to infer similar to the one of the parent population, so that it can be estimated reliably. The sampling decision may be based on the temporal or spatial position of the packet in the packet stream, or may depend on a (pseudo) random number extraction or calculation.

I would add a reference to the new section.
 In principle, with sampling the content of the packet is  not relevant for the packet selection (see section 3.1.3 sampling and packet content): ...

MM: I agree, the section should be changed, but I'd have the following alternative suggestion (Tanja will probably want to comment on the proposed last paragraph....):
The selection technique used to select a subset of packets out of all those crossing an observation point depends on the purpose (application) for which measurement is performed. If the main purpose of an application is to infer some characteristic of the whole set of crossing packets without processing them all (thus reducing the computation load) then _inspecting the content of all the packets must be avoided_. This can be achieved by content-independent sampling. _In principle, with content-independent sampling_ the content of the packet is not relevant for the packet selection: what matters is only that the selected sample has a distribution of the characteristic to infer similar to the one of the parent population, so that it can be estimated reliably. The sampling decision may be based on the temporal or spatial position of the packet in the packet stream, or may depend on a (pseudo) random number extraction or calculation.
_Note that there are also sampling techniques that are dependent on the packet content (see section 3.1.3 sampling and packet content). The advantage of such techniques is that they may have a better sampling efficiency (i.e. lead to an estimation of the statistics of interest with the same precision but fewer samples). However, these techniques can be applied only when the inspection of packet content at full rate is feasible._




21.
Section:  4.2 Hashing filtering
 A hash function h maps the packet content c, or some portion of it, onto a range R. The packet is selected if h(c) is an element of S, which is a subset of R called the "selection range". Thus hash-based sampling is indeed a particular case of filtering: the object is selected if c is in inv(h(S)). But for desirable hash functions the inverse image inv(h(S)) will be extremely complex, and hence h would not be expressible as, say, a match/mask filter or a simple combination of these.
Like in my remark 11, it would be better to rewrite it with the terminology in mind: hash-based selection, hash domain, hash range, hash function, hash selection range

MM: ok for me, but with the same remark I made for pt. 11, i.e.
if you look in the terminology Hash Range means Hash co-domain, i.e. the range of values the hash can take. So here you must say something like "hash value", or "hash result" as in the original text.
yes.




23.
Section: 4.2.2 Consistent packet selection and its applications
Isn't it covered already in section 10.2 from the framework draft?

MM: yes, but 4.2.1 and 4.2.2 were introduced to give two examples of possible applications of hash-based filtering. Perhaps a better structure
to make this evident would be

4.2 Hashing filtering ....
4.2.1 Examples of hashing filtering application
4.2.1.1  Random sampling emulation ........
4.2.1.2  Consistent packet selection and its applications ...........
4.2.2  Guarding Against Pitfalls and Vulnerabilities ...........
4.3 Router State filtering




28.
Section: 5.1 Information Model for Sampling Techniques
 SELECTOR_PARAMETERS
 Description: For sampling processes the SELECTOR PARAMETERS define the input parameters for the process. Interval length in systematic sampling means, that all packets that arrive in this interval are selected. The spacing parameter defines the spacing in time or number of packets between the end of one sampling interval and the start of the next succeeding interval.
 Case n out of N:
   - _List of n sampling positions in an array of N positions_
 Case Systematic Time Based:
   - Interval length (in usec), Spacing (in usec)
 Case Systematic Count Based:
   - Interval length (in packets), Spacing (in packets)
 Case uniform Probabilistic (with equal probability per packet):
   - Sampling probability p
 Case non-uniform probabilistic:
   - Calculation function for sampling probability p
 Case _non-uniform_ flow state:
   - Policy for selecting flows (e.g. give priority to large flows)
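
To make the interval length and spacing parameters concrete, here is a small sketch of the systematic count-based case. It assumes that the first sampling interval starts at the first observed packet, which the quoted description does not state; the function name systematic_count_based is hypothetical.

```python
def systematic_count_based(num_packets, interval_length, spacing):
    """Yield the 0-based indices of the packets selected out of num_packets.

    interval_length: number of consecutive packets selected in each interval.
    spacing: number of packets skipped between the end of one interval
             and the start of the next.
    """
    period = interval_length + spacing
    for index in range(num_packets):
        if index % period < interval_length:
            yield index


# Interval length 2 and spacing 3: select two packets, skip three, repeat.
print(list(systematic_count_based(10, 2, 3)))  # [0, 1, 5, 6]
```

The time-based case would follow the same pattern with packet arrival times in microseconds in place of packet indices.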

List of n sampling positions in an array of N positions:
What if we use random numbers? Exporting all the random numbers (or the positions) doesn't make sense!
And with the random numbers or the positions, one could try to reverse engineer the function...
I think we must just export n and N and assume a good random number generation function!

MM: with your suggestion, there would no longer be any need to differentiate n out of N from uniform probabilistic sampling.
The idea of keeping this differentiation is to keep the possibility to specify (or have information about) the exact sampling pattern.
But I agree this may not be necessary and/or a good idea.
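
The two schemes being compared here can be put side by side in a short sketch. It assumes that n-out-of-N draws exactly n distinct positions from each block of N packets, while uniform probabilistic sampling decides independently for each packet; the function names are hypothetical and Python's random module stands in for whatever generator a real device would use.

```python
import random


def n_out_of_N_positions(n, N, rng):
    """Draw exactly n distinct sampling positions out of N, in increasing order."""
    return sorted(rng.sample(range(N), n))


def uniform_probabilistic(N, p, rng):
    """Select each of N packet positions independently with probability p."""
    return [i for i in range(N) if rng.random() < p]


rng = random.Random()
print(n_out_of_N_positions(3, 10, rng))    # always exactly 3 positions
print(uniform_probabilistic(10, 0.3, rng)) # 3 positions on average only
```

The list returned by n_out_of_N_positions is the "exact sampling pattern" mentioned above; reporting only n and N would drop that list and keep just the two integers.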


Minor detail: I would keep the selection operation order as defined in the table of contents.
29.
Section: 5.1 Information Model for Sampling Techniques
 OPERATING_TIME
 Description: The OPERATING_TIME parameter describes the start/stop time of sampling process. List elements must not overlap. The start time of the first element can be omitted, meaning "from now". The end time of the last element can be omitted, meaning "until sampler is removed".
 Values: List of (Start time, End time)
Why are these values interesting to report?
Unless you want those for configuration, i.e. I want to enable this sampling function for 10 minutes starting tomorrow at noon.
I'm not sure this is interesting!
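
For reference, the constraints in the quoted OPERATING_TIME description can be checked as sketched below. The representation is an assumption: a list of (start, end) tuples ordered by time, with numeric timestamps and None standing for an omitted value; validate_operating_time is a hypothetical name.

```python
def validate_operating_time(intervals):
    """Check a time-ordered list of (start, end) pairs against the quoted rules."""
    last = len(intervals) - 1
    for idx, (start, end) in enumerate(intervals):
        if start is None and idx != 0:
            return False  # only the first start time may be omitted ("from now")
        if end is None and idx != last:
            return False  # only the last end time may be omitted
        if start is not None and end is not None and start >= end:
            return False  # an interval must have positive length
    for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
        if prev_end > next_start:
            return False  # list elements must not overlap
    return True


print(validate_operating_time([(None, 10), (10, 20), (30, None)]))  # True
print(validate_operating_time([(0, 10), (5, 20)]))                  # False
```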

MM: mmmhhh.... why not? I think scheduled measurements are useful....
I actually expressed myself badly. We will have a MIB to configure the sampling functions! So you could use the NMS application to setup all the routers at the same time. So why add some pure configuration parameters in the PSAMP information model? The PSAMP information model should deal with what is exported!
So if the WG deems this feature important, I think it should be in the MIB anyway, not in the information model.

Now, the only limited advantage of scheduled measurement in the PSAMP MIB itself is that you could start a trajectory sampling application at precisely the same time on all the routers (if the clocks are synchronized)... in case the arrival time of the SNMP Set packets going from the NMS to the different routers is not precise enough for measurement scheduling.

Regards, Benoit.