
comments on draft-ietf-psamp-sample-tech-02.txt



Dear all,

Here is a list of comments on the sampling and filtering techniques version 2 draft.
As always, feel free to start a new thread on a specific topic discussed below, with a new email subject.

1.
Section: Abstract 
   
  This document describes sampling and filtering techniques for IP 
  packet selection. It introduces information models for packet 
  sampling, for packet filtering and for combinations of methods. The 
  information models describe what information has to be specified in 
  order to describe the method. This information is used for 
  configuring the selection technique in measurement processes and for 
  reporting the technique in use to the measurement data collection 
  process.   
  The document first suggests some terminology, then it describes in 
  detail packet sampling and packet filtering techniques and their 
  parameters. It also describes how these two techniques can be 
  combined to build more elaborate packet selectors. Finally, it 
  introduces information models for the most common sampling and 
  filtering techniques. 

The last sentence is a duplicate.

2. 
Section: Abstract 
   
  This document describes sampling and filtering techniques for IP 
  packet selection. It introduces information models for packet 
  sampling, for packet filtering and for combinations of methods. The 
  information models describe what information has to be specified in 
  order to describe the method. This information is used for 
  configuring the selection technique in measurement processes and for 
  reporting the technique in use to the measurement data collection 
  process.   
  The document first suggests some terminology, then it describes in 
  detail packet sampling and packet filtering techniques and their 
  parameters. It also describes how these two techniques can be 
  combined to build more elaborate packet selectors. Finally, it 
  introduces information models for the most common sampling and 
  filtering techniques. 

The framework draft draft-ietf-psamp-framework-03.txt speaks of a "collector", which I think is the preferred term.
There are surely several instances of "measurement data collection process" in this draft that should be changed accordingly.

3.
Section: terminology 
I made several comments in the email with subject "comments on draft-ietf-psamp-framework-03.txt" sent on September 18th.
Those comments also apply here, because the definitions are copied over.

4.
Section: terminology 
I'm wondering why some terms are not copied over from draft-ietf-psamp-framework-03.txt.
For example, the observation point, which is referenced more than once in this draft.
Or the observed packet stream, which is quite essential but never referred to in this draft
    (see one of my remarks below about it),
etc.
So why not copy over the entire section?

5.
  Content-independent Sampling: a sampling operation that does not use 
  packet content (or quantities derived from it) as the basis for 
  selection is called a content-independent sampling operation. 
  Examples include systematic sampling, and uniform pseudorandom 
  sampling driven by a pseudorandom number whose generation is 
  independent of packet content. Note that independent sampling a does 
  not need to access the packet content in order to make the selection 
  decision. 
independent sampling a -> content-independent sampling

6.
  Hash selection range: a subset of the hash range. The packet is 
  selected if the action of the hash function on the hash domain for 
  the packet yields a result in the hash selection range. 

This was one of my remarks on draft-ietf-psamp-framework-03.txt, which speaks of a "selection range".
Good that it has already been changed here.

7. 
  Metering process: see the definition in [QuZC03]  

I like this definition, which is one more connection with IPFIX.
But draft-ietf-psamp-framework-03.txt speaks of a measurement process.
draft-ietf-psamp-framework-03.txt should be changed to refer to the metering process.

8.
Section: Scope and Deployment of Packet Selection Techniques 
 
  The function of packet selection is to select a subset from the 
  stream of all packets visible at an observation point. Selection can 
  be used to select packets based on their content, and/or to reduce 
  the rate of packet reports regardless of content.  

Should become
  The function of packet selection is to select a subset from the 
  Observed Packet Stream at an observation point. Selection can 
  be used to select packets based on their content, and/or to reduce 
  the rate of packet reports regardless of content. 

Note: I capitalized "Observed Packet Stream" because it is a defined term. I think all defined terms should be capitalized in the same way.

9.
Section: Scope and Deployment of Packet Selection Techniques 

  The selection technique used to select a subset of packets out of all 
  those crossing an observation point depends on the purpose 
  (application) for which measurement is performed.

Should become
  The selection technique used to select a subset of packets out of the 
  Observed Packet Stream at the observation point depends on the purpose 
  (application) for which measurement is performed.

10.
Section: Scope and Deployment of Packet Selection Techniques 

  The selection technique used to select a subset of packets out of all 
  those crossing an observation point depends on the purpose 
  (application) for which measurement is performed. If the main purpose 
  of an application is to infer some characteristic of the whole set of 
  crossing packets without processing them all (thus reducing the 
  computation load) then we call the used selection technique 
  ôsamplingö.

You have many of these "quote" issues in your draft. They are caused by Word, which inserts special characters.
Solution: cut and paste the quotes from the "Status of this Memo" section, where they are OK.

11.
Section: Scope and Deployment of Packet Selection Techniques 

  Note that a common technique to select packets is to compute a hash 
  function on some bits of the packet header and/or content and to 
  select it if the result falls in a certain selection range. Since 
  hashing is a deterministic operation, it is a powerful mean to ensure 
  that the same packets are selected at multiple measurement points. 
  Depending on the chosen input bits, on the hash function and on the 
  selection range, this technique could also be used to emulate the 
  random selection of packets with a given probability p. Hashing is 
  then a particular type of filtering, but can also be used to emulate 
  random sampling. 

I would rewrite this with the terminology section in mind: hash-based 
selection, hash domain, hash range, hash function, hash selection range

Something like
  Note that a common technique to select packets is to compute a Hash 
  Function on the Hash Domain (some bits of the packet header and/or 
  content) and to select the packet if the result falls in the Hash 
  Selection Range. Since hashing is a deterministic operation, it is a 
  powerful means to ensure that the same packets are selected at 
  multiple measurement points. Depending on the chosen input bits of 
  the Hash Domain, on the Hash Function and on the Hash Selection 
  Range, Hash-based Selection could also be used to emulate the random 
  selection of packets with a given probability p. Hashing is then a 
  particular type of filtering, but can also be used to emulate random 
  sampling. 
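
A minimal Python sketch of hash-based selection as described above, to make the roles of the terms concrete. The choice of CRC32 as the Hash Function, the 8-byte stand-in for the Hash Domain, and the name hash_select are assumptions made for illustration only; none of them come from the draft.

```python
import zlib

HASH_RANGE = 2**32  # CRC32 maps onto the integers 0 .. 2**32 - 1


def hash_select(hash_domain: bytes, p: float) -> bool:
    """Select the packet if its hash falls in the Hash Selection Range.

    The Hash Selection Range is assumed to be [0, p * HASH_RANGE), so
    that roughly a fraction p of well-mixed inputs is selected.
    """
    return zlib.crc32(hash_domain) < int(p * HASH_RANGE)


# Hypothetical Hash Domain: 1000 distinct 8-byte values standing in for
# the invariant header/content bits of 1000 packets.
packets = [i.to_bytes(8, "big") for i in range(1000)]

# Two measurement points applying the same function to the same packets
# reach identical decisions without any coordination between them.
selected_at_a = [pkt for pkt in packets if hash_select(pkt, 0.1)]
selected_at_b = [pkt for pkt in packets if hash_select(pkt, 0.1)]
```

How closely the selected fraction approaches p depends on how well the Hash Function mixes the chosen input bits, which is the caveat the paragraph above makes.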

12.
Section: Scope and Deployment of Packet Selection Techniques 

  The introduced classification is mainly useful for the definition of 
  an information model describing ôprimitiveö selection techniques. 

The selector defined in draft-ietf-psamp-framework-03.txt should be reused

13.
Section: Scope and Deployment of Packet Selection Techniques 

  We consider packet selectors as part of an IPFIX metering process 
  which also can use the IPFIX exporting process. This is expressed as 
  association to one or more IPFIX processes. 

I think this notion above is essential but shouldn't it be part of the framework draft 
draft-ietf-psamp-framework-03.txt instead of this draft?

14.
Section: Scope and Deployment of Packet Selection Techniques 
  Sampling Methods can be characterized by the sampling algorithm, the 
  trigger type used for starting a sampling interval and the length of 
  the sampling interval. These parameters are described here in detail. 
  The sampling algorithm describes the basic process for selection of 
  samples. In accordance to [AmCa89] and [ClPB93] we define the 
  following basic sampling processes: 

"Methods" should be in lower case.

15.
Section: 3.1.1  Systematic Sampling  
  The use of systematic sampling always involves the risk of biasing 
  the results. If the systematics in the sampling process resembles 
  systematics in the observed stochastic process (occurrence of the 
  characteristic of interest in the network), there is a high 
  probability that the estimation will be biased. Systematics (e.g. 
  periodic repetition of an event) in the observed process might not be 
  known in advance.  

Should become
  The use of systematic sampling always involves the risk of biasing 
  the results. If the systematics (e.g. periodic repetition of an 
  event) in the sampling process resemble systematics in the observed 
  stochastic process (occurrence of the characteristic of interest in 
  the network), there is a high probability that the estimation will 
  be biased. Systematics in the observed process might not be known in 
  advance. 
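
The bias risk described in this paragraph can be illustrated with a small Python sketch. The stream (every 10th packet carries the characteristic of interest) and the helper names are hypothetical and chosen only to make the effect visible.

```python
# Hypothetical observed stream: every 10th packet carries the
# characteristic of interest (for example, a periodic probe).
PERIOD = 10
stream = [(i % PERIOD == 0) for i in range(10_000)]
true_fraction = sum(stream) / len(stream)


def systematic_sample(packets, spacing, offset):
    """Count-based systematic sampling: take every spacing-th packet."""
    return packets[offset::spacing]


def estimate(sample):
    """Fraction of sampled packets that carry the characteristic."""
    return sum(sample) / len(sample)


# Spacing equal to the period: the estimate depends only on the offset.
in_phase = estimate(systematic_sample(stream, 10, 0))
out_of_phase = estimate(systematic_sample(stream, 10, 3))

# Spacing coprime with the period: the sample cycles through all phases.
coprime = estimate(systematic_sample(stream, 7, 0))
```

With a spacing equal to the period, the estimate is either 1.0 or 0.0 depending on the offset, while the true fraction is 0.1. A spacing of 7 stays close to the true fraction in this particular stream.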

16.
Section: 3.1.2.2.1      Uniform Probabilistic Sampling 
   
  For Uniform Random Sampling packets are selected independently with 
  some uniform probability 1/N. This sampling can be count-driven, and 
  is sometimes referred to as geometric random sampling, since the 
  difference in count between successive selected packets are 
  independent random variables with a geometric distribution of mean N. 
  A time-driven analog, exponential random  sampling, has the time 
  between triggers exponentially distributed. 
  Both geometric and exponential random sampling are examples of what 
  is known as additive random sampling, defined as sampling where the 
  intervals or counts between successive samples are independent 
  identically distributed random variable. 

Uniform Random Sampling -> Uniform Probabilistic Sampling
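
As an illustration of the quoted paragraph, the following Python sketch selects packets independently with probability 1/N and measures the count difference between successive selected packets. The value N = 20, the stream length and the fixed seed are arbitrary assumptions.

```python
import random

N = 20                    # select each packet with probability 1/N
rng = random.Random(1)    # fixed seed so the run is repeatable

# Independent per-packet decisions over a hypothetical stream of packets.
selected = [i for i in range(200_000) if rng.random() < 1 / N]

# Count difference between successive selected packets.
gaps = [b - a for a, b in zip(selected, selected[1:])]
mean_gap = sum(gaps) / len(gaps)
```

The mean gap should come out close to N, consistent with the geometric distribution of mean N mentioned in the text.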

17.
Section: 3.1.2.2.2      Non-Uniform Probabilistic Sampling 
   
  Also known as non-uniform probability sampling, this is a variant of 
  independent random sampling in which the sampling probabilities can 
  depend on the selection process input. This can be used to weight 
  sampling probabilities in order e.g. to boost the chance of sampling 
  packets that are rare but are deemed important. Unbiased estimators 
  for quantitative statistics are recovered by renormalization of 
  sample values; see [HT52]. 

"Also known as non-uniform probability sampling", not sure it's necessary ;)
"a variant of independent random sampling" -> not defined before
shouldn't it be "uniform probabilistic sampling"?
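
The renormalization mentioned in the quoted paragraph can be sketched in Python as follows. The packet sizes, the weighting rule in sampling_probability and the seed are assumptions for illustration. The estimator divides each sampled value by its own sampling probability, in the manner of the estimator from [HT52].

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical packet sizes in bytes: many small packets, few large ones.
sizes = [40] * 9_000 + [1500] * 1_000
true_total = sum(sizes)


def sampling_probability(size: int) -> float:
    """Assumed weighting: large packets are sampled more often."""
    return 0.5 if size >= 1000 else 0.05


# Sample each packet independently with its own probability, then
# renormalize each sampled value by dividing it by that probability.
estimate = 0.0
for size in sizes:
    p = sampling_probability(size)
    if rng.random() < p:
        estimate += size / p
```

The estimate of the total byte count is unbiased even though large packets are heavily over-represented in the sample.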

18.
Section: 3.1.2.2.3      Non-Uniform flow State dependent sampling 

  Another type of sampling that can be classified as Non-Uniform (and, 
  possibly, probabilistic) is closely related to the flow concept as 
  defined in [QuZC02], ...

I don't understand "(and, possibly, probabilistic)", because we are already under the probabilistic sampling section 3.1.2.2.

19. 
We wrote in the terminology section:
selection based on packet content = filtering.

But in section 3.1.2.2.3, we also wrote 
  This type of sampling is also 
  content dependent because the identification of the flow the packet 
  belongs to requires analyzing part of the packet content.

And 
  n-out-of-N sampling and uniform probabilistic sampling are contentû
  independent selection schemes. For non-uniform probabilistic sampling 
  the sampling probability can be based on packet content. 

I would create a new small section "3.1.3 sampling and packet content", that would explain something like this:
The terminology section defines:
  Filtering: a filter is a selection operation that selects a packet 
  deterministically based on the packet content, its treatment, and 
  functions of these occurring in the selection state. Examples include 
  match/mask filtering, and hash-based selection. 
   
  Sampling: a selection operation that is not a filter is called a 
  sampling operation. 

From these definitions we can deduce that no sampling operation can be based on the packet content.
Nevertheless, for the more advanced sampling operations, the distinction between sampling and filtering becomes subtle,
and some selection operations classified as sampling may in reality be based on packet content.
These should in any case be considered exceptions.
The table below summarizes the behavior of the different sampling operations.
                                            
                                      |  content-independent  |  content-dependent
      Sampling Scheme                 |       sampling        |      sampling
      --------------------------------+-----------------------+--------------------
      systematic sampling:            |                       |
        count-based                   |           X           |
      --------------------------------+-----------------------+--------------------
      systematic sampling:            |                       |
        time-based                    |           X           |
      --------------------------------+-----------------------+--------------------
      random sampling:                |                       |
        n-out-of-N                    |           X           |
      --------------------------------+-----------------------+--------------------
      random, probabilistic sampling: |                       |
        uniform probabilistic         |           X           |
      --------------------------------+-----------------------+--------------------
      random, probabilistic sampling: |                       |
        non-uniform probabilistic     |                       |          X
      --------------------------------+-----------------------+--------------------
      random, probabilistic sampling: |                       |
        non-uniform flow-state        |                       |          X
      --------------------------------+-----------------------+--------------------

Note: I'm almost sure that the table will not be formatted correctly in this email, so I attached a version in Word.
The Word document contains two tables. The second one is the table of section 5, where the terminology has been slightly modified.

Also in the Section: Scope and Deployment of Packet Selection Techniques 

  The selection technique used to select a subset of packets out of all 
  those crossing an observation point depends on the purpose 
  (application) for which measurement is performed. If the main purpose 
  of an application is to infer some characteristic of the whole set of 
  crossing packets without processing them all (thus reducing the 
  computation load) then we call the used selection technique 
  ôsamplingö. In principle, with sampling the content of the packet is 
  not relevant for the packet selection: what matters is only that the 
  selected sample has a distribution of the characteristic to infer 
  similar to the one of the parent population, so that it can be 
  estimated reliably. The sampling decision may be based on the 
  temporal or spatial position of the packet in the packet stream, or 
  may depend on a (pseudo) random number extraction or calculation.

I would add a reference to the new section.
  In principle, with sampling the content of the packet is 
  not relevant for the packet selection (see section 3.1.3 sampling and packet content): ...


20.
Some other issues with Word
   
  (i.e. thereÆs no room to keep all the flows that have been scheduled 
  for monitoring). 

  This type of filtering selects a packet operating as follows: first a 
  combination of packetÆs bit positions is selected taking the logical 
  AND of portion of the packetÆs bits and a mask, then itÆs checked if 

21.
Section:  4.2 Hashing filtering 
   
  A hash function h maps the packet content c, or some portion of it, 
  onto a range R. The packet is selected if h(c) is an element of S, 
  which is a subset of R called the ôselection rangeö. Thus hash-based 
  sampling is indeed a particular case of filtering: the object is 
  selected if c is in inv(h(S)). But for desirable hash functions the 
  inverse image inv(h(S)) will be extremely complex, and hence h would 
  not be expressible as, say, a match/mask filter or a simple 
  combination of these. 

As in my remark 11, it would be better to rewrite this with the terminology in mind:
hash-based selection, hash domain, hash range, hash function, hash selection range

22.
Section: 4.2.1 Random sampling emulation 
   
  Although pseudorandom number generators with well understood 
  properties have been developed, they may not be the method of choice 
  in setting where computational resources are scarce. A convenient 
  alternative is to use hash functions of packet content as a source of 
  randomness. The hash (suitably renormalized) is a pseudorandom 
  variate in the interval [0,1]. Other schemes may use packet fields in 
  iterators for pseudorandom numbers. 
  The point here, is that the statistical properties of the idealized 
  packet selection law (such as independence of sampling decisions for 
  different packets, or independence on packet content) may not be 
  exactly shared by an implementation, but only approximately so. 
  Although the selection decisions for non-uniform independent random 
  sampling (see Section 3.1.2.2.2 above) also depend on the packet 
  content, this form of sampling is distinguished from the use of 
  packet content to generate variates. In the former case, the content 
  only determines the selection probabilities: selection could then 
  proceed e.g by use of a variates obtained by an independent 
  pseudorandom number generator. In the latter case, the content 
  determines the pseudorandom variates rather than the probabilities. 

non-uniform independent random sampling -> I guess this should be non-uniform probabilistic sampling 
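
The distinction drawn at the end of the quoted section can be sketched in Python. All function names, the use of CRC32 as the hash, and the weighting rule are assumptions for illustration only.

```python
import random
import zlib


def hash_variate(content: bytes) -> float:
    """Renormalize a CRC32 of the content to a variate in [0, 1)."""
    return zlib.crc32(content) / 2**32


def emulated_uniform_select(content: bytes, p: float) -> bool:
    """Latter case: the content determines the pseudorandom variate,
    while the selection probability p is fixed."""
    return hash_variate(content) < p


def non_uniform_select(content: bytes, rng: random.Random) -> bool:
    """Former case: the content determines only the probability; the
    variate comes from an independent pseudorandom number generator."""
    p = 0.5 if len(content) >= 1000 else 0.05   # assumed weighting rule
    return rng.random() < p
```

Calling emulated_uniform_select twice on the same content always gives the same answer, whereas non_uniform_select may select or skip the same content on different calls.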

23.
Section: 4.2.2 Consistent packet selection and its applications 

Isn't this already covered in section 10.2 of the framework draft?

24.
Section: 4.2.3  Guarding Against Pitfalls and Vulnerabilities 
   
  Hash sampling could be overloaded (or evaded) by an attacker if the 
  hash function and the selection rate are both known. A service 
  provider could build a first defense keeping S private. Then, an 
  attacker could not determine whether a crafted packet is select, but 
  he would still know that a crafted a set of packets all with the same 
  hash is either all selected or all not selected. Moreover, when 
  applications (like multi domain trajectory sampling, or One way delay 
  estimation across multiple domains) may require multiple 
  administrative entities to agree on a common hash function and 
  selection range, mutual trust between the entities cannot be avoided 
  and just keeping S secret may not be feasible. A stronger defense is 
  to employ a parametrizable hash function and keep the parameter 
  private: in this case, the set of hash values of the packets could 
  not be determined. Examples of parameters are the initial vector in 
  CRC32, and moduli in hashes based on modular arithmetic. 

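S should be referred to as the Hash Selection Range S.
Typos in the two other underlined parts: "is select" -> "is selected", and "a crafted a set of packets" -> "a crafted set of packets".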
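
The idea of a parametrizable hash function with a private parameter can be sketched in Python. Note that this sketch swaps in keyed BLAKE2b from the standard hashlib module as a stand-in. It is not the CRC32 initial vector or the modular-arithmetic hash that the draft names as examples, and the function name and secrets are hypothetical.

```python
import hashlib


def keyed_hash_select(hash_domain: bytes, secret: bytes, p: float) -> bool:
    """Select if a keyed 64-bit hash falls in [0, p * 2**64)."""
    digest = hashlib.blake2b(hash_domain, digest_size=8, key=secret).digest()
    return int.from_bytes(digest, "big") < int(p * 2**64)


packets = [i.to_bytes(8, "big") for i in range(1000)]  # hypothetical inputs

# Domains that share the private parameter make identical decisions ...
domain_a = [keyed_hash_select(pkt, b"shared-secret", 0.1) for pkt in packets]
domain_b = [keyed_hash_select(pkt, b"shared-secret", 0.1) for pkt in packets]

# ... while a party using a different parameter gets a different set.
outsider = [keyed_hash_select(pkt, b"wrong-guess", 0.1) for pkt in packets]
```

Administrative entities that need consistent selection would have to share the private parameter, which is the trust issue raised in the quoted text.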

25.
Section: 4.3 Router State Filtering

See my comments related to this section in my email with subject "comments on
draft-ietf-psamp-framework-03.txt" sent on September 18th.

26.
Section: 5 Input Parameters and Information Models

Look at the second table in the attached document. Some minor changes to the terminology.

27.
Section: 5.1 Information Model for Sampling Techniques 
   
  SELECTOR_TYPE 
  Description: For sampling processes the SELECTOR TYPE defines what 
  sampling algorithm is used. 
  Values: n out of N | Systematic Time Based (equally spaced)| 
  Systematic Position Based (equally spaced)| Probabilistic | flow 
  state  

All selection operations should be in there.

28.
Section: 5.1 Information Model for Sampling Techniques 

  SELECTOR_PARAMETERS 
  Description: For sampling processes the SELECTOR PARAMETERS define 
  the input parameters for the process. Interval length in systematic 
  sampling means, that all packets that arrive in this interval are 
  selected. The spacing parameter defines the spacing in time or number 
  of packets between the end of one sampling interval and the start of 
  the next succeeding interval. 
 
  Case n out of N: 
     - List of n sampling positions in an array of N positions 
   
  Case Systematic Time Based: 
     - Interval length (in usec), Spacing (in usec) 
   
  Case Systematic Count Based: 
     - Interval length(in packets), Spacing (in packets) 
   
  Case uniform Probabilistic(with equal probability per packet): 
     - Sampling probability p 
      
  Case non-uniform probabilistic: 
     - Calculation function for sampling probability p 
   
  Case non-uniform flow state: 
     - Policy for selecting flows (e.g. give priority to large flows) 


"List of n sampling positions in an array of N positions":
What if we use random numbers? Exporting all the random numbers (or the positions) doesn't make sense!
And with the random numbers or the positions, one could try to reverse engineer the function...
I think we should just export n and N and assume a good random number generation function!

Minor detail: I would keep the selection operations in the same order as in the table of contents.
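
A minimal Python sketch of the suggestion above: the sampler draws n random positions in each window of N packets, but reports only n and N. The function name, the window-by-window interpretation of "n out of N", and the keys "n" and "N" in the report are assumptions, not taken from the draft.

```python
import random


def n_out_of_N(packets, n, N, rng):
    """Select n packets at random positions out of each window of N."""
    selected = []
    for start in range(0, len(packets), N):
        window = packets[start:start + N]
        k = min(n, len(window))          # last window may be shorter
        positions = sorted(rng.sample(range(len(window)), k))
        selected.extend(window[pos] for pos in positions)
    return selected


rng = random.Random(42)           # fixed seed so the run is repeatable
packets = list(range(1000))       # hypothetical packet identifiers
sample = n_out_of_N(packets, n=5, N=100, rng=rng)

# Only n and N are reported; the positions stay inside the sampler.
report = {"SELECTOR_TYPE": "n out of N", "n": 5, "N": 100}
```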

29.
Section: 5.1 Information Model for Sampling Techniques 

  OPERATING_TIME 
  Description: The OPERATING_TIME parameter describes the start/stop 
  time of sampling process. List elements must not overlap. The start 
  time of the first element can be omitted, meaning ôfrom nowö. The end 
  time of the last element can be omitted, meaning ôuntil sampler is 
  removedö.
  Values: List of (Start time, End time)  

Why are these values interesting to report?
Unless you want those for configuration, i.e. I want to enable this sampling function for 10 minutes starting tomorrow at noon.
I'm not sure this is interesting!

30.
Section: 5.1 Information Model for Sampling Techniques 

  ASSOCIATIONS 
  Description: The ASSOCIATIONS field describes the observation point 
  and the IPFIX processes to which the packet selector is associated.  
  The STREAM ID denotes the origin of the data stream that is input to 
  the selection function. It can be the observation point directly or 
  the ID of another selector. With this it is possible to define   
  combined schemes. If the STREAM ID contains IDs from other selectors, 
  one can derive the original observation point from the selector 
  definitions of these specified selectors. 
   
  Values: <STREAM ID, Metering process ID, Exporting process ID> 
  With STREAM ID: Observation point ID | List of SELECTOR_IDs 

The STREAM ID is composed of the Observation Point ID OR the list of Selector IDs.
It should be: the Observation Point ID AND the list of Selector IDs.
There is no point in knowing that a given sampling function is applied with given parameters if we don't know on which interface it is applied.
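
To illustrate what the current definition implies, here is a Python sketch of how a collector would have to derive the observation point when the STREAM ID carries only selector IDs. The Selector class, its field names, and the simplification to a single input selector per selector are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Selector:
    selector_id: int
    observation_point_id: Optional[int] = None   # set if fed directly
    input_selector_id: Optional[int] = None      # set if fed by a selector


def resolve_observation_point(selector_id, selectors):
    """Follow input selectors back to the original observation point.

    Assumes the chain is acyclic and ends at a selector that is fed
    directly by an observation point.
    """
    current = selectors[selector_id]
    while current.observation_point_id is None:
        current = selectors[current.input_selector_id]
    return current.observation_point_id


# Hypothetical combined scheme: observation point 7 -> selector 1 -> 2 -> 3.
selectors = {
    1: Selector(1, observation_point_id=7),
    2: Selector(2, input_selector_id=1),
    3: Selector(3, input_selector_id=2),
}
origin = resolve_observation_point(3, selectors)
```

If the Observation Point ID were always reported together with the list of Selector IDs, this lookup through the other selector definitions would not be needed.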



That's it for now regarding my comments.

Regards, Benoit.



Attachment: tables_for_sampling_techniques_02.doc
Description: MS-Word document