
Re: Shim6 failure recovery after garbage collection



Hi Igor,

On 18/04/2006, at 4:23, Igor Gashinsky wrote:

:: One of the more comprehensible objections to shim6 that was raised at
:: NANOG 35 was from large content providers who currently serve many
:: thousands of simultaneous clients through load balancers or other
:: content-aggregation devices (the kind of devices which switch
:: connections to origin servers without having to store any locally).
::
:: I don't remember the precise number of simultaneous sessions the
:: devices were intended to be capable of serving, but it was a lot.
::
:: The observation was that with the amount of (server, client) state
:: being held on those devices, adding what might be an average of (say)
:: 2x128 bits + misc overhead per session might present scaling
:: difficulties.

A single WSM-6 Foundry SI450 can handle 15M sessions in its state
machine. Assuming an overhead of, say, 320 bits per session * 15M
sessions, we come up with approx 600MB of extra RAM added to those
devices (and that's on the low side). Multiply that by the *hundreds*
of these devices a large content provider would have, and it's not a
small cost (depending on the vendor, that memory is not general-purpose
DRAM, and could be *very* expensive). And that is only the extra memory
needed to do nothing but hold the other locators.
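
As a rough sanity check of that figure (a back-of-the-envelope sketch
only; the 320 bits per session is the assumption quoted above):

    # back-of-the-envelope: extra SLB memory just to hold spare locators
    sessions = 15_000_000           # sessions the box can track
    per_session_bits = 320          # e.g. one extra locator pair + misc overhead
    extra_bytes = sessions * per_session_bits // 8
    print(extra_bytes / 1e6, "MB")  # -> 600.0 MB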


I am not very familiar with this type of device, but why do they need to hold shim6 state at all?... I mean, the shim state is required only at the end points of the communication, not in any middlebox, AFAICT...

On the web server side, it's not unheard of for a single web server to
handle 10-20k active, concurrent connections, with another 20k or so
in various *_WAIT states. Adding an extra 40 bytes of per-session
overhead per server is really not that bad (800KB of RAM per server),
although I have no idea what that overhead does to the kernel queues...


But the shim6 context is not per connection but per peer (more precisely, per ULID pair)... hence the question is how many different ULID pairs are involved in those connections. I mean, AFAIK, each client generally establishes quite a few more than one TCP connection to the web server just to download a page...
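
To illustrate the difference (a toy sketch only, not the real shim6
implementation; the names are made up): the shim layer keys its state
on the ULID pair, so all the parallel TCP connections from one client
collapse into a single context:

    # toy model: shim6 state is keyed by (local ULID, peer ULID),
    # not by the TCP 5-tuple
    contexts = {}   # (local_ulid, peer_ulid) -> shim state (e.g. spare locators)

    def open_connection(local_ulid, peer_ulid, peer_locators):
        key = (local_ulid, peer_ulid)
        if key not in contexts:   # only the first connection creates shim state
            contexts[key] = {"peer_locators": list(peer_locators)}
        return key

    # one client fetching a page over 4 parallel TCP connections
    for _ in range(4):
        open_connection("2001:db8:a::1", "2001:db8:b::2", ["2001:db8:c::2"])
    print(len(contexts))   # -> 1 context, not 4 per-connection entries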

The thing is, those numbers only take into account holding *locators*,
and when you start talking about holding onto other things (like, say,
reachability state, or performance (RTT) state) the memory utilization
starts to increase slightly more, although it's still manageable for
the servers (but it rapidly gets more and more expensive on the SLBs).
When you then start talking about also holding some sort of TE state
(because TE is a requirement), and you need to add the routing table
into the equation, *now* it gets downright nasty. 10-20MB per server of
shim6 overhead is minor, but add in 200+MB of routing state, and it's a
non-starter.

But you don't need to have the full routing table in the server...

I mean, the server needs to know its preferences about which locator pairs it prefers.

I guess you can have a very fine-grained policy, but I don't think you will need 200,000 preferences set in the server.

Now, perhaps what you need is to use the routing table information to decide which locator you prefer, is that it?

In this case, I agree that having the full BGP table may be useful to select which path to use (hence which source address to use to reach a certain destination). Moreover, it is likely that you may need multiple BGP feeds, one per available ISP (so that you can select the path through the ISP that is best).

But for this, it is possible to offload the BGP processing to a separate box, like the NAROS solution proposed a while back.
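
Roughly like this (a purely illustrative sketch; the table and function
names are invented, this is not the actual NAROS protocol): the box
holding the BGP feeds answers "which source locator should I prefer
towards this destination", and the hosts keep no routing state at all:

    # toy NAROS-style lookup: the separate box holds the BGP feeds and
    # the TE policy; the host just asks and caches the answer
    import ipaddress

    # hypothetical policy table on the separate box, built from its BGP feeds
    POLICY = {
        ipaddress.ip_network("2001:db8:1000::/36"): "2001:db8:a::1",  # prefer ISP A
        ipaddress.ip_network("2001:db8:2000::/36"): "2001:db8:b::1",  # prefer ISP B
    }

    def preferred_source(destination, default="2001:db8:a::1"):
        dst = ipaddress.ip_address(destination)
        for prefix, src in POLICY.items():
            if dst in prefix:
                return src
        return default

    print(preferred_source("2001:db8:2000::80"))   # -> 2001:db8:b::1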



Also, all of this conversation is only talking about memory overhead;
what about other overhead? Would the server have to do any sort of
failure detection, and how many cycles would that consume?

That would depend on the type of traffic...
If the traffic is bidirectional (which in the TCP case it usually is) and the timers are tuned, then there is probably no need to send keepalives, so no added traffic.
But again, it depends on the traffic pattern.
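
In other words, something along these lines (a simplified sketch of the
idea, not the actual shim6 reachability state machine, and the timeout
value is just an example):

    import time

    KEEPALIVE_TIMEOUT = 10.0   # example value; real timers are tuned/negotiated

    class ShimContextTimers:
        """Toy rule: a keepalive is only needed when we keep receiving
        from the peer but have sent nothing ourselves for a while."""
        def __init__(self):
            self.last_sent = time.monotonic()
            self.last_received = time.monotonic()

        def on_payload_sent(self):
            self.last_sent = time.monotonic()

        def on_payload_received(self):
            self.last_received = time.monotonic()

        def keepalive_needed(self):
            idle_tx = time.monotonic() - self.last_sent
            return self.last_received > self.last_sent and idle_tx > KEEPALIVE_TIMEOUT

    # with bidirectional traffic (data + ACKs) last_sent keeps getting
    # refreshed, so keepalive_needed() stays False and nothing extra is sent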

 Would the server have
to do any sort of path optimization,

Not sure what you mean by this...

 and how many cycles would that
consume? How do I get TE state to all of my 100k+ hosts?

We have already exchanged some emails about this, about DHCP and so on. But I guess the point is:
- this is not addressed today
- however, we can try to design a mechanism that addresses this need and fulfills the requirements that you have; whether this is something like a central server (or servers) that downloads the policy to the hosts, or something like what Erik's draft proposes (letting the routers select the exit path and rewrite the source addresses), is up for discussion (one possible shape is sketched below).
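
For example (just one possible shape of such a downloaded policy,
entirely made up for illustration), the central server could hand each
host something as small as:

    # hypothetical format: hosts periodically pull a small TE policy
    # table instead of holding any BGP state themselves
    import json

    policy_blob = json.dumps({
        "version": 42,
        "prefer": [
            {"dst_prefix": "2001:db8:1000::/36", "src_locator": "2001:db8:a::1", "weight": 80},
            {"dst_prefix": "2001:db8:1000::/36", "src_locator": "2001:db8:b::1", "weight": 20},
        ],
    })

    policy = json.loads(policy_blob)
    print(policy["version"], len(policy["prefer"]), "rules")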

 How many
cycles would all the hosts need to consume if one of my peers
bounces, and now, instead of 10-20 routers processing that, all 100k
of my hosts have to be updated with that information?

But would all 100k hosts be actively talking to the same peer? Is that an expected scenario? I would say that, in general, if a peer bounces, then only those hosts that have active communications with this peer will need to rehome their communications.
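
That is, the work is naturally scoped to the affected contexts (a toy
illustration with made-up addresses):

    # only hosts/contexts that actually talk to the bounced peer do any
    # work; everyone else never even notices
    contexts = {
        ("2001:db8:a::1", "2001:db8:b::2"): "established",
        ("2001:db8:a::1", "2001:db8:d::9"): "established",
    }
    bounced_peer = "2001:db8:b::2"
    affected = [k for k in contexts if k[1] == bounced_peer]
    print(len(affected), "of", len(contexts), "contexts need to rehome")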


Regards, marcelo


 etc...

-igor