Re: Shim6 failure recovery after garbage collection
Hi Igor,
On 18/04/2006, at 4:23, Igor Gashinsky wrote:
:: One of the more comprehensible objections to shim6 that was raised
:: at NANOG 35 was from large content providers who currently serve
:: many thousands of simultaneous clients through load balancers or
:: other content-aggregation devices (the kind of devices which switch
:: connections to origin servers without having to store any locally).
::
:: I don't remember the precise number of simultaneous sessions the
:: devices were intended to be capable of serving, but it was a lot.
::
:: The observation was that with the amount of (server, client) state
:: being held on those devices, adding what might be an average of
:: (say) 2x128 bits + misc overhead per session might present scaling
:: difficulties.
A single WSM-6 Foundry SI450 can handle 15M sessions in the state
machine. Assuming an overhead of, say, 320 bits per session * 15M
sessions, we come up with approx 600MB of extra RAM added to those
devices (and that's on the low side). Multiply that out by the
*hundreds* of these devices a large content provider would have, and
it's not a small cost (depending on the vendor, that memory is not
general purpose DRAM, and could be *very* expensive). Now, that is
only extra memory to do nothing but hold other locators.
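
A quick back-of-the-envelope check of that figure (the 320 bits per
session is the assumed average from the paragraph above, not a
measured shim6 number):

  # Rough sketch of the memory arithmetic above; 320 bits/session is
  # the assumed overhead (~2 x 128-bit locators + misc), and 15M is
  # the quoted session capacity of one SLB.
  overhead_bytes = 320 // 8                 # 40 bytes per session
  sessions = 15_000_000
  print(overhead_bytes * sessions / 1e6)    # -> 600.0 (MB per device)
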
i am not very familiar with this type of device, but why do they need
to have shim6 state on them?... i mean, the shim state is required
only at the end points of the communication, not in any middle box,
AFAICT...
On the web server side, it's not unheard of for a single webserver to
handle 10-20k active, concurrent connections, with another 20k or so
being in various *_WAIT states. Adding an extra 40 bytes of
per-session overhead is really not that bad (800KB of RAM/server),
although I have no idea what that overhead does to the kernel queues...
but the shim6 context is not per connection, but per peer (more
precisely, per ULID pair)... hence the question is how many different
ULID pairs are involved in these connections... I mean AFAIK, each
client in general establishes quite a few TCP connections to the web
server to download a page...
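
To make that distinction concrete, here is a minimal sketch (the names
and fields are invented for illustration, not taken from any shim6
implementation) of a context store keyed by ULID pair, where several
TCP connections from the same client map onto a single context:

  # Illustrative only: contexts are keyed by (local ULID, peer ULID),
  # so N connections from one client still cost one shim6 context.
  contexts = {}

  def context_for(local_ulid, peer_ulid):
      key = (local_ulid, peer_ulid)
      if key not in contexts:
          contexts[key] = {"peer_locators": [], "reachability": None}
      return contexts[key]

  # four parallel connections from the same client ULID; the client
  # port plays no role in the context key
  for port in (51000, 51001, 51002, 51003):
      ctx = context_for("2001:db8:1::80", "2001:db8:2::7")

  print(len(contexts))   # -> 1 context, not 4
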
The thing is, those numbers are only taking into account holding
*locators*, and when you start talking about holding onto other things
(like, say, reachability state or performance (RTT) state) the memory
utilization starts to increase a bit more, although it is still
manageable for the servers (but it rapidly gets more and more
expensive on the SLBs).
When you then start talking about also holding some sort of TE state
(because TE is a requirement), and you need to add the routing table
into the equation, *now* it gets downright nasty. 10-20MB per server
for shim6 overhead is minor, but add in 200+MB of routing state, and
it's a non-starter.
but you don't need to have the full routing table in the server...
I mean the server needs to know its preferences about which locator
pairs it prefers
i guess you can have a very fine-grained policy, but i don't think
you will need 200,000 preference entries in the server.
Now perhaps what you need is to use the routing table information to
decide which locator you prefer, is that it?
In this case, i agree that having the full bgp table may be useful to
select which path to use (hence which source address to use to reach
a certain destination). Moreover, it is likely that you may need
multiple bgp feeds, one per available ISP (so that you can select the
path through the ISP that is best).
But for this, it is possible to off-load the bgp processing to a
separate box, like the NAROS solution proposed a while back.
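
As a purely hypothetical sketch of that off-loading idea (the query
interface below is invented for illustration; NAROS defines its own
protocol), the host would keep only a small cache of source-locator
preferences and ask a separate box, which holds the per-ISP BGP feeds,
on a cache miss:

  # Hypothetical sketch: small preference cache on the host, actual
  # path/source selection off-loaded to a separate NAROS-like box
  # (query_selector is a made-up stand-in for that service).
  prefs = {}   # destination prefix -> preferred local locator

  def pick_source_locator(dst_prefix, local_locators, query_selector):
      if dst_prefix not in prefs:
          # one query per destination prefix, answered by the box that
          # holds the BGP feeds; the result is cached locally
          prefs[dst_prefix] = query_selector(dst_prefix, local_locators)
      return prefs[dst_prefix]
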
Also, all of this conversation is only talking about memory overhead;
what about other overhead? Would the server have to do any sort of
failure detection, and how many cycles would that consume?
that would depend on the type of traffic...
if the traffic is bidirectional (which, in the TCP case, it usually
is) and the timers are tuned, then probably there is no need to send
keepalives, so no added traffic
but again, it depends on the traffic pattern
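
Roughly the idea, as a sketch (the timer value and method names are
placeholders, not the ones from the shim6 failure-detection work): a
keepalive is only needed when the peer is sending payload to us and we
have not sent anything back within the interval, so a healthy
bidirectional TCP flow suppresses it entirely.

  import time

  KEEPALIVE_INTERVAL = 3.0   # placeholder, not a protocol default

  class FailureDetector:
      def __init__(self):
          now = time.monotonic()
          self.last_sent = now
          self.last_received = now

      def on_payload_sent(self):
          self.last_sent = time.monotonic()

      def on_payload_received(self):
          self.last_received = time.monotonic()

      def keepalive_due(self):
          # only incoming traffic, with nothing sent back for a while
          idle_outbound = time.monotonic() - self.last_sent
          return (self.last_received > self.last_sent
                  and idle_outbound > KEEPALIVE_INTERVAL)
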
Would the server have to do any sort of path optimization,
not sure what you mean by this...
and how many cycles would that consume? How do I get TE state to all
of my 100k+ hosts?
We have already exchanged some emails about this, about DHCP and so on.
But i guess the point is:
- this is not addressed today
- however, we can try to design a mechanism that addresses this need
and fulfills the requirements that you have; whether this is something
like a central server(s) that downloads the policy to the hosts, or
something like Erik's draft about letting the routers select the exit
path and rewrite the source addresses, is up for discussion.
How many cycles would all the hosts need to consume if one of my peers
bounces, and now, instead of 10-20 routers processing that, all 100k
of my hosts have to be updated with that information?
but would all 100k hosts be actively talking to the same peer? is
that an expected scenario? i would say that, in general, if a peer
bounces, then those hosts that have active communications with this
peer will need to rehome their communications
Regards, marcelo
etc...
-igor