
Re: shim6 @ NANOG (forwarded note from John Payne) (fwd)



Hi Marcelo,

	My comments are in-line... sorry for the late reply, but I've been 
traveling too much lately...

:: On 01/03/2006, at 10:10, Igor Gashinsky wrote:
:: So the effort for this case, IMHO, is put into enabling the capability of
:: establishing new sessions after an outage rather than into preserving
:: established connections. Does this make sense to you?

This makes a lot of sense, provided it happens under the hood of the 
application (i.e. the web browser in this case). Right now, for example, if 
a client is pulling down a web page, gets the HTML, and in the middle of 
downloading a .gif/.jpg its session dies (i.e. TCP RST), the image the 
client was in the middle of transferring gets that ugly red "X" (most 
browsers today will not retry the object and will just show it as 
unavailable). This issue is deemed important enough that most large content 
providers spend an inordinate amount of money on load balancers with active 
session sync to try to prevent exactly that from happening in the event of 
a load-balancer fail-over. So, if application behavior could be changed to 
say "if shim6 fail-over is possible, and the connection just died (for any 
definition of died), then attempt to re-establish the connection through 
the shim, and then re-get the failed object", that would go a long way 
toward making this kind of fail-over better.
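
To make that concrete, here's a minimal sketch (in Python, with all names 
hypothetical) of the kind of application-level retry I mean; it assumes 
that re-opening the connection gives the stack (shim6 or otherwise) its 
chance to switch to a working locator pair:

import urllib.request
from urllib.error import URLError

def fetch_object(url, retries=2, timeout=5):
    # Fetch one embedded object (e.g. the .jpg above), retrying once or
    # twice if the connection dies. Assumption: re-opening the connection
    # lets the stack pick a working locator pair, so a plain re-GET is all
    # the application has to do.
    for attempt in range(retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (URLError, ConnectionResetError, TimeoutError):
            if attempt == retries:
                raise  # give up -- this is where today's red "X" shows up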

The difference with shim6, compared to v4, is that in the v4 world the 
connection wouldn't die; it would just hang for the duration of convergence 
(provided convergence is fast enough, which it normally is), and then 
continue on its merry way with new TCP windows. In shim6, if the 
client[ip1]-server connection goes down, re-establishing as 
client[ip2]-server would not be "hitless" (i.e. the session would die), and 
to solve that problem we are back to either keeping an inordinate amount of 
state on the webservers (which is not very realistic), a shift in the way 
people write applications (which, in my opinion, is preferable, but a 
*very* hard problem to solve), or somehow figuring out how to hide this in 
the stack with a minimal performance hit (let's say a sub-1% memory hit) 
when you have 30k+ simultaneous connections per server...
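
Just as a back-of-the-envelope illustration of that last constraint (the 
RAM figure is an assumption of mine, not anyone's spec):

# How much extra per-connection state fits under a "sub-1% memory hit"?
# 8 GB of RAM per server is an assumed example; 30k connections is from above.
server_ram_bytes = 8 * 1024**3
budget_bytes = 0.01 * server_ram_bytes
connections = 30_000

print(budget_bytes / connections / 1024)  # ~2.8 KiB of shim/session state per connection

Anything like full TCP state sync (the load-balancer approach) blows well 
past a budget like that.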

:: > 3) While TE has been discussed at length already, it is something
:: > which is absolutely required for a content provider to deploy shim6. There
:: > has been quite a bit of talk about what TE is used for, but it seems that
:: > few people recognize it as a way of expressing "business/financial
:: > policies". For example, in the v4 world, the (multi-homed) end-user may be
:: > visible via both a *paid* transit path (say UUNET) and a *free* peering
:: > link (say Cogent), and I would wager that most content providers would
:: > choose the free link (even if performance on that link is (not hugely)
:: > worse). That capability all but disappears in the v6 world if the Client
:: > ID was sourced from their UUnet IP address (since that's who they chose
:: > to use for outbound traffic), and the (web) server does not know that
:: > that locator also corresponds to a Cogent IP (which they can reach for
:: > free).
:: 
:: I fail to understand the example that you are presenting here...
:: 
:: are you considering the case where the client and the server are both
:: multihomed to Cogent and UUnet?
:: something like
:: 
:: UUnet
:: /     \
:: C       S
:: \     /
:: Cogent

Yes, but now imagine that the "C" in this case is a client using shim6 with 
multiple IPs, and the server is in IPv6 PI space. (If the server weren't in 
PI space, the connection to it *could* be influenced via SRV records, 
although that is shoehorning DNS into a role it perhaps shouldn't play: the 
DNS server would now need to be aware of link state in the network to 
determine whether the UUnet/Cogent connections are even up, and for a 
sufficiently large "S" that could be 10s or even 100s of links, which 
presents a very interesting scaling problem for DNS. Even more interesting, 
most large content providers are actually in the 1000s of links, and that's 
why they can get PI space -- they are effectively (at least) a tier-2 ISP.)

But, back to the example at hand. For the sake of this example, let's say 
that the UUnet port costs $20/Mbps and the Cogent port is an SFI (free) 
peer. The client (with addresses IP-uunet and IP-cogent) picks IP-uunet to 
initiate a connection to the server, because it wants to use its UUnet 
connection outbound. The problem is that the server, when replying to the 
client, is unaware that IP-cogent is also associated with that client 
(since the shim layer has not kicked in on the initial connect), and so has 
to send traffic through the very expensive UUnet port. With v4, on the 
other hand, the router was aware that the client is reachable via both 
Cogent and UUnet, and could have had a localpref configured that simply 
says "anything reachable over Cogent, use Cogent". One way to fix this 
would be to do a shim6 init in the 3-way handshake, but the problem then 
becomes that *every* "S" would have to have a complete routing table and 
basically perform the logic that is done in today's routers. Obviously, 
running Zebra with full routes on a server is a non-trivial performance 
hit; multiply that out by the number of servers, and it gets very 
expensive, very fast. All to regain capabilities we get for free in IPv4 
today...
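
Just to spell out what "perform the logic that is done in today's routers" 
means per connection, here's a toy sketch of the locator selection each 
server would have to do locally. The prefixes, exit names and costs are all 
made up, and a real table is hundreds of thousands of prefixes, which is 
exactly the problem:

import ipaddress

# Toy stand-in for a full routing table plus local policy (values invented).
POLICY = [
    (ipaddress.ip_network("2001:db8:1::/48"), "cogent", 0),   # SFI peer, free
    (ipaddress.ip_network("2001:db8:2::/48"), "uunet", 20),   # transit, $20/Mbps
]

def pick_locator(client_locators):
    # Return the (locator, exit, cost) whose matching policy entry is
    # cheapest -- i.e. "anything reachable over Cogent, use Cogent".
    best = None
    for loc in client_locators:
        addr = ipaddress.ip_address(loc)
        for prefix, exit_name, cost in POLICY:
            if addr in prefix and (best is None or cost < best[2]):
                best = (loc, exit_name, cost)
    return best

# pick_locator(["2001:db8:2::10", "2001:db8:1::10"]) -> the Cogent-side locator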

Now, of course, the "so-called easy" answer would be "let's introduce a 
routing-policy middleware box that handles that part". That box would have 
the full routing tables and the site policies, and when queried with "I'm 
server X, and this is the client and all his locators; which one do I use?" 
it would spit back a fully informed decision to that server, and the TE 
problem becomes mostly solved. I say "mostly", because now there are these 
pesky issues of: a) do I trust that the server is going to abide by this 
decision (either because it's been hacked, or because it's a box outside my 
administrative control, yet within the scope of my network control); b) how 
do transit ISPs "influence" that decision (at some point I cross their 
network, and they should be able to control how the packets flow through 
it); c) how do I verify that their "influencing" doesn't negate mine, and 
is legitimate; d) how much "lag" does it introduce into every session 
establishment, and is that acceptable; e) can this proxy scale to the 
number of queries fired at it, and the real-time computations that would 
have to happen on each one (since we can't precompute the answers); and 
finally, f) is it *really* more cost-effective than doing all this in 
routers?
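
For what it's worth, the query/response such a box would have to answer 
might look something like this (entirely hypothetical names, port and 
schema -- nothing here is part of shim6):

import json, socket

def ask_policy_box(server_id, client_locators, box=("policybox.example.net", 4790)):
    # One extra round trip per connection setup -- that's issue d) above --
    # and one more query per connection for the box to absorb -- issue e).
    query = json.dumps({"server": server_id, "locators": client_locators})
    with socket.create_connection(box, timeout=0.2) as s:
        s.sendall(query.encode() + b"\n")
        answer = s.makefile().readline()
    return json.loads(answer)["preferred_locator"]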

So far, I'd rather pay for bigger routers...

:: I mean, in this case the selection of the server's provider is determined by
:: the server's address, not by the client's address, right?
:: The server can influence such a decision using SRV records in the DNS, but I'm
:: not sure yet if this is the case you are considering

See above about difficulties of scaling DNS to meet this goal...

:: > This change alone would add millions to the bandwidth bills of said
:: > content providers, and, well, reduce the likelihood of adoption of the
:: > protocol by them. Now, if the shim6 init takes place in the 3-way
:: > handshake process, then the servers "somewhat" know what all possible
:: > paths to reach that locator are, but then would need some sort of a
:: > policy server telling them who to talk to on what IP, and that's something
:: > which will not simply scale for 100K+ machines.
:: > 
:: 
:: I am not sure I understand the scaling problem here.
:: Suppose that you are using a DHCP option for distributing the SHIM6
:: preferences of the RFC3484 policy table; are you saying that DHCP does not
:: scale for 100K+ machines? Or is there something else other than DHCP that

Well, first, show me a content provider who thinks that DHCP scales for a 
datacenter (other than for the initial pxeboot/kickstart/jumpstart, 
whatever), but that aside, running zebra/quagga plus synchronizing policy 
updates among 100K+ machines simply does not scale (operationally).

:: > 4) As has also been discussed before, the initial connect time has to be
:: > *very* low. Anything that takes longer than 4-5 seconds, the end-users have
:: > a funny way of clicking "stop" in their browser, deeming that "X is down,
:: > let me try Y", which is usually not a very acceptable scenario :-) So,
:: > whatever methodology we use to do the initial set-up has to account for
:: > that, and be able to get a connection that is actually starting to do
:: > something in under 2 seconds, along with figuring out which sourceIP and
:: > destIP pairs actually can talk to each other.
:: 
:: As I mentioned above, we are working on mechanisms other than the shim6
:: protocol itself that can be used for establishing new communications through
:: outages.
:: 
:: you can find some work in this area in
:: 
:: ftp://ftp.rfc-editor.org/in-notes/internet-drafts/draft-bagnulo-ipv6-
:: rfc3484-update-00.txt

It's a fairly good approach to negotiating which SRC and DEST IPs to pick, 
but it has to happen *fast* (i.e. sub-2 seconds), or the end-users will 
lose patience and declare the site dead. Perhaps racing SYNs?
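
Something along these lines (happy-eyeballs style, purely illustrative -- 
in practice this would live below the application, and the losing attempts 
would need to be cleaned up properly):

import socket
from concurrent.futures import ThreadPoolExecutor, as_completed

def race_syns(pairs, port=80, timeout=2.0):
    # pairs: list of (source_ip, dest_ip) locator combinations.
    # Fire all the connection attempts in parallel and keep the first one
    # that completes; the per-attempt timeout enforces the sub-2-second budget.
    def attempt(src, dst):
        return socket.create_connection((dst, port), timeout=timeout,
                                        source_address=(src, 0))
    with ThreadPoolExecutor(max_workers=len(pairs)) as pool:
        futures = {pool.submit(attempt, s, d): (s, d) for s, d in pairs}
        for fut in as_completed(futures):
            try:
                sock = fut.result()
            except OSError:
                continue  # that src/dst pair didn't work; wait for the rest
            return futures[fut], sock  # winning (src, dst) pair and its socket
    raise OSError("no src/dst pair connected in time")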

Now, I'm not saying that these problems can't be solved well enough for 
people to consider shim6 a viable solution, but so far they aren't solved, 
and until they are, I just don't see myself recommending that my employer 
take shim6 seriously, since it seems like all it's going to do is move the 
costs elsewhere, and quite possibly increase them quite a bit in the 
process...

-igor