Re: shim6 @ NANOG (forwarded note from John Payne) (fwd)
On 08/03/2006, at 10:28, Igor Gashinsky wrote:
Hi Marcelo,
My comments are in-line... sorry for the late reply, but I've been
traveling too much lately...
:: On 01/03/2006, at 10:10, Igor Gashinsky wrote:
:: So the effort for this case imho is put into enabling the
:: establishment of new sessions after an outage rather than into
:: preserving established connections, do you think this makes any
:: sense to you
This makes a lot of sense, provided this happens under the hood of the
application (ie the web browser in this case). So, right now, for
example, if a client is pulling down a web page, gets the html, and in
the middle of downloading the .gif/.jpg his session dies (ie TCP RST),
the jpg that the client was in the middle of transferring will get that
ugly red "X" (most browsers, right now, will not retry to get the
object again, and will just show it as unavailable). This issue is
deemed important enough that most large content providers are spending
an inordinate amount of money on load balancers with active session
sync to try to prevent that from happening in the event of a load
balancer fail-over. So, if application behavior could be changed to say
"if shim6 fail-over is possible, and the connection just died (for any
definition of died), then attempt to re-establish the connection
through the shim, and then re-get the failed object", that would go a
long way in making this kind of fail-over better.
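
A minimal sketch of that desired behavior, assuming the application (or
a library under it) simply knows a list of alternative addresses for
the server; the names and addresses are illustrative, not any existing
browser or shim6 API:

import socket

def fetch_object(host, path, locators, timeout=2.0):
    """Retry a failed object over alternative destination addresses
    instead of giving up after the first dead connection."""
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode()
    last_error = None
    for addr in locators:          # e.g. ["2001:db8:a::1", "2001:db8:b::1"]
        try:
            with socket.create_connection((addr, 80), timeout=timeout) as s:
                s.sendall(request)
                chunks = []
                while data := s.recv(4096):
                    chunks.append(data)
                return b"".join(chunks)   # got the object; no red "X"
        except OSError as err:            # RST, timeout, unreachable...
            last_error = err              # remember it, try the next locator
    raise last_error if last_error else OSError("no locators to try")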
This is possible with the shim6 protocol, since it supports unreachable
ULIDs when establishing the shim context, so I guess this would be OK.
Probably a couple of elements are needed, like an extended API to allow
the apps to tell this to the shim (you probably also want to inform the
shim which locator is not working), and the shim needs to remember the
alternative locators obtained from the DNS even if there is no shim
context yet, in order to have a clue about which alternative address to
use (the other option is to perform a reverse lookup for retrieving
those... see the thread with Erik for more about this point). But in
any case, I think all these issues are easily solvable.
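
To make the shape of such an extended API concrete, here is a purely
hypothetical sketch; none of these calls exist in any stack, they only
illustrate the app-to-shim conversation described above:

class Shim6Stub:
    """Hypothetical app-facing shim endpoint; every method name here is
    invented for illustration."""

    def __init__(self):
        # Alternative locators learned from the DNS, remembered even
        # though no shim context exists yet (one of the elements listed
        # above).
        self.dns_locators = {}   # ulid -> list of locators

    def remember_dns_answer(self, ulid, locators):
        self.dns_locators[ulid] = locators

    def report_failed_locator(self, ulid, failed):
        """The app tells the shim which locator is not working and asks
        it to establish a context on an alternative one."""
        alternatives = [l for l in self.dns_locators.get(ulid, [])
                        if l != failed]
        if not alternatives:
            raise LookupError(f"no alternative locators known for {ulid}")
        # Possible because shim6 supports establishing the context even
        # when the ULID itself is currently unreachable.
        return self.establish_context(ulid, alternatives[0])

    def establish_context(self, ulid, locator):
        print(f"shim context: ULID {ulid} -> locator {locator}")
        return locator

shim = Shim6Stub()
shim.remember_dns_answer("2001:db8:1::svc",
                         ["2001:db8:1::svc", "2001:db8:2::svc"])
shim.report_failed_locator("2001:db8:1::svc", failed="2001:db8:1::svc")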
But I have an additional question about this point: if the application
is the one that will determine that there is a problem and will ask the
shim to establish a context (which is OK, no problem here), wouldn't
the application be better off simply retrying with alternative locators
by itself, rather than asking the shim to do it?
The difference with shim6, as opposed to v4, is that in the v4 world
the connection wouldn't die, it would just hang for the duration of
convergence (provided convergence is fast enough, which normally it
is), and then continue on its merry way with new TCP windows. In shim6,
if the client[ip1]-server connection goes down, re-establishing to
client[ip2]-server would not be "hitless" (ie the session would die),
and to solve that problem we are back at either keeping an inordinate
amount of state on the webservers (which is not very realistic), a
shift in the way people write applications (which, in my opinion, is
preferred, but a *very* hard problem to solve), or somehow figuring out
how to hide this in the stack with a minimal performance hit (let's say
a sub-1% memory hit) when you have 30k+ simultaneous connections per
server...
Well, if you use the shim approach that you suggest above, the server
does not have to store any shim state while things are going fine, and
if a client detects a problem it can trigger the creation of the shim
context from the client to the server. At this point, the server will
need some shim state, but only for those connections that have failed
(of course, if one of the links to the server went down, then all the
clients connecting through that link will attempt to create shim
state).
I guess that this could be a reasonable trade-off between state in the
server and response time when outages occur.
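
A small sketch of this trade-off (invented structures, not the real
shim6 state machine): the server holds no per-connection shim state
until a client that detected an outage initiates context establishment.

server_shim_contexts = {}   # (client_ulid, server_ulid) -> context

def on_client_context_request(client_ulid, server_ulid, client_locators):
    """Runs only for connections that actually failed; healthy
    connections never cost the server any shim state."""
    context = {"peer_locators": client_locators,
               "current_peer_locator": client_locators[0]}
    server_shim_contexts[(client_ulid, server_ulid)] = context
    return context

# A link toward the server dies: only the affected clients trigger
# this, so state grows with the number of failed connections, not with
# the 30k+ total connections.
on_client_context_request("2001:db8:5::1", "2001:db8:c::80",
                          ["2001:db8:6::1"])
print(len(server_shim_contexts))   # 1 context, not one per connection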
:: > 3) While TE has been discussed at length already, it is something
:: > which is absolutely required for a content provider to deploy
:: > shim6. There has been quite a bit of talk about what TE is used
:: > for, but it seems that few people recognize it as a way of
:: > expressing "business/financial policies". For example, in the v4
:: > world, the (multi-homed) end-user may be visible via both a *paid*
:: > transit path (say UUNET), and a *free* peering link (say Cogent),
:: > and I would wager that most content providers would choose the
:: > free link (even if performance on that link is (not hugely)
:: > worse). That capability all but disappears in the v6 world if the
:: > client ID was sourced from their UUNET IP address (since that's
:: > who they chose to use for outbound traffic), and the (web) server
:: > does not know that that locator also corresponds to a Cogent IP
:: > (which they can reach for free).
::
:: I fail to understand the example that you are presenting here...
::
:: are you considering the case where the client and the server are
:: both multihomed to Cogent and UUNET?
:: something like
::
::        UUNET
::        /   \
::       C     S
::        \   /
::        Cogent
Yes, but now imagine that the "C" in this case is a client using shim6
with multiple IPs, and the server is in IPv6 PI space. Also, if it
wasn't in PI space, the connection to the server *can* be influenced
via SRV (although that's trying to shoehorn DNS into where perhaps it
shouldn't go -- since now the DNS server needs to be aware of
link-state in the network to determine if the UUNET/Cogent connections
are even up, and for a sufficiently large "S", that could be 10s, or
even 100s, of links, which presents a very interesting scaling problem
for DNS... even more interesting is that most large content providers
are actually in the 1000s, and that's why they can get PI space -- they
are effectively (at least) a tier-2 ISP). But, back to the example at
hand... so, for the sake of this example, let's say that the UUNET port
is $20/Mbps, and the Cogent port is a settlement-free (SFI) peer. So,
the client (with IPs of IP-uunet and IP-cogent) picks IP-uunet (because
they want to use their UUNET connection outbound) to initiate a
connection to the server; the problem now comes from the fact that the
server, when replying to the client, is unaware that IP-cogent is
associated with the client (since the shim layer has not kicked in on
the initial connect) and will have to send traffic through the very
expensive UUNET port.
That I don't follow.
Suppose that the server has v6 PI addresses, which for very big sites
makes sense imho.
The server can send traffic with a destination address belonging to
UUNET through Cogent, right? I mean, I am assuming that UUNET and
Cogent have connectivity that is not through S.
I mean, the client can choose to use the IP from UUNET (that is his
choice and he has the right to do so, because he is paying for it).
This choice affects the ISP used to get _to_ the client, and it
shouldn't determine the ISP used to get to the server.
So in this case the traffic would flow:
From the client to the Internet through UUNET
From the Internet to the server through Cogent
agree?
Now the problem is when the server also has PA blocks.
In this case, the destination address selected by the client will
determine the ISP of the server.
Without the shim, the server doesn't have many options; basically what
he could do is use the DNS to prioritize the Cogent addresses.
With the shim, the server can rehome any communication that is using
UUNET addresses to Cogent and start using Cogent locators. This of
course does not prevent the client from keeping on using the UUNET
destination addresses. In this case, the server can inform the client
about his preferences using a shim protocol option, but even in this
case the client can prefer other than what is expressed by S in the
preferences. In any case, in this model, each end can always choose the
path used to send packets. I guess that in IPv4 it is somewhat
different, because the decision belongs to the intermediate ASes, which
are the ones that can select which path to use (note that in this case,
it is not S who is in charge of selecting the incoming path either).
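
As a rough illustration of this rehoming, here is a sketch assuming a
simplified stand-in for the shim6 locator preferences exchange; the
structures and the prefixes (documentation addresses) are invented, not
the real wire format:

COGENT_PREFIX = "2001:db8:cccc:"   # assumed Cogent-assigned PA prefix
UUNET_PREFIX = "2001:db8:aaaa:"    # assumed UUNET-assigned PA prefix

def send_locator_preferences(peer, prefs):
    # Stand-in for sending a shim6 update carrying a preferences option.
    print(f"UPDATE to {peer}: locator preferences {prefs}")

def rehome_context(context):
    """Switch the server's source locator to Cogent and rank the Cogent
    locator highest for the peer; the client may still ignore this."""
    cogent = [l for l in context["local_locators"]
              if l.startswith(COGENT_PREFIX)]
    if not cogent:
        return
    context["src_locator"] = cogent[0]         # outgoing now via Cogent
    prefs = {loc: (1 if loc in cogent else 2)  # 1 = most preferred
             for loc in context["local_locators"]}
    send_locator_preferences(context["peer"], prefs)

# Example: a context whose traffic currently uses the UUNET locator
ctx = {"peer": "client-1",
       "local_locators": ["2001:db8:aaaa::80", "2001:db8:cccc::80"],
       "src_locator": "2001:db8:aaaa::80"}
rehome_context(ctx)
print(ctx["src_locator"])   # -> 2001:db8:cccc::80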
With v4, on the other hand, the router was aware that the client is
reachable via both Cogent and UUNET, and could have had a localpref
configured that would just say "anything reachable over Cogent, use
Cogent". One way to fix that would be to do a shim6 init in the 3-way
handshake, but the problem then becomes that *every* "S" would have to
have a complete routing table and, basically, perform the logic that is
done in today's routers.
Why is that?
I mean, if S prefers Cogent, all he has to do is:
- In the PI case, route its outgoing packets through Cogent and do the
same v4 BGP magic to direct incoming packets through Cogent
- In the PA case, always use Cogent addresses and try to convince the
clients to use the server's IP address from the Cogent prefix (through
SRV and/or shim preferences; see the policy-table sketch below)
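
For the PA case, here is a minimal sketch of an RFC 3484-style policy
table that could steer address selection toward the Cogent prefix; the
prefixes are documentation examples and the table is simplified to
precedence only:

import ipaddress

POLICY_TABLE = [  # (prefix, precedence) - higher precedence wins
    (ipaddress.ip_network("2001:db8:cccc::/48"), 50),  # Cogent PA prefix
    (ipaddress.ip_network("2001:db8:aaaa::/48"), 40),  # UUNET PA prefix
    (ipaddress.ip_network("::/0"), 10),                # default
]

def precedence(addr):
    addr = ipaddress.ip_address(addr)
    # take the most-specific matching row, as in RFC 3484 longest match
    rows = [(net, prec) for net, prec in POLICY_TABLE if addr in net]
    return max(rows, key=lambda r: r[0].prefixlen)[1]

def pick_address(candidates):
    return max(candidates, key=precedence)

print(pick_address(["2001:db8:aaaa::80", "2001:db8:cccc::80"]))
# -> 2001:db8:cccc::80 (the Cogent address)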
Obviously, running Zebra with full routes on a server is a non-trivial
performance hit; multiply that out by the number of servers, and it
gets very expensive, very fast. All to regain capabilities we have
right now in IPv4 for free...
Now, of course, the "so-called easy" answer would be "let's introduce a
routing-policy middleware box that would handle that part". That box
would have the full routing tables and the site policies, and when
queried with "I'm server X, and this is the client and all his
locators, which one do I use?" it would spit back an answer to that
server that would be a fully informed decision, and the TE problem
becomes mostly solved. I say
But there seem to be two different problems here (at least :-)
- one: what TE capabilities are available with the PA addressing model
+ the shim tool, i.e. what can be done in this case.
- two: who is in control of these capabilities and how they are
managed, i.e. who controls the policy and who manages the devices that
are in control of the policy. Is it possible to have centralized policy
management? Is it possible to enforce the usage of the policy (at least
within the multihomed site)?
I guess that before we were considering the first problem and now the
second one...
This server idea that you are considering was presented by Cedric de
Launois in a work called NAROS a while ago (a rough sketch of such an
exchange follows below).
Another option is what we are discussing below about using a DHCP/RAdv
option to distribute the policy information among the hosts.
Another option is to move to a scheme based on rewriting source
prefixes.
Or a combination of those.
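
Purely as illustration, a rough sketch of what a NAROS-style
query/response could look like; the message format, names, and cost
model are invented assumptions, not the actual NAROS protocol:

import json

def naros_decide(request, routing_policy):
    """Policy-server side: pick the cheapest usable client locator."""
    def cost(loc):
        # routing_policy maps a prefix to a cost; lower is better
        for prefix, c in routing_policy.items():
            if loc.startswith(prefix):
                return c
        return 100   # unknown prefix: assume expensive
    best = min(request["client_locators"], key=cost)
    return {"server": request["server"], "use_locator": best}

policy = {"2001:db8:cccc:": 0,    # Cogent: settlement-free peering
          "2001:db8:aaaa:": 20}   # UUNET: $20/Mbps transit

query = {"server": "www1",
         "client_locators": ["2001:db8:aaaa::5", "2001:db8:cccc::5"]}
print(json.dumps(naros_decide(query, policy)))
# -> picks the Cogent locator, keeping reply traffic off paid transit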
"mostly", because now there are these pesky issues of a) do I trust
that
the server is going to obey by this decision (either hacked, or is a
box
outside of my administrative control, yet is within the scope of my
network control); b) how do transit ISP's "influence" that decision (at
some point I cross their network, and they should be able to control
how
the packets are flowing through their network; c) how do I verify that
their "influencing" doesn't negate mine, and is legitimate; d) how much
"lag" does it introduce into every session establishment, and is it
acceptable; d) can this proxy scale to the number of queries fired at
it,
and the real-time computations that would have to happen on each one
(since we can't precompute the answers); and finally is it *really*
more
cost-effective then doing all this in routers.
So far, I'd rather pay for bigger routers...
:: I mean in this case, the selection of the server's provider is
:: determined by the server's address, not by the client's address,
:: right?
:: The server can influence such a decision using SRV records in the
:: DNS, but I'm not sure yet if this is the case you are considering
See above about difficulties of scaling DNS to meet this goal...
But the problem with the DNS that you have considered above is about
making the DNS publish information that reflects the state of the
links.
This seems indeed very difficult, especially because of cached
information and so on. But as far as I know, no one is proposing this.
The idea is to use SRV records to express policy, and a not very
dynamic one. I mean, you can express that, say, 30% of the
communications need to use a given address and the others the other
address and so on, but the idea is not to have the DNS reflect the
state of the network.
Actually, it may happen that some of the addresses in the DNS are down.
In this case, the idea is to let the hosts detect this and retry using
alternative addresses. Whether this retry is visible or not to the apps
is still an open issue.
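
For example, the rough 30%/70% split mentioned above maps naturally
onto SRV weights (RFC 2782). A minimal sketch of the weighted selection
a client performs over the returned records (hostnames invented):

import random

srv_records = [  # (priority, weight, target)
    (10, 30, "www-a.example.com"),   # ~30% of new communications
    (10, 70, "www-b.example.com"),   # ~70% of new communications
]

def pick_srv_target(records):
    best_priority = min(r[0] for r in records)
    candidates = [r for r in records if r[0] == best_priority]
    total = sum(w for _, w, _ in candidates)
    point = random.uniform(0, total)     # weight-proportional choice
    for _, weight, target in candidates:
        point -= weight
        if point <= 0:
            return target
    return candidates[-1][2]

counts = {"www-a.example.com": 0, "www-b.example.com": 0}
for _ in range(10000):
    counts[pick_srv_target(srv_records)] += 1
print(counts)   # roughly 3000 / 7000: policy, not link state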
:: > This change alone would add millions to the bw bills of said
:: > content providers, and, well, reduce the likelihood of adoption of
:: > the protocol by them. Now, if the shim6 init takes place in the
:: > 3-way handshake process, then the servers "somewhat" know what all
:: > the possible paths to reach that locator are, but then they would
:: > need some sort of a policy server telling them who to talk to on
:: > what IP, and that's something which will simply not scale for
:: > 100K+ machines.
:: >
::
:: I am not sure I understand the scaling problem here
:: Suppose that you are using a DHCP option for distributing the shim6
:: preferences of the RFC 3484 policy table; are you saying that DHCP
:: does not scale for 100K+ machines? or is there something else other
:: than DHCP that
Well, first, show me a content provider who thinks that DHCP scales for
a datacenter (other than initial pxeboot/kickstart/jumpstart,
whatever), but that aside, running zebra/quagga + synchronizing policy
updates among 100K+ machines simply does not scale (operationally).
So, you are considering here the case where the policy is changed
according to the state of the network, right?
So BGP information is used as feedback to the TE decision, is that
correct?
Is this possible today? How is it done? Could you provide an example of
how you use this dynamic TE setting?
:: > 4) As has also been discussed before, the initial connect time has
:: > to be *very* low. Anything that takes longer than 4-5 seconds, and
:: > the end-users have a funny way of clicking "stop" in their
:: > browser, deeming that "X is down, let me try Y", which is usually
:: > not a very acceptable scenario :-) So, whatever methodology we use
:: > to do the initial set-up has to account for that, and be able to
:: > get a connection that is actually starting to do something in
:: > under 2 seconds, along with figuring out which source-IP and
:: > dest-IP pairs actually can talk to each other.
::
:: As I mentioned above, we are working on mechanisms other than the
:: shim6 protocol itself that can be used for establishing new
:: communications through outages.
::
:: you can find some work in this area in
::
:: ftp://ftp.rfc-editor.org/in-notes/internet-drafts/draft-bagnulo-ipv6-rfc3484-update-00.txt
It's a fairly good idea for negotiating which SRC and DEST IPs to pick,
but it has to happen *fast* (ie sub-2 seconds), or the end-users will
lose patience and declare the site dead. Perhaps racing SYNs?
Yes, this is an option, and it is nice because you actually get not
only to detect which ones are actually working but also to pick the
fastest one. But clearly there is the cost of the additional SYNs you
send, which is basically overhead... would you be willing to pay for
these multiple SYNs?
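
A minimal sketch of what racing SYNs could look like on the client
side, assuming the host can bind each attempt to a different source
locator; the address pairs are illustrative:

import socket
from concurrent.futures import ThreadPoolExecutor, as_completed

def try_pair(src, dst, port=80, timeout=2.0):
    s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    s.settimeout(timeout)
    s.bind((src, 0))          # pin the source locator for this attempt
    s.connect((dst, port))    # one racing SYN (plus the handshake)
    return s

def race_syns(pairs):
    """Open all (src, dst) pairs in parallel; keep the first handshake
    that completes and close the redundant ones (the overhead being the
    extra SYNs discussed above)."""
    with ThreadPoolExecutor(max_workers=len(pairs)) as pool:
        futures = {pool.submit(try_pair, s, d): (s, d) for s, d in pairs}
        winner = None
        for fut in as_completed(futures):
            try:
                sock = fut.result()
            except OSError:
                continue          # that address pair is down or slow
            if winner is None:
                winner = sock     # fastest working path wins
            else:
                sock.close()      # loser connections get torn down
        return winner

# e.g. race_syns([("2001:db8:aaaa::5", "2001:db8:cccc::80"),
#                 ("2001:db8:cccc::5", "2001:db8:aaaa::80")])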
Now, I'm not saying that all these problems can't be solved for people
to consider shim6 a viable solution, but so far they aren't solved, and
until they are, I just don't see recommending to my employer that we
take shim6 seriously,
I may well agree with you here, but remember that we are still defining
the protocol :-)
I guess the point here is how we can manage to provide a solution that
fits the site's requirements, hence your feedback is very valuable.
Regards, marcelo
since it seems like all it's going to do is move the costs elsewhere,
and quite possibly increase them quite a bit in the process...
-igor