Re: shim6 @ NANOG (forwarded note from John Payne) (fwd)
On 08/03/2006, at 10:28, Igor Gashinsky wrote:
Hi Marcelo,
My comments are in-line... sorry for the late reply, but I've been
traveling too much lately...
:: On 01/03/2006, at 10:10, Igor Gashinsky wrote:
:: So the effort for this case imho is put into enabling the
:: establishment of new sessions after an outage rather than into
:: preserving established connections, do you think this makes any
:: sense to you
This makes a lot of sense, provided this happens under the hood of the
application (ie the web browser in this case). So, right now, for
example, if a client is pulling down a web page, gets the html, and in
the middle of downloading the .gif/.jpg his session dies (ie TCP RST),
the jpg that the client was in the middle of transferring will get that
ugly red "X" (most browsers, right now, will not retry to get the
object again, and will just show it as unavailable). This issue is
deemed important enough that most large content providers are spending
an inordinate amount of money on load balancers with active session
sync to try to prevent that from happening in the event of a load
balancer fail-over. So, if application behavior could be changed to say
"if shim6 fail-over is possible, and the connection just died (for any
definition of died), then attempt to re-establish the connection
through the shim, and then re-get the failed object", that would go a
long way in making this kind of fail-over better.
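
A minimal sketch of that desired behavior, assuming the application (or
a library under it) simply knows a list of alternative addresses for
the server; the names and addresses are illustrative, not any existing
browser or shim6 API:

import socket

def fetch_object(host, path, locators, timeout=2.0):
    """Retry a failed object over alternative destination addresses
    instead of giving up after the first dead connection."""
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode()
    last_error = None
    for addr in locators:          # e.g. ["2001:db8:a::1", "2001:db8:b::1"]
        try:
            with socket.create_connection((addr, 80), timeout=timeout) as s:
                s.sendall(request)
                chunks = []
                while data := s.recv(4096):
                    chunks.append(data)
                return b"".join(chunks)   # got the object; no red "X"
        except OSError as err:            # RST, timeout, unreachable...
            last_error = err              # remember it, try the next locator
    raise last_error if last_error else OSError("no locators to try")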
This is possible with the shim6 protocol, since it supports unreachable
ULIDs when establishing the shim context, so I guess this would be OK.
Probably a couple of elements are needed, like an extended API to allow
the apps to tell this to the shim (you probably also want to inform the
shim which locator is not working), and the shim needs to remember the
alternative locators obtained from the DNS even if there is no shim
context yet, in order to have a clue about which alternative address to
use (the other option is to perform a reverse lookup for retrieving
those... see the thread with Erik for more about this point). But in
any case, I think all these issues are easily solvable.
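
To make the shape of such an extended API concrete, here is a purely
hypothetical sketch; none of these calls exist in any stack, they only
illustrate the app-to-shim conversation described above:

class Shim6Stub:
    """Hypothetical app-facing shim endpoint; every method name here is
    invented for illustration."""

    def __init__(self):
        # Alternative locators learned from the DNS, remembered even
        # though no shim context exists yet (one of the elements listed
        # above).
        self.dns_locators = {}   # ulid -> list of locators

    def remember_dns_answer(self, ulid, locators):
        self.dns_locators[ulid] = locators

    def report_failed_locator(self, ulid, failed):
        """The app tells the shim which locator is not working and asks
        it to establish a context on an alternative one."""
        alternatives = [l for l in self.dns_locators.get(ulid, [])
                        if l != failed]
        if not alternatives:
            raise LookupError(f"no alternative locators known for {ulid}")
        # Possible because shim6 supports establishing the context even
        # when the ULID itself is currently unreachable.
        return self.establish_context(ulid, alternatives[0])

    def establish_context(self, ulid, locator):
        print(f"shim context: ULID {ulid} -> locator {locator}")
        return locator

shim = Shim6Stub()
shim.remember_dns_answer("2001:db8:1::svc",
                         ["2001:db8:1::svc", "2001:db8:2::svc"])
shim.report_failed_locator("2001:db8:1::svc", failed="2001:db8:1::svc")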
But I have an additional question about this point: if the application
is the one that will determine that there is a problem and will ask the
shim to establish a context (which is OK, no problem here), wouldn't
the application be better off simply retrying with alternative locators
by itself, rather than asking the shim to do it?
The difference with shim6, as opposed to v4, is that in the v4 world
the connection wouldn't die, it would just hang for the duration of
convergence (provided convergence is fast enough, which normally it
is), and then continue on its merry way with new TCP windows. In shim6,
if the client[ip1]-server connection goes down, re-establishing to
client[ip2]-server would not be "hitless" (ie the session would die),
and to solve that problem we are back at either keeping an inordinate
amount of state on the webservers (which is not very realistic), a
shift in the way people write applications (which, in my opinion, is
preferred, but a *very* hard problem to solve), or somehow figuring out
how to hide this in the stack with a minimal performance hit (let's say
a sub-1% memory hit) when you have 30k+ simultaneous connections per
server...
Well, if you use the shim approach that you suggest above, the server
does not have to store any shim state while things are going fine, and
if a client detects a problem it can trigger the creation of the shim
context from the client to the server. At this point, the server will
need some shim state, but only for those connections that have failed
(of course, if one of the links to the server went down, then all the
clients connecting through that link will attempt to create shim
state).
I guess that this could be a reasonable trade-off between state in the
server and response time when outages occur.
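
A small sketch of this trade-off (invented structures, not the real
shim6 state machine): the server holds no per-connection shim state
until a client that detected an outage initiates context establishment.

server_shim_contexts = {}   # (client_ulid, server_ulid) -> context

def on_client_context_request(client_ulid, server_ulid, client_locators):
    """Runs only for connections that actually failed; healthy
    connections never cost the server any shim state."""
    context = {"peer_locators": client_locators,
               "current_peer_locator": client_locators[0]}
    server_shim_contexts[(client_ulid, server_ulid)] = context
    return context

# A link toward the server dies: only the affected clients trigger
# this, so state grows with the number of failed connections, not with
# the 30k+ total connections.
on_client_context_request("2001:db8:5::1", "2001:db8:c::80",
                          ["2001:db8:6::1"])
print(len(server_shim_contexts))   # 1 context, not one per connection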
:: > 3) While TE has been discussed at length already, it is something
:: > which is absolutely required for a content provider to deploy
:: > shim6. There has been quite a bit of talk about what TE is used
:: > for, but it seems that few people recognize it as a way of
:: > expressing "business/financial policies". For example, in the v4
:: > world, the (multi-homed) end-user may be visible via both a *paid*
:: > transit path (say UUNET), and a *free* peering link (say Cogent),
:: > and I would wager that most content providers would choose the
:: > free link (even if performance on that link is (not hugely)
:: > worse). That capability all but disappears in the v6 world if the
:: > client ID was sourced from their UUNET IP address (since that's
:: > who they chose to use for outbound traffic), and the (web) server
:: > does not know that that locator also corresponds to a Cogent IP
:: > (which they can reach for free).
::
:: I fail to understand the example that you are presenting here...
::
:: are you considering the case where the client and the server are
:: both multihomed to Cogent and UUNET?
:: something like
::
::        UUNET
::        /   \
::       C     S
::        \   /
::        Cogent
Yes, but now imagine that the "C" in this case is a client using shim6
with multiple IPs, and the server is in IPv6 PI space. Also, if it
wasn't in PI space, the connection to the server *can* be influenced
via SRV (although that's trying to shoehorn DNS into where perhaps it
shouldn't go -- since now the DNS server needs to be aware of
link-state in the network to determine if the UUNET/Cogent connections
are even up, and for a sufficiently large "S", that could be 10s, or
even 100s, of links, which presents a very interesting scaling problem
for DNS... even more interesting is that most large content providers
are actually in the 1000s, and that's why they can get PI space -- they
are effectively (at least) a tier-2 ISP). But, back to the example at
hand... so, for the sake of this example, let's say that the UUNET port
is $20/Mbps, and the Cogent port is a settlement-free (SFI) peer. So,
the client (with IPs of IP-uunet and IP-cogent) picks IP-uunet (because
they want to use their UUNET connection outbound) to initiate a
connection to the server; the problem now comes from the fact that the
server, when replying to the client, is unaware that IP-cogent is
associated with the client (since the shim layer has not kicked in on
the initial connect) and will have to send traffic through the very
expensive UUNET port.
That I don't follow.
Suppose that the server has v6 PI addresses, which for very big sites
makes sense imho.
The server can send traffic with a destination address belonging to
UUNET through Cogent, right? I mean, I am assuming that UUNET and
Cogent have connectivity that is not through S.
I mean, the client can choose to use the IP from UUNET (that is his
choice and he has the right to do so, because he is paying for it).
This choice affects the ISP used to get _to_ the client, and it
shouldn't determine the ISP used to get to the server.
So in this case the traffic would flow:
From the client to the Internet through UUNET
From the Internet to the server through Cogent
agree?
Now the problem is when the server also has PA blocks.
In this case, the destination address selected by the client will
determine the ISP of the server.
Without the shim, the server doesn't have many options; basically what
he could do is use the DNS to prioritize the Cogent addresses.
With the shim, the server can rehome any communication that is using
UUNET addresses to Cogent and start using Cogent locators. This of
course does not prevent the client from keeping on using the UUNET
destination addresses. In this case, the server can inform the client
about his preferences using a shim protocol option, but even in this
case the client can prefer other than what is expressed by S in the
preferences. In any case, in this model, each end can always choose the
path used to send packets. I guess that in IPv4 it is somewhat
different, because the decision belongs to the intermediate ASes, which
are the ones that can select which path to use (note that in this case,
it is not S who is in charge of selecting the incoming path either).
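
As a rough illustration of this rehoming, here is a sketch assuming a
simplified stand-in for the shim6 locator preferences exchange; the
structures and the prefixes (documentation addresses) are invented, not
the real wire format:

COGENT_PREFIX = "2001:db8:cccc:"   # assumed Cogent-assigned PA prefix
UUNET_PREFIX = "2001:db8:aaaa:"    # assumed UUNET-assigned PA prefix

def send_locator_preferences(peer, prefs):
    # Stand-in for sending a shim6 update carrying a preferences option.
    print(f"UPDATE to {peer}: locator preferences {prefs}")

def rehome_context(context):
    """Switch the server's source locator to Cogent and rank the Cogent
    locator highest for the peer; the client may still ignore this."""
    cogent = [l for l in context["local_locators"]
              if l.startswith(COGENT_PREFIX)]
    if not cogent:
        return
    context["src_locator"] = cogent[0]         # outgoing now via Cogent
    prefs = {loc: (1 if loc in cogent else 2)  # 1 = most preferred
             for loc in context["local_locators"]}
    send_locator_preferences(context["peer"], prefs)

# Example: a context whose traffic currently uses the UUNET locator
ctx = {"peer": "client-1",
       "local_locators": ["2001:db8:aaaa::80", "2001:db8:cccc::80"],
       "src_locator": "2001:db8:aaaa::80"}
rehome_context(ctx)
print(ctx["src_locator"])   # -> 2001:db8:cccc::80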
With v4, on the other hand, the router was aware that the client is
reachable via both Cogent and UUNET, and could have had a localpref
configured that would just say "anything reachable over Cogent, use
Cogent". One way to fix that would be to do a shim6 init in the 3-way
handshake, but the problem then becomes that *every* "S" would have to
have a complete routing table and, basically, perform the logic that is
done in today's routers.
Why is that?
I mean, if S prefers Cogent, all he has to do is:
- In the PI case, route its outgoing packets through Cogent and do the
same v4 BGP magic to direct incoming packets through Cogent
- In the PA case, always use Cogent addresses and try to convince the
clients to use the server's IP address from the Cogent prefix (through
SRV and/or shim preferences; see the policy-table sketch below)
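
For the PA case, here is a minimal sketch of an RFC 3484-style policy
table that could steer address selection toward the Cogent prefix; the
prefixes are documentation examples and the table is simplified to
precedence only:

import ipaddress

POLICY_TABLE = [  # (prefix, precedence) - higher precedence wins
    (ipaddress.ip_network("2001:db8:cccc::/48"), 50),  # Cogent PA prefix
    (ipaddress.ip_network("2001:db8:aaaa::/48"), 40),  # UUNET PA prefix
    (ipaddress.ip_network("::/0"), 10),                # default
]

def precedence(addr):
    addr = ipaddress.ip_address(addr)
    # take the most-specific matching row, as in RFC 3484 longest match
    rows = [(net, prec) for net, prec in POLICY_TABLE if addr in net]
    return max(rows, key=lambda r: r[0].prefixlen)[1]

def pick_address(candidates):
    return max(candidates, key=precedence)

print(pick_address(["2001:db8:aaaa::80", "2001:db8:cccc::80"]))
# -> 2001:db8:cccc::80 (the Cogent address)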
Obviously, running Zebra with full routes on a server is a non-trivial
performance hit; multiply that out by the number of servers, and it
gets very expensive, very fast. All to regain capabilities we have
right now in IPv4 for free...
Now, of course, the "so-called easy" answer would be "let's introduce a
routing-policy middleware box that would handle that part". That box
would have the full routing tables and the site policies, and when
queried with "I'm server X, and this is the client and all his
locators, which one do I use?" it would spit back an answer to that
server that would be a fully informed decision, and the TE problem
becomes mostly solved. I say
But there seem to be two different problems here (at least :-)
- one: what TE capabilities are available with the PA addressing model
+ the shim tool, i.e. what can be done in this case.
- two: who is in control of these capabilities and how they are
managed, i.e. who controls the policy and who manages the devices that
are in control of the policy. Is it possible to have centralized policy
management? Is it possible to enforce the usage of the policy (at least
within the multihomed site)?
I guess that before we were considering the first problem and now the
second one...
This server idea that you are considering was presented by Cedric de
Launois in a work called NAROS a while ago (a rough sketch of such an
exchange follows below).
Another option is what we are discussing below about using a DHCP/RAdv
option to distribute the policy information among the hosts.
Another option is to move to a scheme based on rewriting source
prefixes.
Or a combination of those.
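
Purely as illustration, a rough sketch of what a NAROS-style
query/response could look like; the message format, names, and cost
model are invented assumptions, not the actual NAROS protocol:

import json

def naros_decide(request, routing_policy):
    """Policy-server side: pick the cheapest usable client locator."""
    def cost(loc):
        # routing_policy maps a prefix to a cost; lower is better
        for prefix, c in routing_policy.items():
            if loc.startswith(prefix):
                return c
        return 100   # unknown prefix: assume expensive
    best = min(request["client_locators"], key=cost)
    return {"server": request["server"], "use_locator": best}

policy = {"2001:db8:cccc:": 0,    # Cogent: settlement-free peering
          "2001:db8:aaaa:": 20}   # UUNET: $20/Mbps transit

query = {"server": "www1",
         "client_locators": ["2001:db8:aaaa::5", "2001:db8:cccc::5"]}
print(json.dumps(naros_decide(query, policy)))
# -> picks the Cogent locator, keeping reply traffic off paid transit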
"mostly", because now there are these pesky issues of a) do I trust
that
the server is going to obey by this decision (either hacked, or is a
box
outside of my administrative control, yet is within the scope of my
network control); b) how do transit ISP's "influence" that decision (at
some point I cross their network, and they should be able to control
how
the packets are flowing through their network; c) how do I verify that
their "influencing" doesn't negate mine, and is legitimate; d) how much
"lag" does it introduce into every session establishment, and is it
acceptable; d) can this proxy scale to the number of queries fired at
it,
and the real-time computations that would have to happen on each one
(since we can't precompute the answers); and finally is it *really*
more
cost-effective then doing all this in routers.
So far, I'd rather pay for bigger routers...
:: I mean in this case, the selection of the server's provider is
:: determined by the server's address, not by the client's address,
:: right?
:: The server can influence such a decision using SRV records in the
:: DNS, but I'm not sure yet if this is the case you are considering
See above about difficulties of scaling DNS to meet this goal...
But the problem with the DNS that you have considered above is about
making the DNS publish information that reflects the state of the
links.
This seems indeed very difficult, especially because of cached
information and so on. But as far as I know, no one is proposing this.
The idea is to use SRV records to express policy, and a not very
dynamic one. I mean, you can express that, say, 30% of the
communications need to use a given address and the others the other
address and so on, but the idea is not to have the DNS reflect the
state of the network.
Actually, it may happen that some of the addresses in the DNS are down.
In this case, the idea is to let the hosts detect this and retry using
alternative addresses. Whether this retry is visible or not to the apps
is still an open issue.
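
For example, the rough 30%/70% split mentioned above maps naturally
onto SRV weights (RFC 2782). A minimal sketch of the weighted selection
a client performs over the returned records (hostnames invented):

import random

srv_records = [  # (priority, weight, target)
    (10, 30, "www-a.example.com"),   # ~30% of new communications
    (10, 70, "www-b.example.com"),   # ~70% of new communications
]

def pick_srv_target(records):
    best_priority = min(r[0] for r in records)
    candidates = [r for r in records if r[0] == best_priority]
    total = sum(w for _, w, _ in candidates)
    point = random.uniform(0, total)     # weight-proportional choice
    for _, weight, target in candidates:
        point -= weight
        if point <= 0:
            return target
    return candidates[-1][2]

counts = {"www-a.example.com": 0, "www-b.example.com": 0}
for _ in range(10000):
    counts[pick_srv_target(srv_records)] += 1
print(counts)   # roughly 3000 / 7000: policy, not link state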
:: > This change alone would add millions to the bw bills of said
:: > content providers, and, well, reduce the likelihood of adoption of
:: > the protocol by them. Now, if the shim6 init takes place in the
:: > 3-way handshake process, then the servers "somewhat" know what all
:: > the possible paths to reach that locator are, but then they would
:: > need some sort of a policy server telling them who to talk to on
:: > what IP, and that's something which will simply not scale for
:: > 100K+ machines.
:: >
::
:: I am not sure I understand the scaling problem here
:: Suppose that you are using a DHCP option for distributing the shim6
:: preferences of the RFC 3484 policy table; are you saying that DHCP
:: does not scale for 100K+ machines? or is there something else other
:: than DHCP that
Well, first, show me a content provider who thinks that DHCP scales for
a datacenter (other than initial pxeboot/kickstart/jumpstart,
whatever), but that aside, running zebra/quagga + synchronizing policy
updates among 100K+ machines simply does not scale (operationally).
So, you are considering here the case where the policy is changed
according to the state of the network, right?
So BGP information is used as feedback to the TE decision, is that
correct?
Is this possible today? How is it done? Could you provide an example of
how you use this dynamic TE setting?
:: > 4) As has also been discussed before, the initial connect time has
:: > to be *very* low. Anything that takes longer than 4-5 seconds, and
:: > the end-users have a funny way of clicking "stop" in their
:: > browser, deeming that "X is down, let me try Y", which is usually
:: > not a very acceptable scenario :-) So, whatever methodology we use
:: > to do the initial set-up has to account for that, and be able to
:: > get a connection that is actually starting to do something in
:: > under 2 seconds, along with figuring out which source-IP and
:: > dest-IP pairs actually can talk to each other.
::
:: As I mentioned above, we are working on mechanisms other than the
:: shim6 protocol itself that can be used for establishing new
:: communications through outages.
::
:: you can find some work in this area in
::
:: ftp://ftp.rfc-editor.org/in-notes/internet-drafts/draft-bagnulo-ipv6-rfc3484-update-00.txt
It's a fairly good idea for negotiating which SRC and DEST IPs to pick,
but it has to happen *fast* (ie sub-2 seconds), or the end-users will
lose patience and declare the site dead. Perhaps racing SYNs?
Yes, this is an option, and it is nice because you actually get not
only to detect which ones are actually working but also to pick the
fastest one. But clearly there is the cost of the additional SYNs you
send, which is basically overhead... would you be willing to pay for
these multiple SYNs?
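
A minimal sketch of what racing SYNs could look like on the client
side, assuming the host can bind each attempt to a different source
locator; the address pairs are illustrative:

import socket
from concurrent.futures import ThreadPoolExecutor, as_completed

def try_pair(src, dst, port=80, timeout=2.0):
    s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    s.settimeout(timeout)
    s.bind((src, 0))          # pin the source locator for this attempt
    s.connect((dst, port))    # one racing SYN (plus the handshake)
    return s

def race_syns(pairs):
    """Open all (src, dst) pairs in parallel; keep the first handshake
    that completes and close the redundant ones (the overhead being the
    extra SYNs discussed above)."""
    with ThreadPoolExecutor(max_workers=len(pairs)) as pool:
        futures = {pool.submit(try_pair, s, d): (s, d) for s, d in pairs}
        winner = None
        for fut in as_completed(futures):
            try:
                sock = fut.result()
            except OSError:
                continue          # that address pair is down or slow
            if winner is None:
                winner = sock     # fastest working path wins
            else:
                sock.close()      # loser connections get torn down
        return winner

# e.g. race_syns([("2001:db8:aaaa::5", "2001:db8:cccc::80"),
#                 ("2001:db8:cccc::5", "2001:db8:aaaa::80")])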
Now, I'm not saying that all these problems can't be solved for people
to consider shim6 a viable solution, but so far they aren't solved, and
until they are, I just don't see recommending to my employer that we
take shim6 seriously,
I may well agree with you here, but remember that we are still defining
the protocol :-)
I guess the point here is how we can manage to provide a solution that
fits the site's requirements, hence your feedback is very valuable.
Regards, marcelo
since it seems like all it's going to do is move the costs elsewhere,
and quite possibly increase them quite a bit in the process...
-igor