
Re: shim6 @ NANOG (forwarded note from John Payne) (fwd)



Hey Marcello,

	comments are in-line...

:: > This makes a lot of sense, provided this happens under the hood of the
:: > application (ie web-browser in this case). So, right now, for example, if
:: > a client is pulling down a web page, gets the html, and in the middle
:: > of downloading the .gif/jpg his session dies (ie TCP RST), the jpg that
:: > the client was in the middle of x-fering will get that ugly red "X".
:: > (most browsers, right now, will not re-try to get the object again, and
:: > will just show it as unavailable). This issue is deemed important enough
:: > that most large content providers are spending an inordinate amount of
:: > money on loadbalancers with active session sync to try to prevent that
:: > from happening in the event of a loadbalancer fail-over. So, if
:: > application behavior could be changed to say "if shim6 fail-over is
:: > possible, and the connection just died (for any definition of die), then
:: > attempt to re-establish the connection through the shim, and then re-get the
:: > failed object", that would go a long way in making this kind of fail-over
:: > better.
:: > 
:: 
:: but, i have an additional question about this point. the point is, if the
:: application is the one that will determine that there is a problem and will
:: ask the shim to establish a context (which is ok and no problem here)
:: wouldn't the application be better off simply retrying with alternative
:: locators by itself, rather than asking the shim to do it?

And therein lies the problem -- the applications *don't* do this 
(although, yes, I agree, in a perfect world the app developers should 
handle this, but they don't, and I don't really see software quality and 
practices getting any better, which is a whole other debate), which is 
why if you want shim6 to work, you need to handle this inside the shim, 
and not rely on the applications to do it. The application should just be 
a dumb thing that goes "connect" and the shim should handle everything 
for it (connection, redundancy, failover/convergence, everything), much 
like BGP does it today -- transparently to the app.
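
To make that concrete, here's a rough sketch (Python, with a hypothetical 
locator list) of the retry logic every single application would otherwise 
have to carry around itself -- exactly the kind of thing that belongs 
under the shim:

    # Sketch of per-application failover across locators, i.e. the
    # logic the shim is supposed to hide from every app.
    import socket

    def fetch_with_failover(port, request, locators, timeout=3.0):
        """Try each known locator in turn; re-send the request on failure."""
        last_err = None
        for addr in locators:          # e.g. ["2001:db8:a::1", "2001:db8:b::1"]
            try:
                with socket.create_connection((addr, port), timeout=timeout) as s:
                    s.sendall(request)
                    chunks = []
                    while True:
                        data = s.recv(4096)
                        if not data:
                            break
                        chunks.append(data)
                    return b"".join(chunks)    # success on this locator
            except OSError as err:             # RST, timeout, unreachable...
                last_err = err                 # fall through to the next locator
        raise last_err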

:: > The difference with shim6, as opposed to v4, is that in the v4 world, the
:: > connection wouldn't die, it would just hang for the duration of
:: > convergence (provided convergence is fast enough, which normally it is),
:: > and then continue on its merry way with new tcp windows. In Shim6, if
:: > the client[ip1]-server connection goes down, re-establishing to
:: > client[ip2]-server would not be "hitless" (ie session would die), and to
:: > solve that problem we are back at either keeping an inordinate amount of
:: > state on the webservers (which is not very realistic), a shift in the
:: > way people write applications (which, in my opinion is preferred, but a
:: > *very* hard problem to solve), or to somehow figure out how to hide this
:: > in the stack with minimal performance hit (let's say sub 1% memory hit)
:: > when you have 30k+ simultaneous connections per server...
:: 
:: well if you use the shim approach that you suggest above, the server does not
:: have to store any shim state while things are going fine, and if a client
:: detects a problem it can trigger the creation of the shim context from the
:: client to the server. At this point, the server will need some shim state,
:: but only for those connections that have failed (of course if one of the
:: links to the server went down, then all the clients connecting through that
:: link will attempt to create a shim state)
:: 
:: I guess that this could be a reasonable trade-off between state in the server
:: and response time when outages occur

Ah, but if the server does not keep state (or at least is unaware of all 
of client's locators), doing outbound TE becomes *very* hard (see 
below)...
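
(To put rough numbers on the state trade-off, though -- and these are my 
assumptions, not measurements -- deferred contexts do look cheap:

    # Back-of-envelope shim state cost; every number here is assumed.
    conns           = 30_000   # simultaneous connections per server (from above)
    ctx_bytes       = 512      # assumed size of one shim6 context
    failed_fraction = 0.05     # assumed share of clients behind a failed link

    eager    = conns * ctx_bytes                    # context for every connection
    deferred = conns * failed_fraction * ctx_bytes  # context only after failures

    print(f"eager:    {eager / 2**20:.1f} MiB")     # ~14.6 MiB
    print(f"deferred: {deferred / 2**20:.2f} MiB")  # ~0.73 MiB

so the memory hit isn't the hard part; the visibility is.)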

:: > : : I fail to understand the example that you are presenting here...
:: > : : 
:: > : : are you considering the case where both the client and the server
:: > : : are both multihomed to Cogent and UUnet?
:: > : : something like
:: > : : 
:: > : : UUnet
:: > : : /     \
:: > : : C       S
:: > : : \     /
:: > : : Cogent
:: > 
:: > Yes, but now imagine that the "C" in this case is a client using shim6 with
:: > multiple IPs, and the server is in IPv6 PI space. Also, if it wasn't in
:: > PI space, the connection to the server *can* be influenced via SRV
:: > (although
:: > that's trying to shoehorn DNS into where perhaps it shouldn't go -- since
:: > now the DNS server needs to be aware of link-state in the network to
:: > determine if the UUnet/Cogent connections are even up, and for a
:: > sufficiently large "S", that could be 10's, or even 100's of links, which
:: > presents a very interesting scaling problem for DNS.. (even more
:: > interesting is that most large content providers are actually in the
:: > 1000's, and that's why they can get PI space -- they are effectively (at
:: > least) a tier-2 ISP). But, back to the example at hand.. so, for the
:: > sake of this example, let's say that the UUnet port is $20/Mbps, and
:: > the Cogent port is an SFI (free) peer. So, the client (with IPs of
:: > IP-uunet and IP-cogent) picks IP-uunet (because they want to use their
:: > UUnet connection outbound) to initiate a connection to the server, the
:: > problem now comes from the fact that the server, when replying to the
:: > client, is unaware that the IP-cogent address is associated with the client
:: > (since the shim layer has not kicked in on initial connect) and will have
:: > to send traffic through the very expensive UUnet port.
:: 
:: that i don't follow
:: 
:: suppose that the server has v6 PI addresses, which for very big sites makes
:: sense imho
:: 
:: The server can send traffic with a destination address belonging to UUnet
:: through Cogent, right? I mean i am assuming that UUnet and Cogent have
:: connectivity that is not through S

No, it cannot. Cogent, being an SFI peer, will only advertise *their 
customer* blocks, which the UUnet address will not be a part of 
(ip-cogent will be, but the server doesn't know that the two are 
equivalent). Therefore, since I only see this user as ip-uunet, unless I 
somehow know that this guy is also ip-cogent, I can only respond to him 
via full transit (ie expensive) providers, in this case uunet. If the 
servers/network/whatnot were aware that locator1 = {ip-uunet, ip-cogent} 
and then able to equate the two, *then* we could make a correct TE
decision somewhere (the where is still to be determined, but at least now 
we are capable of it).
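
To illustrate with a toy model (Python; the reachability sets and the 
$20/Mbps figure come from the example above, everything else is 
invented): once the locator set is known, the cheap egress falls out 
immediately.

    # Toy model of the outbound TE decision. An SFI peer only advertises
    # its customer blocks, so ip-uunet is reachable via transit only.
    cost = {"cogent": 0, "uunet": 20}    # $/Mbps, per the example

    def reachable_via(locator):
        """Which egress links have a path to this locator?"""
        if locator == "ip-cogent":
            return ["cogent", "uunet"]   # Cogent customer block + transit default
        return ["uunet"]                 # everything else: transit only

    def pick_egress(locator_set):
        """Cheapest egress over every locator we know for this client."""
        return min((cost[link], link, loc)
                   for loc in locator_set
                   for link in reachable_via(loc))

    print(pick_egress({"ip-uunet"}))               # (20, 'uunet', 'ip-uunet')
    print(pick_egress({"ip-uunet", "ip-cogent"}))  # (0, 'cogent', 'ip-cogent')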

:: I mean, the client can choose to use the IP from UUnet (that is his choice
:: and he has the right to do so, because he is paying for it) This choice,
:: affects the ISP used to get _to_ the client and it shouldn't determine the
:: ISP used to get to the Server
:: 
:: So in this case the traffic would flow:
:: From the client to the Internet through UUnet
:: From the Internet to the server through Cogent
:: 
:: agree?

Ah, I see the confusion here.. so, in my example, I have no issues with 
the inbound path (client->server); my issue is with TE on the *outbound* 
path (server->client). Since I only see customer blocks from SFI peers, 
the sheer fact that the client picked ip-uunet has dictated that I would 
need to answer to him via ip-uunet, which means I have to do so over the 
uunet link, whereas in ipv4, I see the entire map, and I'm aware that 
locator1 is behind both uunet and cogent, and I can choose to take the 
cogent path.

:: > With v4, on the
:: > other hand, the router was aware that Client is reachable via both Cogent
:: > and UUnet, and could have had a localpref configured that would just say
:: > "anything reachable over cogent, use cogent". One way to fix that would be
:: > to do a shim6 init in the 3way handshake, but the problem then becomes
:: > that *every* "S" would have to have a complete routing table, and
:: > basically, perform the logic that is done in today's routers.
:: 
:: why is that?
:: 
:: I mean if S prefers Cogent, all he has to do is:
:: - In the PI case, route its outgoing packets through Cogent and do the same v4
:: bgp magic to direct incoming packets through Cogent

Again, see above: just because S prefers Cogent doesn't mean that it's 
aware that this customer is reachable via Cogent, and until he is aware, 
he can't make the proper decision (I think our disconnect is that you 
assume that people have full routes from everybody, which for networks 
closer to "the core" is absolutely not the case).

:: but the problem with the DNS that you have considered above is about making
:: the DNS publish information that reflects the state of the links.
:: This seems indeed very difficult, especially because of cached information and
:: so on. But as far as i know, no one is proposing this. The idea is to use SRV
:: records to express policy, and a not very dynamic one. I mean, you can
:: express that like 30% of the communications need to use a given address and
:: the others the other address and so on, but the idea is not to allow the DNS
:: to reflect the state of the network
:: Actually, it may happen that some of the addresses in the DNS are down. In
:: this case, the idea is to let the hosts detect this and retry using
:: alternative addresses. Whether this retry is visible or not to the apps is
:: still an open issue

I see where you are going here, and it's not a bad idea, but there is a 
fundamental security problem here -- the SRV records would put the 
responsibility for following them onto the client, and well, who would 
trust them? Imagine a particularly smart worm writer who looks up his 
target's SRV records, and then decides to put all the load onto where his 
target doesn't want it, creating a very effective attack vector. 
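
For concreteness, here's what that client-side weighted selection looks 
like as a sketch (RFC 2782-style; the targets and the 30/70 split are 
invented) -- note that nothing in it can be enforced by the server:

    import random

    # Invented SRV-style records: (target, weight). This is the ~30%/70%
    # policy split expressed as weights a *well-behaved* client honors.
    srv = [("server-a.example.net", 30), ("server-b.example.net", 70)]

    def pick_target(records):
        # Weighted choice, run entirely on the client -- a hostile client
        # can just invert it and aim all load where the weights say not to.
        total = sum(weight for _, weight in records)
        r = random.uniform(0, total)
        for target, weight in records:
            r -= weight
            if r <= 0:
                return target
        return records[-1][0]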

:: So, you are considering here the case where policy is changed according to
:: the state of the network, right?
:: So that BGP information is used as feedback to the TE decision, is that
:: correct?
:: Is this possible today? how is it done? could you provide an example of how
:: you use this dynamic TE setting?

I have a feeling that we have a different definition of "policy". So, to 
me, there are 2 types of policies -- inbound policy and outbound policy. 
For outbound policy, BGP feedback is always used: if I can reach prefix X 
through an SFI link, and a paid link, the SFI prefixes are local-pref'd 
higher, and traffic will go out that way. If the SFI link goes down, or if 
the end-users link to the SFI peer goes down, the prefix is withdrawn on 
that peering session, and I will choose the next-best way (according to my 
policy) to get to the end-user. Inbound policy is even more interesting -- 
advertise my routes to ISP X, Y, and Z, and send communities to X to treat 
my routes as peer routes (ie advertise to customers only), communities to 
Y to bring my route-preference down to be below his peers (so only 
"active" if he doesn't see *any* other way to get to me), and to Z send 
them as-is. This way, I know that all customers of ISP X who send traffic 
to me will come in through uplink-X, and *all other* traffic 
will come in through uplink-Z. If uplink-Z goes down, all other traffic 
will come through uplink-Y, and if uplink-X goes down, that traffic will 
re-route to uplink-Z, unless it's down, in which case it will go via 
uplink-Y. The 
inbound policies get even more interesting when you meet the peer in 
multiple places (which you usually do), and exchange MEDs with them to try 
to influence what routes they send to where, all depending on the state 
of the network, and which internal paths you or they have available at 
this time. 
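
If it helps, the outbound half of that reduces to a few lines of 
pseudo-policy (a sketch, with invented local-pref values -- real configs 
obviously carry a lot more knobs):

    # Stripped-down model of outbound policy: prefer the free SFI path,
    # fall back automatically when BGP withdraws it.
    LOCAL_PREF = {"sfi": 200, "transit": 100}

    # Live BGP feedback: (link, link_type) pairs currently advertising
    # a path to the end-user's prefix.
    paths = [("cogent", "sfi"), ("uunet", "transit")]

    def best_path(paths):
        # Highest local-pref wins; a withdrawn path simply isn't in the
        # list any more, so traffic shifts to the next-best way.
        return max(paths, key=lambda p: LOCAL_PREF[p[1]])

    print(best_path(paths))              # ('cogent', 'sfi')
    paths.remove(("cogent", "sfi"))      # SFI session withdraws the prefix
    print(best_path(paths))              # ('uunet', 'transit')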

So, perhaps the answer is that people with SFI should get PI 
space, and that would probably take care of the majority of inbound policy 
issues (hell, you can easily make an argument that everyone who peers is 
an ISP), but that still doesn't solve the problem of lack of visibility 
for outbound TE.

:: yes, this is an option and it is nice because you get not only to
:: detect which ones are actually working but also to pick the fastest one. But
:: clearly there is the cost of the additional SYNs you send that is basically
:: overhead... would you be willing to pay for these multiple SYNs?

I would, yes, as long as the client, upon getting all the syn-acks, only 
picks 1 of them to ack again (ie i'm fine with doing 3-4x more syn 
cookies, as long as that's *all* i'm doing, since servers don't begin to 
keep track of state until the final ack w/ syn cookies, and most servers 
are capable of generating *lots* of syn cookies these days..)
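
In other words, something like this on the client side (a rough sketch, 
happy-eyeballs style, hypothetical locators): race the SYNs, ACK only the 
winner, close the rest, and the server's only extra cost is the cookies.

    # Race connection attempts to all locators; keep the first handshake
    # to complete and drop the rest. Each extra SYN costs the server one
    # syn cookie, but no state until its final ACK -- which never comes
    # for the losers.
    import socket
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def race_connect(locators, port, timeout=3.0):
        def dial(addr):
            return socket.create_connection((addr, port), timeout=timeout)

        winner = None
        with ThreadPoolExecutor(max_workers=len(locators)) as pool:
            futures = [pool.submit(dial, addr) for addr in locators]
            for fut in as_completed(futures):
                try:
                    sock = fut.result()
                except OSError:
                    continue          # that locator lost or is down
                if winner is None:
                    winner = sock     # fastest handshake wins
                else:
                    sock.close()      # late arrivals get closed
        return winner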

Hope this helps,
-igor