[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: shim6 @ NANOG (forwarded note from John Payne)
On 24-feb-2006, at 19:47, Jason Schiller (schiller@uu.net) wrote:
I am baffled by the fact that Service Provider Operators have come out in
this forum, at the IAB IPv6 multihoming BOF, and other places, and have
explained how they and their customers use traffic engineering, yet up
until now, shim6 has not tried to provide their needed functionality.
I think what we have here is a disconnect between what's going on in
the wg (and the multi6 design teams) and what's visible from the
outside.
I remember MANY conversations, in email and during meetings, about
traffic engineering. And for me, there has never been any question
that traffic engineering is a must-have for any multihoming solution.
Paying for two (or more) links and only being able to use one 99% of
the time is simply too cost-ineffective. And just maybe we can
convince people that shim6 makes for good multihoming even though it
doesn't give you portable address space, but it's never going to fly
if the TE is unequivocally worse than what we have today. (And I've
said this in the past.)
However, for a number of reasons this isn't all that apparent to an
outside observer:
- part of these conversations were on closed design team lists, private
email or in (design team/interim) meetings (for instance, only 3% of
the messages in multi6 for the last couple of years mention TE)
- I don't think any of us (certainly not me) saw TE as a
particularly hard-to-solve problem
- TE can only happen if the base mechanisms are well understood, so
we're focusing on those first
This is part of the reason more service providers are not involved in
the IETF.
"You have to do what we want or we'll boycot you"? This way, only
five people would be active in the IETF...
The other part, as KC Claffy points out, is cost:
http://www.arin.net/meetings/minutes/ARIN_XVI/ppm_minutes_day1.html#anchor_8
[Debugging broken PMTUD over IPv6 at ARIN]
I'm not sure which statement about cost you are referring to, or why.
Some history...
1. RFC-3582 attempts to document IPv6 multi-homing requirements.
Forget this RFC; it exists because of the inner workings of the IETF
and doesn't do anything useful in the real world.
2. I tried to document the basic building block for TE.
-Primary / backup
-Load all links as best as possible
-Use best path
-any combination of these basic building blocks
-additional ability to increase or decrease traffic for any of these
The response I got was: do people actually do this?
What I said was that I didn't understand why people want to have two
links and then have the second one sit idle until the first fails. I
know people want this because I used to configure this for customers
when I worked at UUNET NL. But my thinking is that if you have
multiple links, you'll want to use all of them.
3. IAB IPv6 multi-homing BOF
It seems to me that Service Provider Operators made a very clear
statement at the BOF.
-Traffic engineering is needed day 1.
I agree with that one.
* Traffic engineering should not be an end host decision, but an
end site (network level) decision [managing on the end host is
the wrong place]
If hosts can do congestion control they can do traffic engineering.
The only question is how to get site-wide policies into hosts.
* Traffic engineering needs to support in-bound and out-bound
traffic management
Sure.
* Traffic engineering needs to be allowed by transit ASes as well
as end site ASes [don't leave all ISP TE in the hands of our
customers]
Are you saying that if I have two ISPs, those get to decide how I
balance my traffic over them? What if they turn this knob in opposite
directions?
Although I think it's useful for networks in the middle to be able to
express some pushback, I'm not sure if this is implementable for
sites that don't have a full BGP feed, and if it turns out this is
impossible or too hard to implement, I don't think that's a fatal
flaw. You don't get to push back on single homed customers either.
-First hit is critical
* establishing shim6 after the session starts doesn't help
short lived sessions
I'm not sure where this comes from. Since shim6 doesn't come into
play until there is a failure, and failures are too rare to be
meaningful in TE, the shim6 failover protocol itself is fairly
meaningless for TE. What we need are mechanisms to do
source/destination address selection in a way that can be traffic
engineered. The length of individual sessions is meaningless as shim6
doesn't work per-session. Most short sessions are part of a longer
lived interaction (i.e., a user visiting a WWW server and retrieving
dozens or hundreds of resources over the course of a dozen seconds to
many minutes).
* Keeping shim6 state on the end host doesn't scale for content
providers. A single server may have 30,000 concurrent TCP
sessions
Right. So there is precedent for storing state for 30000 instances of
"something". Servers are getting a lot faster and memory is getting
cheaper so adding a modest amount of extra state for longer lived
associations shouldn't be problematic.
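The back-of-envelope arithmetic can be made explicit. This is a sketch only: the per-context size is an assumption for illustration, since the real figure depends on the shim6 implementation (context tags, locator lists, security material).

```python
# Back-of-envelope estimate of shim6 state cost on a busy server.
# CONTEXT_BYTES is an assumed per-context size, not a measured one.
SESSIONS = 30_000      # concurrent TCP sessions, from the text above
CONTEXT_BYTES = 512    # assumed size of one shim6 context

def shim6_state_megabytes(sessions: int, context_bytes: int) -> float:
    """Total shim6 context state, in megabytes."""
    return sessions * context_bytes / 1_000_000

total = shim6_state_megabytes(SESSIONS, CONTEXT_BYTES)
print(f"{total:.1f} MB of shim6 state")  # prints "15.4 MB of shim6 state"
```

Even with a generous per-context size, the total is a modest fraction of the memory on a server that is already juggling 30,000 TCP control blocks.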
(Visit a run of the mill content provider and see how many 100 byte
GIFs they send you over HTTP connections that have 700 - 1200 byte
overhead and of course all have high performance extensions turned on
for extra bandwidth wastage.)
-Maybe 8+8 / GSE seems to be a better starting point to support
transit AS TE and to avoid the first hit problem and still allow for
"easy" multi-homing for consumer customers?
8+8/GSE won't work: it doesn't tell us how to do failover, it
requires changes to TCP and other upper layer protocols, and the
locator-identifier binding is insecure. On the surface, it may seem
that TCP/IP as we know it today is insecure to begin with, so the
GSE/8+8 insecurity doesn't add new holes. Unfortunately, it does. With IP
as it is today, when I want to pretend that I'm www.yahoo.com at the
IP level, I have to send out packets with a source address that
matches www.yahoo.com (which is generally easy) but I also have to
make sure that packets toward that address get back to me. On an
insecure (wireless) LAN this is easy, but once the packet ends up at
an ISP network, this isn't easy to do, and almost impossible to hide.
With 8+8 on the other hand, I can just create a packet that has the
Yahoo identifier, and my locator. This way, I can very easily get my
victim to talk to me while thinking he is talking to Yahoo.
Funny thing: you can look at shim6 as a next generation of GSE/8+8
(16+16) that removes the problems listed above.
The response sounds to me like the shim6 wg is finally interested in
considering decent TE as a "requirement". Yay! But I am concerned
about what Operators and IETF folk think is "decent TE",
Let me speak for myself and speculate a bit: what we should do is
have multihomed sites publish SRV (or very similar) records with two
values: a "strong" value that allows primary/backup mechanisms, and a
"weak" value that allows things like 60% of all sessions should go to
this address and 40% to that one.
Then, before a host sets up a session it consults a local policy
server that adds local preferences to the remote ones and also
supplies the appropriate source address that goes with each
destination address. New mechanisms to distribute this information
have been proposed in the past, but there is already a service that
is consulted before the start of most sessions, so it makes sense to
reuse that service. (No prizes for guessing what service I'm getting
at.)
This would allow for pretty fine tuned incoming TE, as long as the
other end doesn't have a reason to override the receiving site's
preferences.
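The strong/weak idea can be sketched in a few lines. Everything here is hypothetical: the record format, field names, and addresses are invented for illustration, and how the values reach the host (DNS or otherwise) is left open.

```python
import random

# Each candidate destination carries a "strong" value for primary/backup
# ordering (lower wins) and a "weak" weight for proportional load
# sharing among addresses of equal strength. All values are made up.
CANDIDATES = [
    # (address, strong, weak)
    ("2001:db8:1::80", 10, 60),   # preferred set, ~60% of sessions
    ("2001:db8:2::80", 10, 40),   # equal strength, ~40% of sessions
    ("2001:db8:3::80", 20, 100),  # backup: only if the strong-10 set fails
]

def pick_destination(candidates, rng=random):
    """Lowest strong value wins; ties are broken by weighted random
    choice on the weak value."""
    best = min(s for _, s, _ in candidates)
    pool = [(addr, w) for addr, s, w in candidates if s == best]
    addrs, weights = zip(*pool)
    return rng.choices(addrs, weights=weights)[0]
```

A local policy server could scale the weak values before handing them to hosts, which is how site-wide preferences would get combined with the remote site's published ones.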
I also imagine some use of measured and synthetic round trips to
select the "fast" path where possible. This can't be done in BGP: BGP
is pretty good at avoiding very bad paths, but it's not so good at
selecting the best ones.
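A host-side version of this is easy to sketch. The sampling method (TCP handshake timing, shim6 reachability probes, synthetic pings) is deliberately left open; the smoothing constant is the one TCP uses for its SRTT, and the address pairs below are illustrative.

```python
# End-host path selection from measured round-trip times, something BGP
# cannot express. rtt_samples maps a (source, destination) address pair
# to a list of RTT measurements in milliseconds.

def smoothed_rtt(samples, alpha=0.125):
    """Exponentially weighted moving average, as TCP's SRTT does."""
    srtt = samples[0]
    for s in samples[1:]:
        srtt = (1 - alpha) * srtt + alpha * s
    return srtt

def fastest_pair(rtt_samples):
    """Return the (source, destination) pair with the lowest smoothed RTT."""
    return min(rtt_samples, key=lambda pair: smoothed_rtt(rtt_samples[pair]))
```

Note that smoothing matters: a single congestion spike on the otherwise fast path shouldn't flip the selection back and forth.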
|Yuck, you should never announce more specifics for this.
Please believe the DFZ Service Providers when they explain how they,
and their customers, do TE.
I believe that they do it, because I see that the global routing
table has increased by 16% last year. I have to admit that I've done
this myself from time to time, but only if AS path prepending (or
changing the origin attribute) wouldn't result in something
reasonable. It seems to me that for many people deaggregating is the
default these days. And then not just breaking a /20 into two /21s,
but going for broke and announcing 16 /24s; who cares?
Take the picture below where cust1 has connectivity to UUNET and
at&t, and cust2 has connectivity to Sprint and L(3). UUNET, at&t,
Sprint, and L(3) all peer with each other.
UUNET---Sprint
/ | \ / | \
/ | \/ | \
cust1 | /\ | cust2
\ | / \ | /
\ | / \ | /
at&t------L(3)
-cust1 pays a flat rate to at&t and per packet to UUNET.
-cust1 prefers to use the at&t link as primary (in and out bound)
-cust1 sends BGP community 701:80 to UUNET, and UUNET sets a local
pref of 80 on behalf of the customer
-cust2 has more out bound than in bound traffic.
-cust2 wants to load share all out bound traffic across both links
-cust2 wants traffic delivered to it over the "best" path
Traffic from cust1 to cust2
---------------------------
1. cust1 will send the traffic to at&t
2. at&t will decide if it is better to deliver traffic to cust2
via the exit point to L(3) or via the exit point to Sprint
3A. If at&t thinks the Sprint exit is more preferred, then
Sprint should deliver traffic to its customer over the
Sprint-cust2 link
3B. If at&t thinks the L(3) exit is more preferred, then
L(3) should deliver traffic to its customer over the
L(3)-cust2 link
*In this case at&t can do some TE. Sprint may actually be
closer or further than L(3), or at&t may artificially
distance or shorten Sprint, or may force certain prefixes
to prefer Sprint or L(3) [this is usually only the case for
purchased transit and not peering]
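The knobs in this walkthrough can be modelled in a few lines. This is a highly simplified sketch covering only the steps the scenario relies on (highest local-pref, then shortest AS path, then lowest IGP metric to the exit, i.e. hot potato); real BGP has many more tie-breakers, and the route attribute values here are invented for illustration.

```python
# Simplified BGP decision process for the cust1/cust2 scenario.

def local_pref_from_communities(communities, default=100):
    """The 701:80-style knob: UUNET sets local-pref 80 on a route
    carrying the 701:80 community, on behalf of the customer."""
    return 80 if "701:80" in communities else default

def best_path(routes):
    """routes: list of dicts with local_pref, as_path, igp_metric.
    Highest local-pref wins, then shortest AS path, then lowest IGP
    metric ('hot potato')."""
    return max(routes, key=lambda r: (r["local_pref"],
                                      -len(r["as_path"]),
                                      -r["igp_metric"]))
```

This makes the community's effect concrete: a route tagged 701:80 loses to any untagged route regardless of AS-path length, which is exactly how cust1 keeps UUNET as backup.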
So far so good. Note that with shim6, it's possible (although
probably hard to do in practice) for cust1 to use four different
paths: uunet->sprint, at&t->l3, but also uunet->l3 and at&t->sprint.
So in the presence of congestion or scenic routing, there is a much
better chance for the customer to utilize the optimal path.
This is both good and bad for ISP/carriers as the customer experience
improves, but customers will more actively avoid "bad" paths so they
can't get away with those as much as they can now.
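The four-path claim is just the cross product of the two sites' addresses. In this sketch the addresses are invented; with shim6, each site holds one address per upstream, so the hosts can combine sources and destinations freely where BGP today picks a single path.

```python
from itertools import product

# One address per upstream for each site (addresses are illustrative).
cust1_addrs = ["2001:db8:701::1", "2001:db8:7018::1"]   # via UUNET, via at&t
cust2_addrs = ["2001:db8:1239::2", "2001:db8:3356::2"]  # via Sprint, via L(3)

# 2 sources x 2 destinations = 4 candidate paths to probe.
candidate_paths = list(product(cust1_addrs, cust2_addrs))
print(len(candidate_paths))  # prints 4
```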
Traffic from cust2 to cust1
---------------------------
1. cust2 will spray traffic to Sprint and at&t
2A. UUNET is not advertising cust1 routes to Peers as
the best path is learned from a Peer and UUNET does
not provide transit to Peers.
3A. L(3) and Sprint will forward traffic to at&t
4A. at&t will forward traffic to their customer over the
at&t-cust1 link
2B. If at&t is a customer of UUNET instead of a Peer,
then UUNET will advertise the cust1
prefix to L(3) and Sprint.
3B. L(3) and Sprint will choose the best exit and
send the traffic either to at&t or to UUNET
4B. Traffic sent to UUNET will be delivered to at&t as
UUNET will honor the customer's low local pref community
Traffic sent to at&t (either from UUNET or L(3) or Sprint)
will be delivered over the at&t-cust1 link.
With shim6 and the TE I outlined earlier a correspondent would be
able to override the receiving site's wishes, which isn't possible in
the above scenario. However, it's unlikely that correspondents will
do this on a wide scale unless there is some reason why this is
beneficial to them.
In shim6, if cust1 chooses the Sprint IP address as the destination
then all transit ASes must deliver the traffic via Sprint. Transit
ASes have no way to know that the destination lives behind both
Sprint and L(3), and therefore cannot deliver the traffic via L(3)
even if the L(3) exit point is better.
If the shim6 sites have access to a BGP feed they can still do
outgoing traffic engineering as usual. However, I expect that only a
subset of all shim6 sites will bother to run BGP so many will have to
depend on end-to-end information which will often be better than what
BGP supplies, and sometimes (a lot) worse, but never as easy to
change by ASes in the middle.
Transit AS TE is more critical in the case of a moderate sized
transit AS that is purchasing transit from multiple upstreams,
especially when links are cost prohibitive. Take a large South
American ISP that has 16 STM-1s, where 4xSTM1 use the Americas 2
oceanic cable system to upstream transit provider1, 4xSTM1 use the
Emergia oceanic cable system to upstream transit provider1, 4xSTM1
use the Americas 2 oceanic cable system to upstream transit
provider2, and 4xSTM1 use the Emergia oceanic cable system to
upstream transit provider2. Now imagine that your most important
customer, who always complains about latency, should always use the
Americas 2 oceanic cable system to upstream transit provider1. Also
imagine all other traffic should load all the other links as equally
as possible, and given that any one or more links fail, the remaining
links should be loaded as equally as possible. Note: this is just one
example of a real world customer.
Unfortunately this is incompatible with hop-by-hop forwarding for
outgoing traffic from the customer. Obviously this can be solved both
today and with shim6 using MPLS or similar.