[RRG] For Newbies: ITR, ETR, DFZ, TE, ITRD & other terms explained
- To: Routing Research Group <email@example.com>
- Subject: [RRG] For Newbies: ITR, ETR, DFZ, TE, ITRD & other terms explained
- From: Robin Whittle <firstname.lastname@example.org>
- Date: Thu, 06 Mar 2008 14:42:45 +1100
- Organization: First Principles
- User-agent: Thunderbird (Windows/20080213)
I received a message from someone who has recently started to
follow the RRG list and had some trouble understanding some
terminology in the Ivip Conceptual Summary and Analysis document:
Here is what I wrote, explaining some terms and concepts which are
basic to all map-encap schemes, and some others which are specific
to Ivip:
RUAS - Root Update Authorisation System
ITRD - Full Database ITR
ITRC - Caching ITR
QSD - Full Database Query Server
QSC - Caching Query Server
An ITR is an Ingress Tunnel Router. This is common to all the
map-encap schemes (LISP, APT, Ivip and TRRP - see the RRG wiki page).
All these schemes involve the address space (IPv4 is assumed in the
following explanations, but the same applies to IPv6) having some
sections of its range devoted to "end-user networks". The idea is
that all hosts sending packets to addresses in these ranges will have
those packets handled by an ITR.
This ITR tunnels the packet to an Egress Tunnel Router (ETR). The
ITR requires some "mapping" information to tell it the address of
the ETR which is appropriate for this packet's destination address.
The tunnelled part of the packet's path is between the ITR and ETR:
the original packet is the payload in a larger packet, whose new
header is addressed to the ETR, not to the original packet's
destination.
The ETR strips off the outer header, reconstituting the original
packet, and sends this directly to the end-user's network.
The idea is that an end-user network can be anywhere in the world,
such as Madrid, and that ITRs all over the world, close to sending
hosts, will tunnel the packets directly to an ETR in Madrid. Later,
if the end-user network moves to Hawaii, the mapping is changed and
all the world's ITRs tunnel the packets directly to an ETR in
Hawaii. This makes the address space portable, and usable with any
ISP which provides an ETR. The ITR network, the mapping system and
multiple ETRs can be used to provide multihoming and Traffic
Engineering.
In Ivip, the mapping information is simple: If a destination
address is within a "micronet", then there is one ETR address to
which all packets addressed to addresses within that micronet must
be sent. (Bill Herrin suggested the term "micronet" and I now use
it in Ivip.)
In LISP, the term "EID prefix" means the same thing as "micronet".
An EID prefix's mapping information consists of one, two or more ETR
addresses and some information about priorities for choosing them,
regarding multihoming and Traffic Engineering in the form of load
sharing. I think APT and TRRP are similar.
These schemes - LISP, APT, Ivip and TRRP - are "map-encap" schemes,
meaning that the packet's destination address is used by an ITR to
look up some mapping information, which gives the ITR one or more
ETR addresses to which it will be tunnelled. "Encapsulation" means
putting the original packet in a larger packet, and sending it to
the ETR - which constitutes a tunnel.
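The map-encap idea above can be sketched in a few lines of code. This is only an illustration: the prefixes, ETR addresses and packet structure are assumptions for the example, not any scheme's actual wire format. The mapping gives one ETR per micronet, Ivip-style.

```python
import ipaddress

# Mapping table: micronet prefix -> ETR address (example values).
MAPPING = {"198.51.100.0/28": "192.0.2.1"}

def itr_encapsulate(packet):
    """ITR step: wrap a packet whose destination is in a mapped micronet."""
    dst = ipaddress.ip_address(packet["dst"])
    for prefix, etr in MAPPING.items():
        if dst in ipaddress.ip_network(prefix):
            # Outer header addressed to the ETR; original packet is payload.
            return {"dst": etr, "payload": packet}
    return packet   # not in any micronet: forward normally

def etr_decapsulate(outer):
    """ETR step: strip the outer header, reconstituting the original packet."""
    return outer["payload"]
```

Tunnelling is just this wrap at the ITR and unwrap at the ETR; everything in between forwards on the outer (ETR) address.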
These schemes differ considerably in where ITRs and ETRs are
located, but the most dramatic differences between them are in how
the ITR gains the mapping information.
Example of multihoming service restoration
Please see the diagram at the start of:
A packet from sending host H1 is addressed to receiving host IH9,
which has an address in a micronet which is part of a Mapped Address
Block which is managed by Ivip.
The packet leaves the network and is forwarded to ITR1, which is an
"anycast ITR in the DFZ/core". Initially, the mapping for this
micronet is to tunnel the packet to ETR1, which is in ISP N3, one of
the two ISPs used by the multi-homed end-user.
When any of the following occur:
1 - N3's network is unreachable from the rest of the Net.
2 - ETR1 dies.
3 - The end-user's link to ETR1 fails.
then something (probably a separate commercial global multihoming
monitoring system which the end-user pays for and which controls the
mapping of their micronet) changes the mapping to tunnel these
packets to ETR2 instead.
Then the packets are delivered to IH9 again.
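The failover above amounts to a single mapping change. A hypothetical sketch, with made-up ETR addresses, of what the monitoring system's decision looks like:

```python
# Example ETR addresses for the multihomed end-user's two ISPs.
ETR1, ETR2 = "192.0.2.1", "203.0.113.1"

def choose_etr(etr1_reachable: bool) -> str:
    """Pick the working ETR; the monitoring system probes reachability."""
    return ETR1 if etr1_reachable else ETR2

mapping = {"198.51.100.0/28": choose_etr(etr1_reachable=True)}
# ... N3 becomes unreachable, ETR1 dies, or the end-user's link fails ...
mapping["198.51.100.0/28"] = choose_etr(etr1_reachable=False)
# ITRs receiving the update now tunnel these packets to ETR2 instead.
```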
Packets from H3 are handled in a similar way, except its network N2
has its own ITR.
"Query Server" is a general term for any server which responds to
queries. In the context of the RRG discussions, I use the term
specifically for certain elements of Ivip, and also more generally
to refer to what I regard as "query servers" in other proposals.
For instance, in LISP-ALT, a query from an ITR about mapping is
passed over the ALT network to some device which sends an answer to
the ITR. That may be an ETR or something else, but I refer to it,
in a general sense, as a "query server".
In APT, the "Default Mapper" is a "query server" (as well as being a
full database ITR). In TRRP, the authoritative nameservers in the
trrp.arpa domain and its subdomains are "query servers".
APT's and Ivip's query servers are local, but LISP-ALT's and TRRP's
are (in general) located somewhere in a global network.
"DFZ" means "Default Free Zone". The Internet's inter-domain
routing system uses routers which compare notes with each of their
peers (other routers they have direct links to) using BGP (Border
Gateway Protocol) about the best path to send packets on, according
to which of many prefixes the packet is addressed to. In IPv4,
there are currently about 250,000 such prefixes advertised in BGP:
Each such prefix is announced (typically) by one or more border
routers of ISPs or of end-user networks. In order to participate in
the BGP system, and thereby have a direct connection to the "core"
of the Internet, each such ISP or end-user network needs an
Autonomous System (AS) number (ASN).
The routers talk to their peers about each such prefix, telling
each peer an intentionally simplified measure of how hard it is for
the router to deliver packets addressed to each prefix. The value
given is the number of Autonomous Systems the packet would have to
pass through. This value may be boosted above the true value
according to the operator's desire not to handle such packets.
Routers decide where to send packets according to which peer reports
the lowest value, and according to locally programmed policies.
Consider a BGP border router of an ISP or end-user network, where
the ISP or end-user network has a single prefix: 198.18.0.0/20.
If this network is "single-homed" (the opposite of multi-homed) then
this router has a single upstream peer - a single BGP router in
another AS by which to send and receive packets to and from the rest
of the Internet. This is a pretty simple task:
If the destination address is within 198.18.0.0/20, it needs to
be sent somewhere in the local network.
If not, it needs to be sent to the upstream link.
This means the single-homed router's FIB (Forwarding Information
Base) functions which actually handle each traffic packet can be
pretty simple - they only need to test for 198.18.0.0/20 and any
smaller ranges of addresses within this (longer prefixes). If the
packet doesn't match one of these, then it is sent to the "default
route" - which is to the single upstream link. Even if the local
network has a few dozen or a few hundred prefixes, this is still a
relatively small task compared to what a multihomed BGP router must do.
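The single-homed FIB logic just described can be sketched as a longest-prefix match over the local prefixes with a default route fallback. The prefix here is an example (from the benchmarking range), standing in for the /20 above:

```python
import ipaddress

LOCAL_PREFIXES = ["198.18.0.0/20"]   # this network's own prefix(es)
DEFAULT_ROUTE = "upstream"           # the single upstream link

def forward(dst: str) -> str:
    """Classify a destination: local delivery, or the default route."""
    addr = ipaddress.ip_address(dst)
    # Test longer (more specific) prefixes first.
    nets = sorted((ipaddress.ip_network(p) for p in LOCAL_PREFIXES),
                  key=lambda n: n.prefixlen, reverse=True)
    for net in nets:
        if addr in net:
            return "local"
    return DEFAULT_ROUTE
```

A multihomed router cannot use this shortcut: with two or more upstream links there is no single default, so it needs an entry for every DFZ prefix.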
Also, the single-homed router only needs to have its BGP
conversations with a single peer - the router on the upstream link.
It doesn't really matter what values the peer tells it about the
250,000 or so BGP advertised prefixes, since the border router has
no other place to send packets which don't match 198.18.0.0/20.
Now consider a multihomed ISP or end-user network. Its border
router has two or more upstream links - two in this example.
For every one of the 250,000 prefixes (apart from whichever of those
are advertised by its own network), the multihomed border router
needs to make a decision about whether it is best to send them to
upstream link A or upstream link B. Generally, the packet would be
delivered either way, but one way will typically be "shorter" (by
the crude "number of ASes" measure used by BGP) than the other.
So the router's CPU conducts a set of 250,000 conversations with
each of its upstream peers - A and B. Then, it makes 250,000
decisions about which of these links is the best one to send packets
on for each such prefix. Any time local policy changes
sufficiently, or its peers change their reports sufficiently, this
router may decide that it should send packets for
this prefix to a different peer than the one it currently sends them
to. Whenever it makes such a change, it announces this to its
peers, with the appropriate number of ASes in the announcement. (A
crucial low-level detail is that the announcement contains a list of
ASNs through which the packet would travel, so other routers can
avoid using paths which include their own ASN. This is a robust way
of preventing routing loops.)
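The two rules just described - prefer the peer reporting the shortest AS path, and reject any path already containing your own ASN - can be sketched as follows. The ASNs are private-use examples, and this ignores the many other tie-breakers real BGP applies:

```python
MY_ASN = 64500   # this router's own AS number (example, private-use range)

def best_peer(announcements):
    """Pick a peer for one prefix.

    announcements: {peer_name: [ASN, ASN, ...]} - the AS path each
    peer reported for this prefix.
    """
    # Loop prevention: ignore paths that already include our own ASN.
    usable = {peer: path for peer, path in announcements.items()
              if MY_ASN not in path}
    if not usable:
        return None
    # Prefer the shortest AS path (the "number of ASes" measure).
    return min(usable, key=lambda peer: len(usable[peer]))

paths = {"A": [64501, 64502], "B": [64503, 64500, 64504]}
# Peer B's path contains our own ASN, so only peer A is usable.
```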
Then all these 250,000 decisions are programmed into the router's
FIB to handle packets in this way, which means the FIB section
(typically expensive hardware) in the router needs to be able to
cope with this many divisions.
The BGP router of a multihomed network (any BGP router with two or
more upstream links), or any "transit" router which is between
multiple ASes and has no network of its own, must always engage in
multiple sets of conversations with its multiple peers. Likewise,
its FIB always needs to have at least 250,000 separate rules by
which it can instantly (less than a microsecond or so) classify
incoming packets so as to forward them to whichever of the router's
interfaces lead to the best link for these packets.
These multihomed and transit BGP routers cannot depend on the simple
arrangement of testing for local prefixes and, if there is no match,
sending the packet according to the "default route". Their task is
much more demanding. So multihomed and transit BGP routers are said
to be in the "Default Free" Zone!
There are something like 200,000 such routers in the DFZ - see the
"Routers in DFZ - reliable figures from iPlane" thread last year:
http://psg.com/lists/rrg/2007/ . Also, someone mentioned a
similarly rough figure on the RRG list recently.
Problems with the cost of these routers, and with delays and
stability problems as they try to figure out the best path for
packets, via their 250,000 conversations with each peer, are the
main driving force behind the RRG's project of devising a new
architectural solution to this routing scaling problem:
The primary problem is that the only way a network can gain portable
address space, and/or address space which can be used for
multihoming and Traffic Engineering, is by getting its own one or
more prefixes from an RIR and advertising it (or splitting it into
longer prefixes, such as a /20 into 16 /24s) in BGP. Each such prefix
represents a further burden on all DFZ routers.
Bill Herrin attempts to estimate the cost of every such prefix, and
arrives at the conservative estimate that every time someone
advertises such a prefix, it costs everyone else USD$8000.
Part of his estimate is that the price premium of a router which can
handle the DFZ tasks is at least USD$30,000. I think this refers to
the difference between a router which can perform the multihomed BGP
border router functions and one which can't, but could do a
single-homed BGP border router function. Perhaps it means the price
difference between routers which can't handle BGP (and its 250,000+
prefixes) at all, and those which can, for both single- and
multi-homed border router scenarios.
Such prefixes are variously known as "BGP advertised prefixes", "DFZ
routes" etc. The total set of them may be known as the "global
routing table", the "DFZ routing table" or sometimes just the "DFZ".
Hence, "injecting a route into the DFZ" means a network advertising
a prefix via its BGP border router, adding one more prefix (AKA
"route") to the ~250,000 already existing.
Our primary goal is to devise some new architecture which will
enable large numbers of end-user networks (not ISPs, who really need
full BGP-managed address space) to get address space which is
portable and usable for multihoming and Traffic Engineering, without
adding further to the bloat in the "DFZ routing table". The Net is
not going to stop functioning if the current 250k size grows past
some limit, but cost and instability problems will get worse unless
something is done. Since there are millions of end-users who will
want and arguably need multihomable address space, we clearly need a
new way of providing for their needs.
While we are doing this, for instance with a map-encap scheme,
some of us also want to enable finer and less expensive divisions in
the IPv4 address space to enable higher rates of utilization - to
combat the IPv4 address depletion problem.
Since any global network of ITRs and ETRs is an extraordinarily
powerful tool which has not been contemplated before - but which
apparently needs to be built to solve the routing scaling problem -
some of us want to ensure it supports new approaches to mobility
(rapidly moving a device to another physical or topological
location, but keeping its IP address or address prefix). Existing
approaches to mobility require extensive changes to host operating
systems and generally find it a challenge to maintain optimally
short path lengths for the packets.
"Traffic Engineering" (TE) . . . Some other RRG folks could
probably provide a more comprehensive definition, but for this
discussion, TE means the ability of an edge network (and/or its ISP)
to control the path of packets over multiple alternative links,
usually according to what type of traffic the packets are part of.
This might be according to the packets' destination address, or
perhaps according to whether it is an HTTP or a VoIP voice packet.
An example of outbound TE is: a multihomed network has two upstream
links and programs its border router to send some types of packets
out link A and the rest out link B. This may achieve the goal of
load sharing, or favour one link because it is faster, cheaper, more
reliable etc. than the other. Outbound TE is easy, and not a
problem for any map-encap scheme.
The real challenge is inbound TE. How, with a map-encap scheme, can
a multihomed end-user edge network control the global ITR system so
that some traffic comes in via link A (meaning the packets are
tunnelled to an ETR in ISP-A) and the rest via another ETR in ISP-B,
arriving over link B?
LISP, APT and TRRP include TE constructs in their mapping
information, requiring each ITR to make decisions about which of
multiple ETRs to send the packets to. Ivip has no such explicit TE
functions. To achieve TE like this, the address space in question
must be split into two or more micronets, thereby splitting the
traffic (this won't work if all traffic is to one address) and then
by mapping each micronet to a different ETR.
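Ivip's split-the-micronet approach to inbound TE can be sketched as below. The prefixes and ETR addresses are example values: an address range is split into two micronets, each mapped to an ETR in a different ISP, so traffic to the two halves arrives over different links:

```python
import ipaddress

# One /28 of end-user space, split into two /29 micronets, each mapped
# to an ETR reached via a different upstream ISP (example addresses).
mapping = {
    "198.51.100.0/29": "192.0.2.1",    # micronet 1 -> ETR in ISP-A
    "198.51.100.8/29": "203.0.113.1",  # micronet 2 -> ETR in ISP-B
}

def etr_for(dst: str):
    """The ETR an ITR would tunnel to for this destination, or None."""
    addr = ipaddress.ip_address(dst)
    for prefix, etr in mapping.items():
        if addr in ipaddress.ip_network(prefix):
            return etr
    return None
```

As the text notes, this only splits traffic if it is spread across addresses in both micronets; traffic to a single address all follows one mapping.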
RUAS - Root Update Authorisation System
In Ivip, there are multiple BGP advertised prefixes within which the
address space is managed by Ivip's mapping system. Each such prefix
is called a Mapped Address Block (MAB). ITRs recognise packets
addressed to any one of these MABs and tunnel them to an ETR,
according to the mapping for the micronet to which the packet
is addressed. (All addresses within a micronet have the same
mapping information - simply an ETR address.)
Each MAB - such as a /12, /16 or /20 - typically contains many areas
called User Address Blocks (UABs), each of which is controlled by a
single end-user. The end-user decides how to split its UAB into
micronets, and then decides the mapping (ETR address) for each micronet.
There are multiple RUASes. Each one is authoritative for one or
more MABs. Therefore, each end-user has a direct or indirect
relationship with the RUAS which is responsible for controlling the
mapping of the MAB within which its UAB is located.
RUASes work together to create a stream of updates, such as one
every second, which are sent out through a cross-linked
tree-structured system of "Replicators" to all the world's full
database ITRs and full database query servers. The goal is for an
end-user's mapping change command, sent directly or indirectly to an
RUAS, to be received by all the world's ITRs within 5 seconds. This
includes caching ITRs which are handling packets - or at least have
recently handled packets - addressed to this micronet.
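Why a tree of Replicators can meet a 5-second goal is simple arithmetic. A toy sketch, with the fan-out and depth as assumed example numbers, not Ivip's actual parameters:

```python
def reach(levels: int, fanout: int) -> int:
    """Devices reachable after `levels` replication hops, if each
    Replicator forwards every update to `fanout` downstream devices."""
    return fanout ** levels

# e.g. 5 levels of fan-out 8 cover 32768 full database ITRs and query
# servers, so each update crosses only a handful of forwarding hops.
```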
ITRD - Full Database ITR
These ITRs get the full feed of mapping updates, and therefore
maintain in real-time a copy of the entire Ivip mapping database.
This means when they get a packet addressed to any MAB, they already
have the mapping information for all the micronets in that MAB and
therefore know which ETR the packet should be tunnelled to.
In the longer-term future, when there are millions or billions of
micronets, it is unlikely that every (or any) ITRD will have its FIB
already programmed to handle packets addressed to every possible
micronet. More likely, across the whole address range, the FIB will
specify something like:
1 - Forward the packet to an interface, as already determined by
decisions made with BGP conversations with peers.
(Conventional BGP-based forwarding as is done today.)
2 - The packet is addressed to a particular micronet, and
the ETR address to tunnel it to is therefore: aa.bb.cc.dd
3 - The packet is addressed to a MAB, but the FIB does not
currently have the ETR address for its micronet. So query
the router's central CPU, find the mapping and then
go to step 2.
4 - Handle the packet in some other way.
5 - None of the above - drop the packet.
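The five FIB cases above can be sketched as a classification function. The MAB, micronet and ETR values are illustrative assumptions, and only the first three cases are shown since the last two are catch-alls:

```python
import ipaddress

MABS = {"198.51.100.0/24"}                         # example Mapped Address Block
FIB_MICRONETS = {"198.51.100.0/28": "192.0.2.1"}   # micronets already programmed

def classify(dst: str):
    """Decide how the FIB handles a packet to `dst`."""
    addr = ipaddress.ip_address(dst)
    # Case 2: micronet mapping already in the FIB - tunnel to its ETR.
    for prefix, etr in FIB_MICRONETS.items():
        if addr in ipaddress.ip_network(prefix):
            return ("tunnel", etr)
    # Case 3: in a MAB, but mapping not yet programmed - ask the CPU,
    # which finds the mapping and then proceeds as in case 2.
    for mab in MABS:
        if addr in ipaddress.ip_network(mab):
            return ("query-cpu", None)
    # Case 1: ordinary BGP-based forwarding (cases 4 and 5 omitted).
    return ("bgp-forward", None)
```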
So I think ITRDs of the future will be "caching ITRs" in their FIB
packet handling section, with an inbuilt full database query server.
I think the same would be true of the full database ITRs of other
schemes: all ITRs in LISP-NERD and the Default Mappers of APT.
ITRC - Caching ITR
An ITR which doesn't have a full copy of the mapping database. It
sends a query to another device, a full database query server or a
caching query server, and pretty soon (tens of milliseconds max,
unless a query or response packet is lost, or the query server is
dead) gets a response back. It holds the packet until it gets the
response, and then tunnels it according to the mapping information.
ITRCs clearly need less storage than ITRDs. Also, they don't need to
get the continual feed of mapping updates. So ITRCs are much
cheaper and can be much more numerous. This spreads the load of ITR
work, and by bringing the ITR function closer to the sending host,
helps ensure the total path taken by the packet is as short as possible.
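The ITRC behaviour described above - query on a cache miss, hold the packet until the answer arrives, then tunnel - can be sketched as follows, with an assumed stand-in for the query server:

```python
cache = {}   # micronet -> ETR address, learned from query servers

def query_server(micronet):
    """Stand-in for a query to a QSD or QSC (example data)."""
    return {"198.51.100.0/28": "192.0.2.1"}.get(micronet)

def itrc_handle(micronet, packet):
    """Caching ITR: on a miss, query (holding the packet), then tunnel."""
    if micronet not in cache:
        cache[micronet] = query_server(micronet)   # packet held meanwhile
    return {"dst": cache[micronet], "payload": packet}
```

Subsequent packets to the same micronet hit the cache and are tunnelled immediately, which is why ITRCs can be cheap and numerous.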
In other proposals, all LISP-ALT ITRs are caching ITRs, and so are
all ITRs in TRRP. The ITRs in APT which are not Default Mappers are
also caching ITRs. Only LISP-NERD has no caching ITRs.
Ivip also has the option of an ITRC function being built into a
sending host (ITFH - ITR Function in Host). The sending host's
address can be an Ivip-mapped address, or an ordinary BGP-managed
("RLOC" in LISP terms) address, but it cannot be behind NAT. The
primary reason for this is that an ITRC needs to be reachable by a
query server when the query server sends it a "Notification" that
some mapping has changed for a micronet the ITRC recently made a
query about. ITFHs should be essentially zero-cost ITRs, with
optimal path lengths.
QSD - Full Database Query Server
Like an ITRD, a QSD gets the full feed of mapping updates and so has
a real-time updated copy of the whole mapping database (the sum
total of all the mapping for all the MABs from all the RUASes).
QSDs are intended to be in ISP and end-user networks so nearby
ITRCs, ITFHs and QSCs can quickly and reliably get the mapping
information they need. The QSD keeps a record of queries it
answered recently, and the caching times for the mapping information
it sent back, so that if mapping for one of these micronets changes
within a caching time, the QSD will send out a "Notification" with
the new mapping information to the querier.
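The QSD's notification bookkeeping can be sketched like this (structure and names are assumptions): remember who asked about each micronet and until when they will cache the answer, and if the mapping changes before then, notify them:

```python
import time

recent_queries = {}   # micronet -> list of (querier, cache_expiry)

def answer_query(micronet, querier, etr, cache_secs=600):
    """Answer a mapping query, recording the querier and caching time."""
    recent_queries.setdefault(micronet, []).append(
        (querier, time.time() + cache_secs))
    return etr

def mapping_changed(micronet, new_etr):
    """Return the queriers still caching the old answer, who must be
    sent a Notification carrying `new_etr`."""
    now = time.time()
    return [q for q, expiry in recent_queries.get(micronet, [])
            if expiry > now]
```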
In APT, the Default Mapper is also a "local" full database query
server. It has the full database and is within an ISP network, close
to the caching ITRs in that network, meaning the replies come quickly.
QSC - Caching Query Server
Ivip has the option for caching query servers. It may be best for a
network to have one or a few QSDs, and to enable ITRCs and ITFHs to
query some closer, more numerous, cheaper and more lightly loaded
QSCs, rather than directly querying the few expensive, busy,
probably more distant, QSDs. This way, very often (ideally) the
local QSC will already have the mapping a particular ITRC/ITFH
needs, since it was probably asked about the same micronet recently
by some other ITRC/ITFH. This reduces load on the QSDs and speeds
the response time for some or many queries.
QSCs pass on Notifications to whichever device queried them about
the micronet which has just had its mapping changed.
An ITRC or ITFH could query a QSC or a QSD. Each QSC could query
directly to a QSD, or ask another QSC. In the latter case, there
could be multiple levels of QSC, but eventually the answer would be
given by a QSD.
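The query chain just described can be sketched as a small class hierarchy (all names assumed): a QSC answers from its cache and otherwise asks its parent, which may be another QSC or, ultimately, a QSD, which always has the answer:

```python
class QSD:
    """Full database query server: has every mapping."""
    def __init__(self, db):
        self.db = db
    def lookup(self, micronet):
        return self.db[micronet]

class QSC:
    """Caching query server: answers from cache, else asks its parent."""
    def __init__(self, parent):
        self.parent, self.cache = parent, {}
    def lookup(self, micronet):
        if micronet not in self.cache:          # miss: ask upstream
            self.cache[micronet] = self.parent.lookup(micronet)
        return self.cache[micronet]

qsd = QSD({"198.51.100.0/28": "192.0.2.1"})
qsc = QSC(QSC(qsd))   # two levels of QSC; a QSD is always at the root
```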
to unsubscribe send a message to email@example.com with the
word 'unsubscribe' in a single line as the message text body.
archive: <http://psg.com/lists/rrg/> & ftp://psg.com/pub/lists/rrg