**An Architecture for Hyperscale NAT64**
Brian Candler, DRAFT 2016-03-14
# Abstract
This document outlines a design for a NAT64 translator cluster which can
scale to arbitrary traffic volumes, as would be required for translating a
non-trivial proportion of Internet traffic.
# Motivation
Commercial large-scale or "carrier-grade" NAT64 implementations are
available, but they are proprietary, expensive, and of uncertain
scalability.
The design outlined here has the following characteristics:
* Horizontally scalable: simply add more boxes as traffic, CPU and/or
RAM limits are hit
* Add more IPv4 address pools as concurrent client usage increases
* Low disruption to existing flows as pools are grown
* Deterministic selection of translated IPv4 address and dynamic
selection of port range
* Minimal logging of translated flows required for abuse tracking
and law enforcement
* Suitable for implementation on commodity hardware
# High-level design
## Two-layer architecture
A collection of hosts logically forms a single "translator cluster", which
translates IPv6 traffic to destination addresses of the form NPFX::z.z.z.z
to IPv4 address z.z.z.z (where NPFX is the chosen IPv6 prefix for the
cluster)
Internally, the cluster comprises a number of hosts grouped in two stages: a
NAT66 stage and a NAT64 stage. (Combining both functions into a single host
is perfectly possible but not considered here)
~~~
s USER:a
s USER:a d STG2:x:y:z s x (IPv4)
d NPFX::z +---+ ,----------------->+---+ d z (IPv4)
------------->| | --.,--------------->| | <---------->
+---+ \/\ +---+
+---+ /\ `-------------->+---+
------------->| | ---\--------------->| | <---------->
+---+ \ \ +---+
STAGE 1 \ `------------->+---+
`--------------->| | <---------->
+---+
STAGE 2
~~~
The first stage is a *stateless*, deterministic selection of IPv4 source
address as a function of the IPv6 source address (USER:a). All stage 1
hosts are identical. The destination IPv6 address is rewritten to an
intermediate form which contains both the selected IPv4 source (x) and
target destination IPv4 address (z). These intermediate addresses lie
within an internal prefix STG2::/32.
IPv6 routing then naturally delivers the packet to the correct stage 2 box
which announces the prefix covering STG2:x::/64.
The second stage is a traditional *stateful* NAT64, except that the source
IPv4 address (x) has already been selected and is extracted from the IPv6
destination address.
Each stage 2 box "owns" its own range or ranges of IPv4 addresses, and
regular IPv4 routing is used to ensure the return traffic hits the right
box. After conversion back to IPv6, the return traffic can be dumped
straight back into the IPv6 network, and need not touch the Stage 1 boxes
again.
Scalability is therefore achieved by:
- building sufficient stage 1 boxes for the source selection algorithm;
- breaking up the total IPv4 pool into small enough pieces that a single
stage 2 box can handle the required proportion of overall traffic.
## Format of intermediate addresses
The intermediate addresses are rewritten inside a /32 IPv6 prefix. This
need not be globally routable: ULA addresses are fine. The example here
uses fd00::/32
These addresses are structured as follows:
~~~
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| STG2 prefix | IPv4 source |chkcomp|portsel| IPv4 dest |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
fd 00 00 00
(example)
~~~
The field "chkcomp" is used to compensate for changes in the upper layer
checksum which would otherwise be required, using the algorithm in RFC6296.
The field "portsel" may be used to influence stage 2 translator operation,
and is described later but is not required for basic NAT64 operation.
Both fields are shown as zero in the following exposition.
## Example of operation
Let's say we have three stage 2 servers:
* S2a has IPv4 pool 192.0.2.0/26
* S2b has IPv4 pool 192.0.2.64/26
* S2c has IPv4 pools 192.0.2.128/26 and 192.0.2.192/28 (perhaps because
it is a newer and more powerful box, and we wish it to take a higher
share of traffic)
This gives a total of 208 IPv4 translated addresses available.
Each S2 server announces an IPv6 prefix or prefixes with the IPv4 prefix(es)
it owns shifted left 64 bits (i.e. in bits 32-63 of the IPv6 address).
These announcements are made towards the S1 servers.
* S2a announces `fd00:0000:c000:0200::/58`
* S2b announces `fd00:0000:c000:0240::/58`
* S2c announces `fd00:0000:c000:0280::/58` and
`fd00:0000:c000:02c0::/60`
To clarify the S2b route:
~~~
fd00:0000 : c0 00 : 02 40 : 0000:0000:0000:0000
11000000 00000000 00000010 01000000
---------------------------------------->| /58
is equivalent to
fd00:0000 : 192 . 0 . 2 . 64 : 0000:0000:0.0.0.0
11000000 00000000 00000010 01000000
---------------------------->| /26
~~~
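To make this mapping concrete, here is an illustrative Python sketch (using
the example fd00::/32 intermediate prefix) which converts an IPv4 pool into
the intermediate IPv6 prefix that its stage 2 server would announce:

~~~python
import ipaddress

STG2_PREFIX = ipaddress.IPv6Network("fd00::/32")   # example intermediate prefix

def announcement_for_pool(ipv4_pool: str) -> ipaddress.IPv6Network:
    """Map an IPv4 pool (e.g. 192.0.2.64/26) to the intermediate IPv6
    prefix announced by the stage 2 server which owns it.

    The IPv4 network address is placed in bits 32-63 of the IPv6 address,
    i.e. shifted left 64 bits from its usual place in the last 32 bits,
    and the prefix length grows by 32 (26 -> 58)."""
    pool = ipaddress.IPv4Network(ipv4_pool)
    base = int(STG2_PREFIX.network_address) | (int(pool.network_address) << 64)
    return ipaddress.IPv6Network((base, pool.prefixlen + 32))

print(announcement_for_pool("192.0.2.64/26"))   # fd00:0:c000:240::/58
print(announcement_for_pool("192.0.2.192/28"))  # fd00:0:c000:2c0::/60
~~~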
Let's assume the IPv6 translator prefix announced to the public IPv6
Internet is 2001:db8:6464::/96.
Now consider what happens when an IPv6 packet arrives with source
2001:db8:1234::5 and destination 2001:db8:6464::203.0.113.99.
1. The packet arrives at any one of the stage 1 servers
2. The stage 1 server looks at the source IPv6 address, and uses
this to select one of the 208 available IPv4 addresses
deterministically. Let's say it picks 192.0.2.100
3. The stage 1 server rewrites the destination IPv6 address as
`fd00:0000:192.0.2.100::203.0.113.99`, more correctly represented as
`fd00:0000:c000:0264::203.0.113.99`, and sends it out.
4. This packet arrives at whichever stage 2 server is announcing the
corresponding IPv6 prefix, which in this case is S2b. (Observe that
`fd00:0000:c000:0264::203.0.113.99` is within
`fd00:0000:c000:0240::/58`)
5. S2b goes through a regular NAT64 process, but choosing the IPv4 source
address from bits 32-63 of the destination IPv6 address. It sends out
an IPv4 packet with source 192.0.2.100 and destination 203.0.113.99, and
creates internal state. This packet enters the IPv4 Internet.
6. The return packet arrives back at S2b, is looked up in the state tables,
untranslated to source 2001:db8:6464::203.0.113.99 and destination
2001:db8:1234::5, and sent on its way to the IPv6 Internet.
7. Future IPv6 packets for this session can hit any stage 1 box, but as
long as the address selection algorithm is consistent, they will be
directed to the same stage 2 box as before, which is where the correct
state is stored.
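The stage 1 rewrite in steps 2 and 3 can be sketched as follows (Python,
illustrative only: the consistent-hash selection of the IPv4 source address
is taken as an input, and chkcomp/portsel are left at zero as in the
exposition above):

~~~python
import ipaddress

NAT64_PREFIX = ipaddress.IPv6Network("2001:db8:6464::/96")  # public translator prefix
STG2_BASE = int(ipaddress.IPv6Address("fd00::"))            # intermediate /32

def stage1_rewrite(dst: str, selected_v4_source: str) -> ipaddress.IPv6Address:
    """Rewrite a public NAT64 destination into the intermediate form.

    The original destination carries the target IPv4 address in its last
    32 bits; the selected IPv4 source goes into bits 32-63.  chkcomp and
    portsel (bits 64-95) are left as zero here."""
    dst_v6 = ipaddress.IPv6Address(dst)
    assert dst_v6 in NAT64_PREFIX
    v4_dest = int(dst_v6) & 0xFFFFFFFF                         # z.z.z.z
    v4_src = int(ipaddress.IPv4Address(selected_v4_source))    # x.x.x.x
    return ipaddress.IPv6Address(STG2_BASE | (v4_src << 64) | v4_dest)

# Worked example: destination 203.0.113.99, selected source 192.0.2.100
print(stage1_rewrite("2001:db8:6464::203.0.113.99", "192.0.2.100"))
# -> fd00:0:c000:264::cb00:7163, i.e. fd00:0000:c000:0264::203.0.113.99
~~~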
The stage 2 operation is slightly different to standard NAT64 in two ways.
Firstly, it needs to extract a specified source IPv4 address from the
destination IPv6 address (step 5). Secondly, it constructs response packets
using the translator's public prefix (step 6). This latter step is an
optimisation to avoid packets returning via the stage 1 translators.
The above example is only slightly simplified: it ignores the checksum
compensation (in step 3); it assumes that source port selection is entirely
at the discretion of the stage 2 box (in step 5); and that the cluster has
only one public IPv6 translator prefix (in step 6). Those details will be
presented next.
## Checksum compensation
Bits 64-79 of the translated destination address (the "chkcomp" field) are
set using the algorithm in sections 3.2 and 3.5 of RFC6296, to make the
overall stage 1 translation checksum-neutral. This is a simple calculation
which avoids any further need for the stage 1 translator to inspect or
update the upper-level protocol checksum (e.g. TCP/UDP).
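A minimal sketch of that calculation, assuming only the destination address
is rewritten at stage 1 (the IPv6 source is untouched, so only the
destination's contribution to the pseudo-header sum matters); it follows the
spirit of the RFC6296 technique rather than quoting its text:

~~~python
import ipaddress

def ones_sum(addr: int) -> int:
    """One's-complement sum of the eight 16-bit words of a 128-bit address."""
    total = 0
    for shift in range(112, -16, -16):
        total += (addr >> shift) & 0xFFFF
        total = (total & 0xFFFF) + (total >> 16)   # end-around carry
    return total

def chkcomp(old_dst: int, new_dst: int) -> int:
    """16-bit value for bits 64-79 of the intermediate address which makes
    the destination rewrite checksum-neutral: the upper-layer checksum
    covers the destination address in the pseudo-header, so we choose a
    filler word that keeps its one's-complement sum unchanged."""
    return (ones_sum(old_dst) - ones_sum(new_dst)) % 0xFFFF

old = int(ipaddress.IPv6Address("2001:db8:6464::203.0.113.99"))
new = int(ipaddress.IPv6Address("fd00:0:c000:264::cb00:7163"))  # chkcomp still zero
patched = new | (chkcomp(old, new) << 48)                       # fill bits 64-79
assert ones_sum(patched) == ones_sum(old)                       # checksum-neutral
print(ipaddress.IPv6Address(patched))
~~~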
## Multiple translator prefixes
It is possible the translator cluster operator will want to serve multiple IPv6
NAT64 prefixes using the same cluster. Examples are:
* It might want to announce both its own unique block `2001:db8:6464::/96`
and an anycast block like `64:ff9b::/96`
* In the event of one translator cluster failing totally or being taken off-line,
it may be desirable for a different translator cluster to announce the prefix
of the failed one in addition to its own
This can be easily handled, but the stage 2 translator needs to know which
prefix was used for any particular session, so that the return traffic can
be rewritten with the right IPv6 source address applied (in step 6 of the
worked example).
The mechanism proposed is to use the lowest two bits of the "portsel" field
for this; portsel (bits 80-95) is the remaining 16-bit field in the
intermediate translated address.
~~~
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| | P P |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
80 94 95
~~~
Both the stage 1 and stage 2 translators are configured with a static
mapping table of up to four public IPv6 prefixes.
The stage 1 translator inserts the P bits in the intermediate address,
dependent on the original destination IPv6 address seen (before it is
rewritten into the intermediate form). The stage 2 translator records them
along with the session state; and when incoming IPv4 packets are received,
uses this state to select the appropriate IPv6 source prefix for return
packets.
(An alternative would be to use high bits from the STG2 prefix, e.g. bits
16-31)
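An illustrative sketch of the prefix-selector bookkeeping, using the two
example public prefixes mentioned above (in a real implementation the P bits
travel in the intermediate address and in the stage 2 session state rather
than as Python values):

~~~python
import ipaddress

# Static table of up to four public NAT64 prefixes, indexed by the P bits.
# The prefixes below are the examples used in this document.
PUBLIC_PREFIXES = [
    ipaddress.IPv6Network("2001:db8:6464::/96"),   # P = 0
    ipaddress.IPv6Network("64:ff9b::/96"),         # P = 1
]

def select_p_bits(original_dst: str) -> int:
    """Stage 1: record which public prefix the packet was sent to."""
    dst = ipaddress.IPv6Address(original_dst)
    for p, prefix in enumerate(PUBLIC_PREFIXES):
        if dst in prefix:
            return p
    raise ValueError("destination not within any served NAT64 prefix")

def return_source(p_bits: int, v4_remote: str) -> ipaddress.IPv6Address:
    """Stage 2: build the IPv6 source address for return packets from the
    stored P bits and the remote IPv4 address."""
    prefix = PUBLIC_PREFIXES[p_bits]
    return ipaddress.IPv6Address(int(prefix.network_address)
                                 | int(ipaddress.IPv4Address(v4_remote)))

p = select_p_bits("64:ff9b::203.0.113.99")
print(p, return_source(p, "203.0.113.99"))   # 1 64:ff9b::cb00:7163
~~~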
# Static A+P configuration
Some ISP/carrier configurations may wish to use an explicit static mapping
from IPv6 prefixes to IPv4 source address and port range. This can be
configured simply using a static prefix lookup table, distributed to each of
the stage 1 servers.
The stage 1 server selects both an IPv4 address and a port range for each
IPv6 prefix. The remaining "portsel" bits in the intermediate IPv6 address
are used to signal the selected port range to the stage 2 host.
The exact use of these bits is up to the design of the translator cluster,
but here is a simple approach:
~~~
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| B B B B B B | 0 0 | N N N N N N | P P |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
~~~
A value of zero in all of the first 14 bits (B=0, N=0) means that the
stage 2 translator is permitted to use the whole range of source ports
(1024-65535). Otherwise it breaks down as:
* B: concatenate with 10 zeros to find start of port range (B>=1)
* N: size of port range is N*1024 ports (N>=1)
Ports 0-1023 are never permitted. If the end of the port range would extend
beyond 65535 it is truncated.
This allows port ranges of between 1024 ports up to the full range.
Example:
~~~
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 1 0 0 1 1 0 0 0 | 0 0 0 0 1 0 0 0 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
~~~
* Port range is 0x9800 - 0x9fff (38912 - 40959): 2048 ports
* (Prefix selector = 00)
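A small decoding sketch for this static portsel layout (Python, illustrative),
which reproduces the example above:

~~~python
def decode_static_portsel(portsel: int):
    """Decode the 16-bit portsel field laid out as | B(6) | 0 0 | N(6) | P(2) |.
    Returns (first_port, last_port, prefix_selector)."""
    b = (portsel >> 10) & 0x3F
    n = (portsel >> 2) & 0x3F
    p = portsel & 0x03
    if b == 0 and n == 0:
        return 1024, 65535, p                   # whole dynamic range permitted
    start = b << 10                             # concatenate B with 10 zero bits
    end = min(start + n * 1024 - 1, 65535)      # truncate at 65535
    return start, end, p

print(decode_static_portsel(0x9808))   # (38912, 40959, 0) as in the example
~~~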
# Dynamic A+P operation
For Internet-scale use it is not practical to configure explicit per-client
mappings. A zero-configuration approach is described here.
## Consistent hashing
The approach by which a stage 1 server selects an IPv4 address to use is
called *consistent hashing*. It works as follows.
1. Define a hash space; let's say a 64-bit space. You can consider this as
a circle running from 0 up to 2^64-1 and wrapping back around to 0.
2. Expand every available IPv4 prefix into individual IPv4 addresses, N in
total. Each IPv4 address is hashed into this hash space M times with M
different seeds. We therefore have N x M points on this circle, each
point associated with a single IPv4 address.
3. (Part of) each IPv6 source address is also hashed into a 64-bit value,
giving another point on the circle.
4. To find the IPv4 address to use, move back along the circle until you
find one of the points made in step (2). The IPv4 address associated
with this point is the one to use.
The factor M provides better distribution of load and re-distribution of
hashes if an IPv4 address is removed (or a new IPv4 address is added). With
M=256, each IPv4 address covers 256 random segments of the circle. If the
address is lost, then users who were assigned to those segments will instead
pick up the points earlier on the circle, i.e. will be spread across
approximately 256 other IP addresses. Similarly, if a new address is added,
it will pick up a small proportion of users from other addresses.
Crucially however, the *majority* of users continue to use their existing
IPv4 address unchanged, and hence their NAT64 sessions are unaffected.
The following example shows prefix 192.0.2.0/30 (4 IPv4 addresses), M=4, and
a 16-bit hash space
~~~
*0fee [192.0.2.0]
[192.0.2.3] ff7c* *3559 [192.0.2.1]
[192.0.2.2] f9ee* *3d95 [192.0.2.1]
[192.0.2.3] f763* *403e [192.0.2.3]
[192.0.2.0] e482* *4ace [192.0.2.1]
[192.0.2.2] e1c5* *4cdd [192.0.2.3]
[192.0.2.0] e0c2* *5e23 [192.0.2.1]
[192.0.2.2] e0bc* *bd91 [192.0.2.0]
*d154 [192.0.2.2]
~~~
These values were obtained by taking the first 16 bits of the MD5 hash of
each of the following strings:
* "192.0.2.0|0" = bd91
* "192.0.2.0|1" = e482
* "192.0.2.0|2" = e0c2
* "192.0.2.0|3" = 0fee
* "192.0.2.1|0" = 4ace
* "192.0.2.1|1" = 3d95
* "192.0.2.1|2" = 3559
* "192.0.2.1|3" = 5e23
* "192.0.2.2|0" = d154
* "192.0.2.2|1" = f9ee
* "192.0.2.2|2" = e1c5
* "192.0.2.2|3" = e0bc
* "192.0.2.3|0" = f763
* "192.0.2.3|1" = 4cdd
* "192.0.2.3|2" = 403e
* "192.0.2.3|3" = ff7c
Now suppose we get a packet with source address "2001:db8:1234::5". This
hashes to 371c. The point in the circle before this is 3559, and therefore
the assigned IPv4 address is 192.0.2.1.
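The following sketch builds the same ring using the recipe above (first 16
bits of the MD5 of "address|seed") and performs the backwards lookup with a
sorted array; it should reproduce the example, including the assignment of
hash 371c to 192.0.2.1:

~~~python
import bisect
import hashlib
import ipaddress

def point(s: str) -> int:
    """First 16 bits of the MD5 hash of a string, as in the example above."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:2], "big")

def build_ring(pool: str, m: int = 4):
    """Hash every address in the pool m times and return the ring as two
    parallel, sorted lists: hash points and their owning IPv4 addresses."""
    ring = []
    for addr in ipaddress.IPv4Network(pool):
        for seed in range(m):
            ring.append((point(f"{addr}|{seed}"), str(addr)))
    ring.sort()
    return [h for h, _ in ring], [a for _, a in ring]

def lookup(keys, values, h: int) -> str:
    """Pick the point at or immediately before h, wrapping around the circle."""
    i = bisect.bisect_right(keys, h) - 1     # -1 wraps to the last (highest) point
    return values[i]

keys, values = build_ring("192.0.2.0/30", m=4)
# A source that hashes to 0x371c lands just after point 0x3559 -> 192.0.2.1
print(lookup(keys, values, 0x371c))
~~~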
The actual hash doesn't really matter, as long as it gives a reasonable
random spread. The example above turns out to be quite badly distributed: some
ranges are very large (5e23...bd91) and some tiny (e0bc...e0c2), and the
overall distribution of IPv4 usage is:
* 192.0.2.0: 0fee...3559, bd91...d154, e0c2...e1c5, e482...f763 = 19730 = 30.1%
* 192.0.2.1: 3559...403e, 4ace...4cdd, 5e23...bd91 = 27746 = 42.3%
* 192.0.2.2: d154...e0c2, e1c5...e482, f9ee...ff7c = 6073 = 9.3%
* 192.0.2.3: 403e...4ace, 4cdd...5e23, f763...f9ee, ff7c...0fee = 11987 = 18.3%
However, this is due to the small size of the example, and a larger factor M
improves the spread dramatically. In this case, using the same 4 IP
addresses but M=256 would give utilisation of each IP address of 24.32%,
24.19%, 25.16% and 26.34% respectively.
In practice, a higher-performing algorithm than MD5 would be used, at least
when processing the incoming IPv6 source addresses; and all the processing
would be in binary not ASCII representation.
A suitable hash algorithm would give good random spread and yet be efficient
to implement on 64-bit processors and/or direct hardware implementations.
(TODO: assess suitability of algorithms. MurmurHash3-128? CityHash64?
Older functions like CRC-64-ECMA, Fletcher-64? Note: Nehalem/i7/SSE4.2 has
a CRC32 primitive on-chip!)
(TODO: if two addresses hash to *exactly* the same value, need to define
which takes precedence)
## Masked IPv6 source bits
It is not desirable to hash the whole 128 bits of source IPv6 address when
selecting the IPv4 address to use.
If we used all 128 bits, then all the users in any network will be spread
evenly over all the available IPv4 addresses; the use of
periodically-changing privacy addresses will ensure that one network will
eventually make use of *all* available IPv4 addresses at some point.
We can improve this by hashing only a portion of the source IPv6 address.
If we take a hash of the first 64 bits only, then this means that all the
users in one particular network will map to the same public IPv4 address.
This is in any case what users expect when sitting behind a traditional NAT.
Such mappings would rarely change (only when IPv4 prefixes are added to or
removed from the translator cluster).
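A one-function sketch of the masking step: only the upper 64 bits of the
source address feed the consistent hash, so every host (and every privacy
address) within a /64 produces the same key and hence the same IPv4 address:

~~~python
import ipaddress

def source_key(src: str) -> bytes:
    """Hash input derived from only the first 64 bits of the IPv6 source."""
    prefix64 = int(ipaddress.IPv6Address(src)) >> 64
    return prefix64.to_bytes(8, "big")

# Two hosts in the same /64 produce the same key, hence the same IPv4 address
assert source_key("2001:db8:1234::5") == source_key("2001:db8:1234::abcd")
~~~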
## Distribution of source networks
Given a set of active source prefixes, the consistent hash will aim to spread them
evenly over the available IPv4 addresses. For example, if there are 4
million users and 1 million IPv4 addresses, on average each address will be
in use by 4 users.
Some addresses will be used by fewer, and some by more, although the
probability of an individual address being used by (say) 6 or more users
will be low.
TODO: [Do the math](http://stats.stackexchange.com/questions/43575/random-balls-in-random-buckets-what-are-the-characteristics-of-the-distribution)
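As a starting point for that TODO: if active /64 prefixes are assumed to hash
independently and uniformly over the pool, the number of prefixes landing on
one IPv4 address is approximately Poisson-distributed with mean equal to the
sharing ratio, which is easy to tabulate:

~~~python
from math import exp, factorial

def poisson_tail(lam: float, k: int) -> float:
    """P[X >= k] for a Poisson(lam) variable: the expected fraction of IPv4
    addresses carrying k or more prefixes, when prefixes hash uniformly and
    lam = prefixes / addresses."""
    return 1.0 - sum(exp(-lam) * lam**i / factorial(i) for i in range(k))

# 4 million users over 1 million addresses: average load 4 per address
print(poisson_tail(4.0, 6))   # fraction of addresses with 6 or more users
~~~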
## Source port requirements
Empirically, we know that a typical office or school network with NAT44
normally has a single public IPv4 address, and it works fine. If we take it
as good practice that a layer 2 broadcast domain (subnet) has up to 250
devices on it, then we believe those 250 devices happily share a range of
around 64,000 ports. If all were active at the same time, they would be
using 256 ports each on average; if only a quarter were active at the same
time then they would be happy with an average of 1024 ports each. This also
ties up with our experience of client devices: if you type "netstat" on a
client device it would be rare to see many hundreds of open sockets.
This means that in principle, an individual end user or small home network
might be happy with an allocation of maybe as little as 1024 ports. However
a larger office or school network (also a /64) may require much more.
## Source port selection algorithm
Our port allocation strategy has to allow for this, whilst ideally
maintaining separate port ranges for each user. Here is a proposed
approach.
The port space is divided into 64 blocks of 1024 ports. Block 0 is reserved.
The remainder are split into two ranges, "dedicated" and "shared".
~~~
+-----------------+
| 63 |
| ... SHARED |
| D+1 |
+-----------------+
| D |
| ... DEDICATED |
| 1 |
+-----------------+
| 0 Reserved |
+-----------------+
~~~
This is a static, system-wide split. For example, D=31 gives ports
1024-32767 in the dedicated range, and 32768-65535 in the shared range.
Dynamic port allocation is signalled using the following values in portsel:
the value would likely be fixed in the stage 1 servers.
~~~
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 0 0 0 0 0 0 L L | N N N N N N | P P |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
~~~
* N = the number of 1024-port ranges which may be used in the dedicated
range (or in the shared range if the dedicated range is full), for
each distinct /64 prefix (N >= 1)
* L = the number of additional port ranges which may be used in the
shared range, per distinct /64 prefix
* 00 = none
* 01 = (63-D)/4
* 02 = (63-D)/2
* 03 = 63-D, i.e. the entire shared range
* P = prefix selector (from before)
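For illustration, a sketch decoding this dynamic portsel layout into concrete
per-prefix limits, with D as the system-wide dedicated/shared split (the
value 31 below is just the example used later):

~~~python
D = 31   # example split: blocks 1..31 dedicated, 32..63 shared

def decode_dynamic_portsel(portsel: int):
    """Split the 16-bit portsel into (N, shared_limit, P) per the table above."""
    l = (portsel >> 8) & 0x03
    n = (portsel >> 2) & 0x3F
    p = portsel & 0x03
    shared_limit = {0: 0, 1: (63 - D) // 4, 2: (63 - D) // 2, 3: 63 - D}[l]
    return n, shared_limit, p

print(decode_dynamic_portsel(0b0000001100001000))   # N=2, L=3 -> (2, 32, 0)
~~~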
The algorithm works as follows.
* Each of the 63 port ranges is either "unused" or "in use".
* When activity is first seen from a new /64 prefix, it is assigned an
unused port range from either the dedicated or shared ranges, preferring
the dedicated range if available.
* When that port range is full, another is assigned in the same way, until
the prefix holds N allocations.
* Beyond that, additional allocations may only be made in the shared
range (D+1 to 63), up to the limit defined by L.
* Allocations in the shared range prefer unused ranges, but may also
be shared (preferring ranges which are currently allocated to the
lowest number of prefixes). When given a choice of in-use ranges,
select the range with the most free ports.
State can be conveniently represented using bitmaps in a 64-bit word:
* Each IPv4 address has a bitmap indicating which port ranges are in use
* Each active IPv6 /64 prefix has a bitmap indicating which ranges are in
use by that prefix
* A fixed bitmask can be used to identify shared port ranges
These flags are "sticky"; that is, once a range has been allocated to a
prefix, it remains allocated. At the end of a period (suggested to be 24
hours) they are flushed to disk and reset, but remain set for any NAT
sessions which are still active at that point in time.
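A sketch of the allocation step using those 64-bit bitmaps (Python,
illustrative; re-use of already-occupied shared blocks and the periodic
flush/reset are deliberately omitted, and D, N and the shared-range limit are
taken as parameters):

~~~python
D = 31                                      # example split: 1..31 dedicated, 32..63 shared
SHARED_MASK = (1 << 64) - (1 << (D + 1))    # bits D+1..63

def popcount(x: int) -> int:
    return bin(x).count("1")

def first_free(in_use: int, blocks) -> int:
    """Lowest-numbered block in `blocks` whose bit is clear in `in_use`, or -1."""
    for b in blocks:
        if not (in_use >> b) & 1:
            return b
    return -1

def allocate_block(addr_in_use: int, prefix_owned: int, n: int, shared_limit: int) -> int:
    """Choose the next 1024-port block for a /64 prefix on one IPv4 address.

    addr_in_use  -- bitmap of blocks already allocated on this IPv4 address
    prefix_owned -- bitmap of blocks already held by this /64 prefix
    n            -- portsel N: allocations allowed before going shared-only
    shared_limit -- extra shared blocks allowed (derived from portsel L)

    Returns a block number 1..63, or -1 if no allocation is possible."""
    dedicated = range(1, D + 1)
    shared = range(D + 1, 64)
    if popcount(prefix_owned) < n:
        # First N allocations: any unused block, dedicated range preferred
        b = first_free(addr_in_use, dedicated)
        return b if b != -1 else first_free(addr_in_use, shared)
    if popcount(prefix_owned & SHARED_MASK) >= shared_limit:
        return -1                           # prefix has hit its shared-range limit
    return first_free(addr_in_use, shared)
~~~

The caller would then set the returned bit in both bitmaps; with D=31, N=2
and a shared limit of 32 blocks this gives the 34816-port ceiling used in the
example below.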
Now let's look at the characteristics of this algorithm when given an
aggressive sharing ratio of 16:1 for /64 prefixes to IPv4 addresses, and
D=31, N=2, L=3.
* Each /64 prefix gets a fixed IPv4 address from the consistent hash, and an
initial port range of 1024 ports in the dedicated area
* It may expand into a second range of 1024 ports in the dedicated area;
most of the time this will succeed. (Using a 16:1 sharing ratio,
sometimes more than 16 networks will land on the same IPv4 address, but it
is likely that some of them are low-utilisation networks which won't go
beyond their initial 1024 ports)
* Beyond this it may use up to an additional 32768 ports, but these may be
shared with other "busy" networks
* Busy networks can use up to 34816 ports in total
* If there are multiple busy networks using the same IPv4 address, there may
be overlap in their port usage.
* Networks which continue to use the Internet are likely to have continuity
in their assigned port ranges
A less aggressive approach might target (say) a 4:1 ratio with different
parameters.
The upshot of this is that activity from IPv4 address X port Y can be mapped
with a good degree of certainty to a small number of IPv6 source /64
prefixes, usually one.
This gives a "pseudo A+P" architecture, where we have not statically
allocated exactly one customer to each port range, but the number of users
per port range is small.
### UDP and TCP
Separate port range assignments could be maintained for UDP and TCP for each
/64 prefix, so that heavy port pressure from one protocol does not
unnecessarily consume port ranges for the other protocol.
If desired, the stage 1 translator could even pass different parameters in
the portsel field for UDP and TCP, for different port allocation strategies.
The dedicated/shared split could also be different per protocol.
On the other hand, it would be simpler to use the same set of port ranges
for UDP and TCP; it halves the size of the data structures, and avoids
confusion in cases of querying "who was using port X?" when it's not clear
which protocol was involved.
## Logging and traceability
For abuse tracking and law enforcement purposes, it is necessary to be able
to trace activity back to the source IPv6 address. It is assumed this can
be weakened to "the source /64 network", since if it were an IPv4 network it
would probably be sharing a single IPv4 address with NAT44 anyway.
The algorithms outlined above make it unnecessary to have copious
per-session logs.
The only logging necessary is to note each active IPv6 /64 prefix seen, along
with the translated IPv4 address and port range(s) used by that prefix.
This could be done separately for each period of (say) 24 hours. For that
period you would record:
* Each active IPv6 /64 prefix seen
* The translated IPv4 address
* The port range(s) used - e.g. as a 63-bit bitmap with one bit for each
range of 1024 ports
If each of these records takes 32 bytes, then 64M active prefixes would only
take 2GB, and this would be flushed to disk once per day.
A query regarding suspicious activity from source address X and source port
Y (at time T) could then be mapped back: often to a single source, sometimes
2, very rarely 3 or more, without expensive logs.
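For illustration, one possible 32-byte record layout and the corresponding
reverse lookup (the exact field layout is an assumption; only the three
fields listed above come from the design):

~~~python
import struct
import ipaddress

# One 32-byte record per active /64 prefix per period (illustrative layout):
#   8 bytes  upper 64 bits of the IPv6 prefix
#   4 bytes  translated IPv4 address
#   8 bytes  bitmap of 1024-port ranges used (bits 1-63)
#   12 bytes spare/padding (e.g. timestamps, counters)
RECORD = struct.Struct("!Q4sQ12x")

def pack_record(prefix64: str, v4: str, range_bitmap: int) -> bytes:
    p = int(ipaddress.IPv6Network(prefix64).network_address) >> 64
    return RECORD.pack(p, ipaddress.IPv4Address(v4).packed, range_bitmap)

def matches(record: bytes, v4: str, port: int) -> bool:
    """Could this record explain traffic seen from (v4, port)?"""
    _prefix, addr, bitmap = RECORD.unpack(record)
    return (addr == ipaddress.IPv4Address(v4).packed
            and bool((bitmap >> (port // 1024)) & 1))

rec = pack_record("2001:db8:1234::/64", "192.0.2.100", 1 << 38)  # uses block 38
print(len(rec), matches(rec, "192.0.2.100", 39000))              # 32 True
~~~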
It is hoped that the possibility of more than one upstream match will not be
a problem in practice. If there are two possible sources, then the
investigation needs to try them both. Often the context will quickly make
it clear which is the right one. The culprit may claim "reasonable doubt"
but it's only necessary to exclude the other one.
## Configuration
* All of the stage 1 servers need to know what IPv4 prefixes are available
in the entire cluster. It's important that this information is
*consistent* across all stage 1 servers, so that each server
chooses the same IPv4 address for each source IPv6 prefix
* The network infrastructure needs to know which IPv4 prefix(es) to route
to each stage 2 server.
* The stage 2 servers need to be configured with the cluster's public IPv6
prefix, so that in step (6) they can construct a source address directly
without having to return traffic via the stage 1 servers for another
translation.
There are several ways this could be done. The simplest way would be to
statically configure the list of IPv4 prefixes on each stage 1 server, and
use static routes to direct traffic to the stage 2 servers. This could use
any bulk system administration tool or a shared configuration directory.
Better would be to statically configure IPv4 prefixes on the stage 2
servers and have them announce both the IPv4 and mapped IPv6 versions to the
network; the stage 1 servers could then learn the full set of IPv4 prefixes
from the mapped IPv6 route announcements.
Yet another way would be to have full central control, assign each server
just a loopback address, and distribute all the routes in iBGP from a
central control node (in the same way that some ISPs distribute customer
static routes).
~~~
STAGE 1 STAGE 2
+---+ +---+
RTRS ..... | |+ ... | |+ ..... RTRS
^ +---+|+ +---+|+ ^
| +---+| +---+| |
| +---+ +---+ |
\ ^^^ /
\ \\\ /
\ +----------+ /
`------- | route |------'
|reflectors|
+----------+
^
| iBGP
control
node
~~~
Routes distributed via iBGP:
* fd00:0000:c000:0200::/58 next hop=S2a loop6
* fd00:0000:c000:0240::/58 next hop=S2b loop6
* fd00:0000:c000:0280::/58 next hop=S2c loop6
* fd00:0000:c000:02c0::/60 next hop=S2c loop6
* 192.0.2.0/26 next hop=S2a loop4
* 192.0.2.64/26 next hop=S2b loop4
* 192.0.2.128/26 next hop=S2c loop4
* 192.0.2.192/28 next hop=S2c loop4
This approach would allow a central control panel to move IPv4 blocks
between stage 2 servers, without having to login to or reconfigure the
servers themselves. The stage 1 servers would learn the full set of
available IPv4 prefixes from the same announcements (i.e. the IPv6
intermediate prefixes routed to the stage 2 servers) and use this both for
routing traffic, and for learning the set of available IPv4 prefixes for the
consistent hash algorithm.
# Additional implementation options
## Mixed static and dynamic operation
It's perfectly reasonable to use a static mapping for certain specified IPv6
prefixes, and use dynamic mapping for everything else.
All that is needed is that any IPv4 addresses which are used for static
mapping are excluded from the consistent hashing algorithm.
In addition, static IPv6 mappings may be for mixed and overlapping prefixes,
for example a single host (/128) could have its own dedicated port range,
whilst other hosts in the same /64 could share a different range. This
would be implemented as a longest-matching-prefix rule.
## Source port selection and NAT type
Subject to any chosen port range, the translated source port is then
entirely up to the stage 2 NAT64. The essential constraint is that each UDP
or TCP session must have a unique tuple of (translated source address,
translated source port, destination address, destination port). The NAT64
has full freedom in the choice of translated source port, but the other
three values are fixed.
If a translated source port is dedicated to a particular tuple of (original
source address, original source port) then this makes a "cone NAT", and this
is able to support certain direct peer-to-peer traffic patterns (e.g.
STUN/ICE) which would be extremely helpful for certain applications.
However it increases the pressure on the limited pool of available source
ports.
Alternatively, source ports can be reused when talking to a different
destination address and/or destination port. This gives a "symmetric NAT"
behaviour, which does not support these peer-to-peer applications.
Some possible compromises include:
* Use symmetric NAT behaviour for TCP
* Use cone NAT for UDP, except when available ports are exhausted
* Use symmetric NAT only in the shared port range and only when there
are multiple prefixes using that range
## Positioning of destination IPv4 address
The examples so far have shown the translator advertising a /96 prefix with
the destination IPv4 address in the last 32 bits.
The NAT64 numbering scheme in RFC6052 allows the target IPv4 address to be
carried higher up in the IPv6 address than the last 32 bits. For this
application there seems no particular reason to do this. If it were done,
it would require moving the destination IPv4 address down to the last 32
bits anyway during the formation of the intermediate address.
## Use for NAT44
This architecture is also easily adapted for NAT44, by making the first
stage a stateless NAT46, using all 32 bits of the IPv4 source address for
the consistent hash, and transforming the source address to
`::ffff:x.x.x.x`. There is no need for the prefix selector in this case.
The stage 1 hosts will need both IPv4 and IPv6 interfaces. The stage 2
hosts can dump the return traffic directly into IPv4, through their
existing IPv4 interface.
# Scaling
As there is no shared state, additional stage 1 boxes can be added at will
(as either CPU or port bandwidth limits are approached). The only
requirement is that the network can distribute incoming traffic evenly
across them; this may be done by equal-cost multipath.
On the stage 2 boxes, there are two cases:
* If the sharing ratio on the IPv4 pools is becoming unacceptably high,
then additional IPv4 prefixes can be added - either by adding new servers
with these prefixes, or breaking up the new prefixes and distributing them
across existing servers. (This makes no difference to the source address
selection algorithm, as it doesn't care which server manages which address)
* If the load on the boxes is becoming unacceptably high, then additional
boxes can be added, either with their own new IPv4 prefixes, or by taking
some IPv4 addresses away from existing boxes.
Any such changes would result in the consistent hashing algorithm
redistributing a proportion of users onto new IPv4 addresses or servers, and
for those users, any ongoing sessions would be interrupted. Therefore it is
desirable to make such changes only occasionally, perhaps at well-known
(weekly?) maintenance times.
The ideal size of IPv4 pool per server would most likely be learned through
experience, and hence scaling of stage 2 could be done by adding new servers
with new IPv4 pools of the correct size, leaving other servers unchanged.
It is not necessary for the stage 2 servers to be homogeneous:
higher-performance servers can be given larger IPv4 pools than the others.
It is desirable for the stage 1 servers to be similar, but only because it
may not be easy to configure multipath load-balancing to weight traffic
to different destinations.
## Sizing calculations
Suppose we take the following back-of-envelope parameters:
* 16:1 IP sharing ratio
* 0.1Mbps average translated traffic per user
Then a /23 IPv4 block (512 addresses) would amount to about 820Mbps of
traffic, suitable for a stage 2 server with 1Gbps ports.
If we decided that the IP sharing ratio should be only 8:1, then the same
server would require a /22 IPv4 block (1024 addresses).
A server with 10Gbps ports could be assigned ten times as many IPv4
addresses - unless it were CPU or RAM-bound, in which case the number of
IPv4 addresses would be in proportion to its actual maximum processing
capability.
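The same back-of-envelope arithmetic as a sketch, with the traffic-per-user
and sharing-ratio figures above taken as assumptions:

~~~python
def pool_size_needed(port_gbps: float, mbps_per_user: float = 0.1,
                     sharing_ratio: int = 16) -> int:
    """Rough number of IPv4 addresses a stage 2 server can usefully own,
    given its port speed and the assumed per-user traffic and sharing ratio."""
    users = port_gbps * 1000 / mbps_per_user
    return int(users / sharing_ratio)

print(pool_size_needed(1))    # ~625 addresses: a /23 (512, ~820Mbps) fits a 1G server
print(pool_size_needed(10))   # ~6250 addresses for a 10G server
~~~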
Note that Stage 1 boxes are handling traffic in one direction only, and so a
single NIC would be equally utilised in both "in" and "out" directions.
Stage 2 boxes are handling traffic in both directions; hence if the expected
traffic is similar in both directions then separate ports for IPv4 and IPv6
would be beneficial.
In practice, the traffic-per-user figure would have to be learned by
experience, and may change over time as different types of users start using
the NAT64 service. Having (say) 8,000 users (= 512 addresses x 16) going
through one server would hopefully smooth out most peaks; however it's
possible that if a handful of high-volume users suddenly make big demands
this would result in a spike, and therefore it may be wise to aim for a
lower steady-state peak.
Servers do not have to be given equal-sized or contiguous pools. Indeed it
is desirable to break addresses down into smaller chunks to give more
fine-grained control. For example the 1G server could initially be given 32
x /28 blocks and the 10G server 320 x /28 blocks; later on the 1G server can
have individual /28 blocks added or taken away (moved to other servers).
This minimises the impact on end-users, as the consistent hashing change
will only affect a small proportion of the users on that server.
## Consistent hash table implementation
The size of the consistent hash data structure increases with the size of
the IPv4 pool, and has to be available for each stage 1 server to consult.
To take a moderately large scenario, let us consider:
* A total IPv4 space of /12 equivalent (1M addresses; at 16:1 this is
enough for 16M concurrent users)
* M=256, i.e. each address appears 256 times on the CH ring
* An efficient data structure taking an average of 16 bytes per entry.
(Each entry maps a 64-bit hash key to a 32-bit IPv4 address, making 12
bytes, and there will be overhead in the data structure too. However
the key space is well balanced, and an efficient implementation would
be able to share prefixes of the key)
This would require a total data structure with 256M entries and memory usage
of 4GB, which is certainly feasible.
Given such a large pool of address space, M=256 may not be necessary; M=64
may give sufficiently even balance.
With a total IPv4 address space of /8 (surely more than a single cluster
would ever have!) and M=64, the RAM requirement is still only 16GB.
The data structure would have to be designed for efficient storage and
lookup, but this is a well-explored area, and if the speed of lookup is
the limiting factor then more stage 1 boxes can be added.
The essential requirement is to be able to search for a particular key, and
locate the key/value immediately preceding that key. Judy Arrays may be a
suitable choice: `judyl` maps a 64-bit index to a 64-bit value, and the
function `JLL()` will locate the last index equal to or less than the one
given.
(TODO: prototype judyl and measure its memory usage and lookup performance.
[This article](http://preshing.com/20130107/this-hash-table-is-faster-than-a-judy-array/)
suggests average memory usage at or below 14 bytes per item, and lookup
times of around 500ns on a Core 2 Duo, as long as the cache is not under heavy
external pressure)
## Processing
At an average packet size of 512 bytes, 820Mbps of traffic is 200K packets
per second. Although it would be cost-effective if a single box could
achieve this, the horizontal scalability makes this moot.
Also, as the size of the user base increases, the rate at which sessions are
created and destroyed goes up. This is also divided across the available
stage 2 boxes and can be scaled accordingly.
## Equal-cost multipath
Since additional stage 1 and stage 2 boxes can be added as required, the
remaining scaling limitation is the ability of the network to distribute
incoming traffic amongst a large number of stage 1 boxes. Existing network
devices may have inherent limits as to the number of destinations they may
distribute between.
This could be addressed by having multiple tiers of routing: e.g. tier 1
distributes amongst N routers, each of which in turn distributes amongst M
destinations.
Note that only a single IPv6 prefix needs to be handled in this way: the
translator cluster's overall IPv6 prefix (e.g. NPFX::/96), or possibly a
small number of prefixes if the cluster supports multiple IPv6 prefixes.
## Interconnect
The interconnect (switched and/or routed) between stage 1 and stage 2 has to
be able to carry the entire traffic volume. Since this is unicast, and no
more than the total traffic entering the translator cluster, this is no harder
to build than delivering the required traffic to the translator cluster in the
first place.
For Internet-scale deployment, there would be multiple, independent
translator clusters dotted around the Internet. This is the subject of a
[separate paper](candler-interconnecting-the-internets.md.html).
# Management
## Failover
If a stage 1 box fails, traffic will simply be redistributed over the other
stage 1 boxes (as soon as the network load balancer detects this) and there
will be no impact.
If a stage 2 box responsible for a particular IPv4 range fails, then traffic
for those users will be redistributed across the remaining IPv4 address
space by the consistent hashing algorithm. This will keep the cluster
balanced, but will interrupt any ongoing sessions for those users.
Alternatively, it would be possible to run servers in pairs: one server is
primary for block A and backup for block B, and the other is primary for
block B and backup for block A. While they run they keep their state tables
in sync for both ranges, so that if one fails, the other can take over
immediately. The OpenBSD "pfsync" mechanism provides an example of how this
can be implemented.
This mechanism may make such failures less noticeable, at least at off-peak
times when servers are below 50% capacity; it could also be useful for
performing maintenance. In practice, unscheduled failures may be
sufficiently rare for this not to be a problem.
## Scheduled maintenance
For scheduled maintenance, all that is necessary is to have a few spare
stage 2 hosts, and to be able to sync the to-be-maintained host's NAT64
state with a spare host, before failing over.
Since the traffic cannot be swung instantaneously, ideally the states should
remain in sync bi-directionally while the IPv6 traffic (stage1 to stage2)
and external IPv4 traffic (Internet to stage2) is rerouted.
## Changes of IPv4 pools
When IPv4 pools change, some existing sessions will need to be interrupted.
It would be helpful if the stage 2 translator could send a RST for existing
TCP sessions when IPv4 pools are removed from it (and then remove its state
entries), unless it has a failover partner.
# Security considerations
## Direct use of intermediate addresses
If an end-user were able to send traffic to the stage 2 intermediate address
prefix, they would be able to select an arbitrary IPv4 source address
(and/or port) for their outgoing traffic.
Hence this should be blocked, for example by ACLs at the edge, or by making the
stage 1 to stage 2 interconnect a completely separate routing domain. Using
ULA addresses is also helpful for this reason.
Note that if a single AS contains multiple translator clusters, it would be
wise for each cluster to use a distinct intermediate prefix (especially if a
single iBGP mesh includes all translators)
## Denial of Service
Any NAT device is sensitive to DoS, particularly explosion of the state
table, and the stage 2 NAT64 in this design is no different.
### Spoofed token (lower 64 bits)
IPv6 allows the sender to choose any of 2^64 possible source addresses
within a prefix. This is a fundamental feature of the current IPv6
addressing architecture.
So whilst it would be desirable to keep statistics on utilisation for each
individual /128 address, if an attacker wants to hide her usage she can
simply continue to pick random source addresses until the NAT is no longer
able to keep track.
She can also respond to traffic to all those addresses, e.g. to complete a
3-way TCP handshake. The NAT therefore has no way to distinguish between
genuine and spoofed traffic.
To protect itself, the NAT will need to limit state generation at the level
of the /64 prefix, which means the attacker will be performing a DoS against
other users on her own network. This can only be traced by the local
network administrator, e.g. by looking at NDP tables.
### Spoofed prefix (upper 64 bits)
Unfortunately, many service providers do not have ingress filters to prevent
source address spoofing, and so the incoming source addresses arriving at
the translator may be completely arbitrary.
The problems this can cause include:
* Creation of many useless translation state entries
* Exhaustion of source ports
* False logging of IPv6 prefixes as "active", and thus junk mappings of
IPv4 address/port to spoofed IPv6 addresses
To avoid false logging and allocation of a port range to a spoofed prefix,
IPv6 prefixes should only be marked "active" after at least one successful
three-way TCP exchange.
To avoid the useless state entries and source port exhaustion, the stage 2
NAT may need to engage some mechanism similar to "SYN cookies" so that
long-lived NAT state is not created until after a successful three-way TCP
exchange.
UDP traffic cannot be protected in this way, as we have no way of knowing
whether return UDP traffic was successfully delivered or not. More
heuristic methods may be required.
It could be said that few devices would use UDP without any TCP at all;
therefore the successful establishment of TCP from a given IPv6 address
could whitelist that address for UDP as well. However if an attacker
obtains or guesses a valid source IPv6 address then they can spoof traffic
which is indistinguishable from genuine traffic from that address. It may
therefore also be necessary to limit the rate of UDP state creation or the
total number of UDP states per source.
Some devices (e.g. SIP phones) may use UDP exclusively - although SIP is
unlikely to work well with NAT64 anyway. If we allow the successful
establishment of TCP from anywhere in a /64 prefix to whitelist the whole
prefix for UDP, this is unlikely to be a problem.
### State and port exhaustion
Even without address spoofing, a client can create a large number of TCP
sockets and a large number of UDP sockets, and consume resources on the
translator.
If a cap is set at the limit of the /64 prefix, then the user will be able
to perform a DoS against other users in their own network.
If a cap is set at the limit of the /128 address then this can be avoided,
however the attacker can easily circumvent this by choosing different source
addresses as described above.
### Long-lived TCP sessions
TCP sessions can hold state for an extended period of time, especially if
the client or server vanish, and may increase utilisation towards the cap.
Hence stale sessions must be pruned, at least in times of high demand.
(TODO: can the NAT64 inject TCP keepalives even if the endpoints themselves
are not using them?)
### UDP sessions
If a client binds to one socket and sends to many destinations, we SHOULD
use the same translated source port, so that STUN/ICE can work. However if
there is much churn of client sockets, there could be much pressure on the
available port space, and the translator may have to fall back to shared port use
(symmetric NAT).
It is probably realistic to time out UDP translations after 30-60 seconds of
inactivity. Clients have an expectation of having to refresh NAT state -
although if they are on an IPv6-only network they may not realise that some
of their traffic is going via a NAT64.
Ports could be re-used in a LRU order, but this would make problems harder
to debug - it is probably better to have a fixed UDP timeout.
### DNS
A common source of UDP state is DNS. There is no good reason for anyone to
use NAT64 to translate DNS queries. An IPv6-only user should be talking to
a DNS(64) cache over IPv6, and that cache should be dual-stack. Anything
else is misconfiguration.
Therefore it is perfectly reasonable to block UDP and TCP port 53 entirely
at the translator - or to return a canned DNS response with the fixed IP
address of a webserver which explains the problem, essentially a captive
portal.
### ICMP
It is helpful for the NAT64 to work with ICMP echo request. This would mean
that an end-user with a CLAT would be able to do "ping 8.8.8.8" and get a
response - this means "The Internet Is Working [TM]".
Such state can be very short-lived (of order of 5 seconds) and the number of
concurrent states from a given prefix can be limited, and/or traffic heavily
rate limited.
### Small packets
A system tuned to handle a certain traffic volume under the assumption of an
average packet size of (say) 512 bytes per packet may become overwhelmed
given a stream of small packets of 64 bytes, as this will demand 8 times the
processing.
### Statistics
There should be statistical monitoring of both traffic (bps and pps) and
state generation from active prefixes (aggregated per /64, per /56, per /48,
and per /32), and the ability to apply temporary blocks if required.
## Issues with dynamic A+P
A static A+P deployment will explicitly tie each address/port combination
back to one source, but this may not always be true for dynamic A+P.
### Selection of source IPv4 address
An attacker with a /48 route from their ISP can choose whichever public IPv4
address from the pool they want (or at least, any one of 65,536 choices),
simply by rotating through their 2^16 available prefixes. If the hash
algorithm and IPv4 pool ranges are public they can even do this off-line.
This could be used to make an attack appear like it is coming from multiple
sources, when they are in fact the same source; it can divide a large volume
of traffic into 2^16 smaller streams.
If the target queries each of the IPv4 addresses from the translator logs,
they will find that all those addresses include mappings to prefixes within
the same /48 range, and may be able to infer the true source of the attack.
Prefix selection can also be used to purposely make the attacker's traffic
come from the same IPv4 address used by a different, trustworthy network.
To some extent this is inherent to the concept of address sharing, but in
this case the attacker is allowed to select their sharing partner.
If the translator cluster has less than a /16 of address space in total then
the attacker will be able to find multiple prefixes which map to the same
IPv4 address and consume an unfair share of dedicated ports on that address.
To address this issue, we could consider using only the first 48 bits of the
source IPv6 address in the consistent hash algorithm. The problem this
would cause is when there is a genuine large site with a /48 block (say, a
university): we do not want every single network in the university to map to
the same IPv4 address, as this would create excessive load in a single stage
2 server (traffic load, state, and demand on available source ports).
We could also consider using 56 bits, given that many ISPs are allocating
/56's to end users. Such a user would have no choice over their translated
IPv4 address, and a user with a /48 would only be able to choose between 256
of them. However a large network like a university might have to reorganise
their prefixes to distribute load among those 256 available translated
addresses.
This point remains open to discussion, but from a basic engineering point of
view it is still preferable to use 64 bits of the prefix to give an even
distribution of addresses for larger client networks.
In any case, users may easily obtain additional /48 blocks (e.g. from
tunnel brokers) or even a /32 or more by joining a RIR. At worst there is
always the option of blocking traffic from any ranges causing persistent
abuse. If there is a need for them, RBL-style blacklists for IPv6 will
spring up.
### Selection of source port
The port selection algorithm is designed to allow "busy" networks to make
use of a large number of ports in a shared range.
An attacker can easily open multiple sockets bound to multiple addresses and
create as many ports as they wish. This would be a denial-of-service
against their own network.
If the particular IPv4 address has one or more legitimate, "busy" networks
on it, then the attacker may end up using some port ranges which are shared
with those networks. This would be intended to mislead an investigation.
However at worst it would only increase the number of leads which have to be
followed - the information that the attacker was using a particular port
range would not be lost, only that there are multiple possible users of that
port range.
Even with a 16:1 sharing ratio, at worst 16 networks would be
using the same extended port ranges; in practice it is likely to be far
lower.
### Dense networks
In some cases a single /64 prefix may be supporting many more than 250
devices (e.g. a large hotel, or a conference wireless network). In this
translator design, they will all be mapped to the same IPv4 address and so
will be sharing a single port range, which will suffer severe pressure - as
indeed happens today if the hotel built their network with NAT44 and a
single public IPv4 address.
It would be better engineering if the hotel were to divide their network
into subnets, which would spread the load across multiple IPv4 addresses,
or even route a separate /64 to each room.
The mitigating factor here is that if the hotel has built a pure IPv6-only
network, then at least connections to dual-stack destinations will continue
to work just fine, even if IPv4 ports are exhausted in the translator,
rather than suffering a total network collapse.
# References
TODO: Use the proper NAT64 terminology throughout
* RFC1071/1141/1624: Checksum algorithm
* RFC5245: ICE
* RFC5389: STUN
* RFC6052: IPv6 Addressing of IPv4/IPv6 Translators
* RFC6144: Framework for IPv4/IPv6 Translation
* RFC6145: IP/ICMP Translation Algorithm
* RFC6146: Stateful NAT64
* RFC6147: DNS64
* RFC6296: IPv6 NPT (NAT66)
* RFC6877: 464XLAT
* RFC7225: NAT64 prefix discovery with PCP
* RFC7269: NAT64 deployment options and experience
* Judy arrays:
* NAT types: