**An Architecture for Hyperscale NAT64**

Brian Candler, DRAFT 2016-03-14

# Abstract

This document outlines a design for a NAT64 translator cluster which can
scale to arbitrary traffic volumes, as would be required for translating a
non-trivial proportion of Internet traffic.

# Motivation

Commercial large-scale or "carrier-grade" NAT64 implementations are
available, but they are proprietary, expensive, and of uncertain
scalability.

The design outlined here has the following characteristics:

* Horizontally scalable: simply add more boxes as traffic, CPU and/or RAM
  limits are hit
* Add more IPv4 address pools as concurrent client usage increases
* Low disruption to existing flows as pools are grown
* Deterministic selection of translated IPv4 address and dynamic selection
  of port range
* Minimal logging of translated flows required for abuse tracking and law
  enforcement
* Suitable for implementation on commodity hardware

# High-level design

## Two-layer architecture

A collection of hosts logically forms a single "translator cluster", which
translates IPv6 traffic addressed to NPFX::z.z.z.z into IPv4 traffic to
address z.z.z.z (where NPFX is the chosen IPv6 prefix for the cluster).

Internally, the cluster comprises a number of hosts grouped in two stages: a
NAT66 stage and a NAT64 stage. (Combining both functions into a single host
is perfectly possible, but not considered here.)

~~~
 s USER:a              s USER:a
 d NPFX::z             d STG2:x:y:z          s x (IPv4)
                                             d z (IPv4)
              +---+ ,----------------->+---+
 ------------>|   |--.,--------------->|   | <---------->
              +---+  \ \               +---+
              +---+   \ `------------->+---+
 ------------>|   |----\-------------->|   | <---------->
              +---+     \              +---+
             STAGE 1     `------------>+---+
                                       |   | <---------->
                                       +---+
                                      STAGE 2
~~~

The first stage is a *stateless*, deterministic selection of IPv4 source
address as a function of the IPv6 source address (USER:a). All stage 1 hosts
are identical. The destination IPv6 address is rewritten to an intermediate
form which contains both the selected IPv4 source (x) and the target
destination IPv4 address (z). These intermediate addresses lie within an
internal prefix STG2::/32.

IPv6 routing then naturally delivers the packet to the correct stage 2 box,
which announces the prefix covering STG2:x::/64.

The second stage is a traditional *stateful* NAT64, except that the source
IPv4 address (x) has already been selected and is extracted from the IPv6
destination address. Each stage 2 box "owns" its own range or ranges of IPv4
addresses, and regular IPv4 routing is used to ensure the return traffic
hits the right box. After conversion back to IPv6, the return traffic can be
dumped straight back into the IPv6 network, and need not touch the stage 1
boxes again.

Scalability is therefore achieved by:

- building sufficient stage 1 boxes for the source selection algorithm;
- breaking up the total IPv4 pool into small enough pieces that a single
  stage 2 box can handle the required proportion of overall traffic.

## Format of intermediate addresses

The intermediate addresses are rewritten inside a /32 IPv6 prefix. This need
not be globally routable: ULA addresses are fine.
The example here uses fd00::/32. These addresses are structured as follows:

~~~
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| STG2 prefix   | IPv4 source   |chkcomp|portsel| IPv4 dest     |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
  fd  00  00  00
    (example)
~~~

The field "chkcomp" is used to compensate for changes in the upper layer
checksum which would otherwise be required, using the algorithm in RFC6296.

The field "portsel" may be used to influence stage 2 translator operation,
and is described later but is not required for basic NAT64 operation.

Both fields are shown as zero in the following exposition.

## Example of operation

Let's say we have three stage 2 servers:

* S2a has IPv4 pool 192.0.2.0/26
* S2b has IPv4 pool 192.0.2.64/26
* S2c has IPv4 pools 192.0.2.128/26 and 192.0.2.192/28 (perhaps because it
  is a newer and more powerful box, and we wish it to take a higher share of
  traffic)

This gives a total of 208 IPv4 translated addresses available.

Each S2 server announces an IPv6 prefix or prefixes with the IPv4 prefix(es)
it owns shifted left 64 bits (i.e. in bits 32-63 of the IPv6 address). These
announcements are made towards the S1 servers.

* S2a announces `fd00:0000:c000:0200::/58`
* S2b announces `fd00:0000:c000:0240::/58`
* S2c announces `fd00:0000:c000:0280::/58` and `fd00:0000:c000:02c0::/60`

To clarify the S2b route:

~~~
fd00:0000 : c0       00       : 02       40       : 0000:0000:0000:0000
            11000000 00000000   00000010 01000000
---------------------------------------->| /58

is equivalent to

fd00:0000 : 192      . 0      . 2       . 64      : 0000:0000:0.0.0.0
            11000000 00000000   00000010 01000000
---------------------------->| /26
~~~

Let's assume the IPv6 translator prefix announced to the public IPv6
Internet is 2001:db8:6464::/96. Now consider what happens when an IPv6
packet arrives with source 2001:db8:1234::5 and destination
2001:db8:6464::203.0.113.99.

1. The packet arrives at any one of the stage 1 servers
2. The stage 1 server looks at the source IPv6 address, and uses this to
   select one of the 208 available IPv4 addresses deterministically. Let's
   say it picks 192.0.2.100
3. The stage 1 server rewrites the destination IPv6 address as
   `fd00:0000:192.0.2.100::203.0.113.99`, more correctly represented as
   `fd00:0000:c000:0264::203.0.113.99`, and sends it out.
4. This packet arrives at whichever stage 2 server is announcing the
   corresponding IPv6 prefix, which in this case is S2b. (Observe that
   `fd00:0000:c000:0264::203.0.113.99` is within `fd00:0000:c000:0240::/58`)
5. S2b goes through a regular NAT64 process, but choosing the IPv4 source
   address from bits 32-63 of the destination IPv6 address. It sends out an
   IPv4 packet with source 192.0.2.100 and destination 203.0.113.99, and
   creates internal state. This packet enters the IPv4 Internet.
6. The return packet arrives back at S2b, is looked up in the state tables,
   untranslated to source 2001:db8:6464::203.0.113.99 and destination
   2001:db8:1234::5, and sent on its way to the IPv6 Internet.
7. Future IPv6 packets for this session can hit any stage 1 box, but as long
   as the address selection algorithm is consistent, they will be directed
   to the same stage 2 box as before, which is where the correct state is
   stored.

The stage 2 operation is slightly different to standard NAT64 in two ways.
Firstly, it needs to extract a specified source IPv4 address from the
destination IPv6 address (step 5). Secondly, it constructs response packets
using the translator's public prefix (step 6).
This latter step is an optimisation to avoid packets returning via the stage 1 translators. The above example is only slightly simplified: it ignores the checksum compensation (in step 3); it assumes that source port selection is entirely at the discretion of the stage 2 box (in step 5); and that the cluster has only one public IPv6 translator prefix (in step 6). Those details will be presented next. ## Checksum compensation Bits 64-79 of the translated destination address (the "chkcomp" field) are set using the algorithm in sections 3.2 and 3.5 of RFC6296, to make the overall stage 1 translation checksum-neutral. This is a simple calculation which avoids any further need for the stage 1 translator to inspect or update the upper-level protocol checksum (e.g. TCP/UDP). ## Multiple translator prefixes It is possible the translator cluster operator will want to serve multiple IPv6 NAT64 prefixes using the same cluster. Examples are: * It might want to announce both its own unique block `2001:db8:6464::/96` and an anycast block like `64:ff9b::/96` * In the event of one translator cluster failing totally or being taken off-line, it may be desirable for a different translator cluster to announce the prefix of the failed one in addition to its own This can be easily handled, but the stage 2 translator needs to know which prefix was used for any particular session, so that the return traffic can be rewritten with the right IPv6 source address applied (in step 6 of the worked example). The mechanism proposed is to use the lower two bits from the "portsel" field for this (bits 80-95), which is the remaining 16-bit field in the intermediate translated address. ~~~ +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | | P P | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ 80 94 95 ~~~ Both the stage 1 and stage 2 translators are configured with a static mapping table of up to four public IPv6 prefixes. The stage 1 translator inserts the P bits in the intermediate address, dependent on the original destination IPv6 address seen (before it is rewritten into the intermediate form). The stage 2 translator records them along with the session state; and when incoming IPv4 packets are received, uses this state to select the appropriate IPv6 source prefix for return packets. (An alternative would be to use high bits from the STG2 prefix, e.g. bits 16-31) # Static A+P configuration Some ISP/carrier configurations may wish to use an explicit static mapping from IPv6 prefixes to IPv4 source address and port range. This can be configured simply using a static prefix lookup table, distributed to each of the stage 1 servers. The stage 1 server selects both an IPv4 address and a port range for each IPv6 prefix. The remaining "portsel" bits in the intermediate IPv6 address are used to signal the selected port range to the stage 2 host. The exact use of these bits is up to the design of the translator cluster, but here is a simple approach: ~~~ +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ | B B B B B B | 0 0 | N N N N N N | P P | +---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+ ~~~ A value of all zeros in all of the first 14 bits (B=0, N=0) means that the stage 2 translator is permitted to use the whole range of source ports (1024-65535). Otherwise it breaks down as: * B: concatenate with 10 zeros to find start of port range (B>=1) * N: size of port range is N*1024 ports (N>=1) Ports 0-1023 are never permitted. 
If the end of the port range would extend beyond 65535, it is truncated.
This allows port ranges of between 1024 ports and the full range.

Example:

~~~
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 1   0   0   1   1   0   0   0 | 0   0   0   0   1   0   0   0 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
~~~

* Port range is 0x9800 - 0x9fff (38912 - 40959): 2048 ports
* (Prefix selector = 00)

# Dynamic A+P operation

For Internet-scale use it is not practical to configure explicit per-client
mappings. A zero-configuration approach is described here.

## Consistent hashing

The approach by which a stage 1 server selects an IPv4 address to use is
called *consistent hashing*. It works as follows.

1. Define a hash space; let's say a 64-bit space. You can consider this as a
   circle which runs from 0 to 2^64-1 and then wraps around to 0.
2. Expand every available IPv4 prefix into individual IPv4 addresses, N in
   total. Each IPv4 address is hashed into this hash space M times with M
   different seeds. We therefore have N x M points on this circle, each
   point associated with a single IPv4 address.
3. (Part of) each IPv6 source address is also hashed into a 64-bit value,
   giving another point on the circle.
4. To find the IPv4 address to use, move back along the circle until you
   find one of the points made in step (2). The IPv4 address associated with
   this point is the one to use.

The factor M provides better distribution of load and re-distribution of
hashes if an IPv4 address is removed (or a new IPv4 address is added). With
M=256, each IPv4 address covers 256 random segments of the circle. If the
address is lost, then users who were assigned to those segments will instead
pick up the points earlier on the circle, i.e. will be spread across
approximately 256 other IP addresses. Similarly, if a new address is added,
it will pick up a small proportion of users from other addresses. Crucially
however, the *majority* of users continue to use their existing IPv4 address
unchanged, and hence their NAT64 sessions are unaffected.

The following example shows prefix 192.0.2.0/30 (4 IPv4 addresses), M=4, and
a 16-bit hash space:

~~~
*0fee [192.0.2.0]            [192.0.2.3] ff7c*
*3559 [192.0.2.1]            [192.0.2.2] f9ee*
*3d95 [192.0.2.1]            [192.0.2.3] f763*
*403e [192.0.2.3]            [192.0.2.0] e482*
*4ace [192.0.2.1]            [192.0.2.2] e1c5*
*4cdd [192.0.2.3]            [192.0.2.0] e0c2*
*5e23 [192.0.2.1]            [192.0.2.2] e0bc*
*bd91 [192.0.2.0]
*d154 [192.0.2.2]
~~~

These values were obtained by taking the first 16 bits of the MD5 hash of
each of the following strings:

* "192.0.2.0|0" = bd91
* "192.0.2.0|1" = e482
* "192.0.2.0|2" = e0c2
* "192.0.2.0|3" = 0fee
* "192.0.2.1|0" = 4ace
* "192.0.2.1|1" = 3d95
* "192.0.2.1|2" = 3559
* "192.0.2.1|3" = 5e23
* "192.0.2.2|0" = d154
* "192.0.2.2|1" = f9ee
* "192.0.2.2|2" = e1c5
* "192.0.2.2|3" = e0bc
* "192.0.2.3|0" = f763
* "192.0.2.3|1" = 4cdd
* "192.0.2.3|2" = 403e
* "192.0.2.3|3" = ff7c

Now suppose we get a packet with source address "2001:db8:1234::5". This
hashes to 371c. The point in the circle before this is 3559, and therefore
the assigned IPv4 address is 192.0.2.1.

The actual hash doesn't really matter, as long as it gives a reasonable
random spread.
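To make the selection concrete, here is a minimal Python sketch (not part of
the original design; the function names are illustrative) which rebuilds the
16-bit MD5 ring above, performs the predecessor lookup, and then shows the
stage 1 rewrite into the intermediate fd00::/32 form with chkcomp and
portsel left at zero. Exactly which bytes of the IPv6 source are hashed is a
design choice (see the next section); here the source is masked to its /64
and hashed as text, so the value obtained for a given source need not match
the 371c quoted above.

~~~
# Illustrative sketch of the consistent hash lookup and the stage 1 rewrite.
import bisect
import hashlib
from ipaddress import IPv4Address, IPv4Network, IPv6Address


def h16(s: str) -> int:
    """First 16 bits of MD5, as in the worked example above."""
    return int(hashlib.md5(s.encode()).hexdigest()[:4], 16)


def build_ring(pools, m=4):
    """Place every IPv4 address in the pools onto the circle M times."""
    points = []
    for pool in pools:
        for addr in IPv4Network(pool):
            for seed in range(m):
                points.append((h16("%s|%d" % (addr, seed)), addr))
    points.sort()
    return [k for k, _ in points], [a for _, a in points]


def select_ipv4(keys, addrs, src6):
    """Hash the source /64 and move *back* around the circle to the nearest
    point; the IPv4 address attached to that point is the one to use."""
    prefix64 = IPv6Address(int(src6) & ~((1 << 64) - 1))   # mask to /64
    i = bisect.bisect_right(keys, h16(str(prefix64))) - 1  # -1 wraps to the end
    return addrs[i]


keys, addrs = build_ring(["192.0.2.0/30"], m=4)
v4src = select_ipv4(keys, addrs, IPv6Address("2001:db8:1234::5"))

# Stage 1 then rewrites the destination into the intermediate form:
#   STG2 prefix | selected IPv4 source | chkcomp=0 | portsel=0 | IPv4 dest
dst4 = int(IPv4Address("203.0.113.99"))
intermediate = IPv6Address((0xFD000000 << 96) | (int(v4src) << 64) | dst4)
print(v4src, intermediate)
~~~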
The example above turns out to be quite badly distributed: some ranges are
very large (5e23...bd91) and some tiny (e0bc...e0c2), and the overall
distribution of IPv4 usage is:

* 192.0.2.0: 0fee...3559, bd91...d154, e0c2...e1c5, e482...f763 = 19730 = 30.1%
* 192.0.2.1: 3559...403e, 4ace...4cdd, 5e23...bd91 = 27746 = 42.3%
* 192.0.2.2: d154...e0c2, e1c5...e482, f9ee...ff7c = 6073 = 9.3%
* 192.0.2.3: 403e...4ace, 4cdd...5e23, f763...f9ee, ff7c...0fee = 11987 = 18.3%

However, this is due to the small size of the example, and a larger factor M
improves the spread dramatically. In this case, using the same 4 IP
addresses but M=256 would give utilisation of each IP address of 24.32%,
24.19%, 25.16% and 26.34% respectively.

In practice, a higher-performing algorithm than MD5 would be used, at least
when processing the incoming IPv6 source addresses; and all the processing
would be done on binary rather than ASCII representations. A suitable hash
algorithm would give a good random spread and yet be efficient to implement
on 64-bit processors and/or directly in hardware.

(TODO: assess suitability of algorithms. MurmurHash3-128? CityHash64? Older
functions like CRC-64-ECMA, Fletcher-64? Note: Nehalem/i7/SSE4.2 has a CRC32
primitive on-chip!)

(TODO: if two addresses hash to *exactly* the same value, need to define
which takes precedence)

## Masked IPv6 source bits

It is not desirable to hash the whole 128 bits of the source IPv6 address
when selecting the IPv4 address to use. If we used all 128 bits, then the
users in any one network would be spread evenly over all the available IPv4
addresses; the use of periodically-changing privacy addresses would ensure
that one network eventually makes use of *all* available IPv4 addresses at
some point.

We can improve this by hashing only a portion of the source IPv6 address. If
we take a hash of the first 64 bits only, then all the users in one
particular network will map to the same public IPv4 address. This is in any
case what users expect when sitting behind a traditional NAT. Such mappings
would rarely change (only when IPv4 prefixes are added to or removed from
the translator cluster).

## Distribution of source networks

For a given population of users, the consistent hash will aim to spread
usage evenly over the available IPv4 addresses. For example, if there are 4
million users and 1 million IPv4 addresses, on average each address will be
in use by 4 users. Some addresses will be used by fewer, and some by more,
although the probability of an individual address being used by (say) 6 or
more users will be low.

TODO: [Do the math](http://stats.stackexchange.com/questions/43575/random-balls-in-random-buckets-what-are-the-characteristics-of-the-distribution)

## Source port requirements

Empirically, we know that a typical office or school network with NAT44
normally has a single public IPv4 address, and it works fine. If we take it
as good practice that a layer 2 broadcast domain (subnet) has up to 250
devices on it, then we believe those 250 devices happily share a range of
around 64,000 ports. If all were active at the same time, they would be
using 256 ports each on average; if only a quarter were active at the same
time then they would be happy with an average of 1024 ports each.

This also ties in with our experience of client devices: if you type
"netstat" on a client device it would be rare to see many hundreds of open
sockets.
This means that in principle, an individual end user or small home network
might be happy with an allocation of maybe as little as 1024 ports. However,
a larger office or school network (also a /64) may require much more.

## Source port selection algorithm

Our port allocation strategy has to allow for this, whilst ideally
maintaining separate port ranges for each user. Here is a proposed approach.

The port space is divided into 64 blocks of 1024 ports. Block 0 is reserved.
The remainder are split into two ranges, "dedicated" and "shared".

~~~
+-----------------+
|  63             |
| ...   SHARED    |
|  D+1            |
+-----------------+
|  D              |
| ...   DEDICATED |
|  1              |
+-----------------+
|  0   Reserved   |
+-----------------+
~~~

This is a static, system-wide split. For example, D=31 gives ports
1024-32767 in the dedicated range, and 32768-65535 in the shared range.

Dynamic port allocation is signalled using the following values in portsel
(the value would most likely be fixed in the stage 1 servers'
configuration).

~~~
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 0   0   0   0   0   0   L   L | N   N   N   N   N   N | P   P |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
~~~

* N = the number of 1024-port ranges which may be used in the dedicated
  range (or in the shared range if the dedicated range is full), for each
  distinct /64 prefix (N >= 1)
* L = the number of additional port ranges which may be used in the shared
  range, per distinct /64 prefix
    * 00 = none
    * 01 = (63-D)/4
    * 02 = (63-D)/2
    * 03 = 63-D, i.e. the entire shared range
* P = prefix selector (from before)

The algorithm works as follows.

* Each of the 63 port ranges is either "unused" or "in use".
* When activity is first seen from a new /64 prefix, it is assigned an
  unused port range from either the dedicated or shared ranges, preferring
  the dedicated range if available.
* When this port range is full, another is allocated in the same way, until
  the prefix has N allocations.
* Beyond that, additional allocations may only be made in the shared range
  (D+1 to 63), up to the limit defined by L.
* Allocations in the shared range prefer unused ranges, but may also be
  shared (preferring ranges which are currently allocated to the lowest
  number of prefixes). When given a choice of in-use ranges, select the
  range with the most free ports.

State can be conveniently represented using bitmaps in a 64-bit word:

* Each IPv4 address has a bitmap indicating which port ranges are in use
* Each active IPv6 /64 prefix has a bitmap indicating which ranges are in
  use by that prefix
* A fixed bitmask can be used to identify shared port ranges

These flags are "sticky"; that is, once a range has been allocated to a
prefix, it remains allocated. At the end of a period (suggested to be 24
hours) they are flushed to disk and reset, but remain set for any NAT
sessions which are still active at that point in time.
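To make the state handling concrete, here is a Python sketch of the portsel
decoding and range allocation described above, holding each bitmap in a
64-bit word. It is illustrative only: the class and function names are
assumptions, D=31 is taken from the example split, and the "most free
ports" tie-break, the sticky-flag flush and the per-protocol split are
omitted.

~~~
# Illustrative sketch of the dynamic port-range allocation state.
D = 31                                          # blocks 1..D dedicated
DEDICATED = ((1 << (D + 1)) - 1) & ~1           # bits 1..D (block 0 reserved)
SHARED = ((1 << 64) - 1) & ~((1 << (D + 1)) - 1)  # bits D+1..63


def decode_portsel(portsel):
    """Split the 16-bit dynamic portsel field: 000000 LL NNNNNN PP."""
    p = portsel & 0x3                           # prefix selector
    n = (portsel >> 2) & 0x3F                   # dedicated-range allowance
    l_bits = (portsel >> 8) & 0x3
    l_extra = {0: 0, 1: (63 - D) // 4, 2: (63 - D) // 2, 3: 63 - D}[l_bits]
    return n, l_extra, p


class Ipv4PortState:
    """Per translated-IPv4-address allocation state on a stage 2 box."""

    def __init__(self):
        self.in_use = 0              # bit b set: block b allocated to someone
        self.users = [0] * 64        # how many prefixes share each block
        self.by_prefix = {}          # /64 prefix -> bitmap of its blocks

    def allocate(self, prefix, n, l_extra):
        """Give this /64 one more 1024-port block, or None if at its limit."""
        mine = self.by_prefix.get(prefix, 0)
        held = bin(mine).count("1")
        if held < n:
            # First N blocks: an unused block, preferring the dedicated range.
            for region in (DEDICATED, SHARED):
                free = region & ~self.in_use
                if free:
                    return self._take(prefix, (free & -free).bit_length() - 1)
        if held < n + l_extra:
            # Further blocks: shared range only; prefer unused, otherwise the
            # shared block currently carrying the fewest prefixes.
            free = SHARED & ~self.in_use & ~mine
            if free:
                return self._take(prefix, (free & -free).bit_length() - 1)
            shareable = [b for b in range(D + 1, 64) if not (mine >> b) & 1]
            if shareable:
                return self._take(prefix,
                                  min(shareable, key=lambda b: self.users[b]))
        return None                  # limit reached: keep reusing held blocks

    def _take(self, prefix, block):
        self.in_use |= 1 << block
        self.users[block] += 1
        self.by_prefix[prefix] = self.by_prefix.get(prefix, 0) | (1 << block)
        return block                 # covers ports block*1024 .. block*1024+1023
~~~

With N=2 and L=3 (decoded to 32 extra shared blocks when D=31), a prefix
tops out at 34 blocks - the 34816-port ceiling used in the 16:1 example
which follows.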
Now let's look at the characteristics of this algorithm when given an
aggressive sharing ratio of 16:1 for /64 prefixes to IPv4 addresses, and
D=31, N=2, L=3.

* Each /64 prefix gets a fixed IPv4 address from the consistent hash, and an
  initial port range of 1024 ports in the dedicated area
* It may expand into a second range of 1024 ports in the dedicated area;
  most of the time this will succeed. (Using a 16:1 sharing ratio, sometimes
  more than 16 networks will be on the same IPv4 address, but it is likely
  that some will have low utilisation and won't go beyond their initial 1024
  ports)
* Beyond this it may use up to an additional 32768 ports, but these may be
  shared with other "busy" networks
* Busy networks can use up to 34816 ports in total
* If there are multiple busy networks using the same IPv4 address, there may
  be overlap in their port usage.
* Networks which continue to use the Internet are likely to have continuity
  in their assigned port ranges

A less aggressive approach might target (say) a 4:1 ratio with different
parameters.

The upshot of this is that activity from IPv4 address X port Y can be mapped
with a good degree of certainty to a small number of IPv6 source /64
prefixes, usually one. This gives a "pseudo A+P" architecture, where we have
not statically allocated exactly one customer to each port range, but the
number of users per port range is small.

### UDP and TCP

Separate port range assignments could be maintained for UDP and TCP for each
/64 prefix, so that heavy port pressure from one protocol does not
unnecessarily consume port ranges for the other protocol. If desired, the
stage 1 translator could even pass different parameters in the portsel field
for UDP and TCP, for different port allocation strategies. The
dedicated/shared split could also be different per protocol.

On the other hand, it would be simpler to use the same set of port ranges
for UDP and TCP; it halves the size of the data structures, and avoids
confusion when querying "who was using port X?" if it's not clear which
protocol was involved.

## Logging and traceability

For abuse tracking and law enforcement purposes, it is necessary to be able
to trace activity back to the source IPv6 address. It is assumed this can be
weakened to "the source /64 network", since if it were an IPv4 network it
would probably be sharing a single IPv4 address with NAT44 anyway.

The algorithms outlined above make it unnecessary to keep copious
per-session logs. The only logging necessary is to note each active IPv6 /64
prefix seen, along with the translated IPv4 address and port range(s) used
by that prefix.

This could be done separately for each period of (say) 24 hours. For that
period you would record:

* Each active IPv6 /64 prefix seen
* The translated IPv4 address
* The port range(s) used - e.g. as a 63-bit bitmap with one bit for each
  range of 1024 ports

If each of these records takes 32 bytes, then 64M active prefixes would only
take 2GB, and this would be flushed to disk once per day.

A query regarding suspicious activity from source address X and source port
Y (at time T) could then be mapped back: often to a single source, sometimes
2, very rarely 3 or more, without expensive logs.

It is hoped that having more than one possible upstream match will not be a
problem in practice. If there are two possible sources, then the
investigation needs to try them both. Often the context will quickly make it
clear which is the right one. The culprit may claim "reasonable doubt" but
it's only necessary to exclude the other one.

## Configuration

* All of the stage 1 servers need to know what IPv4 prefixes are available
  in the entire cluster. It's important that this information is
  *consistent* across all stage 1 servers, so that each server chooses the
  same IPv4 address for each source IPv6 prefix
* The network infrastructure needs to know which IPv4 prefix(es) to route to
  each stage 2 server.
* The stage 2 servers need to be configured with the cluster's public IPv6 prefix, so that in step (6) they can construct a source address directly without having to return traffic via the stage 1 servers for another translation. There are several ways this could be done. The simplest way would be to statically configure the list of IPv4 prefixes on each stage 1 server, and use static routes to direct traffic to the stage 2 servers. This could use any bulk system administration tool or a shared configuration directory. Better would be to statically configure IPv4 prefixes on the stage 2 servers, have them announce both the IPv4 and mapped IPv6 versions to the network; the stage 1 servers could then learn the full set of IPv4 prefixes from the mapped IPv6 route announcements. Yet another way would be to have full central control, assign each server just a loopback address, and distribute all the routes in iBGP from a central control node (in the same way that some ISPs distribute customer static routes). ~~~ STAGE 1 STAGE 2 +---+ +---+ RTRS ..... | |+ ... | |+ ..... RTRS ^ +---+|+ +---+|+ ^ | +---+| +---+| | | +---+ +---+ | \ ^^^ / \ \\\ / \ +----------+ / `------- | route |------' |reflectors| +----------+ ^ | iBGP control node ~~~ Routes distributed via iBGP: * fd00:0000:c000:0200/58 next hop=S2a loop6 * fd00:0000:c000:0240/58 next hop=S2b loop6 * fd00:0000:c000:0280/58 next hop=S2c loop6 * fd00:0000:c000:02c0/60 next hop=S2c loop6 * 192.0.2.0/26 next hop=S2a loop4 * 192.0.2.64/26 next hop=S2b loop4 * 192.0.2.128/26 next hop=S2c loop4 * 192.0.2.192/28 next hop=S2c loop4 This approach would allow a central control panel to move IPv4 blocks between stage 2 servers, without having to login to or reconfigure the servers themselves. The stage 1 servers would learn the full set of available IPv4 prefixes from the same announcements (i.e. the IPv6 intermediate prefixes routed to the stage 2 servers) and use this both for routing traffic, and for learning the set of available IPv4 prefixes for the consistent hash algorithm. # Additional implementation options ## Mixed static and dynamic operation It's perfectly reasonable to use a static mapping for certain specified IPv6 prefixes, and use dynamic mapping for everything else. All that is needed is that any IPv4 addresses which are used for static mapping are excluded from the consistent hashing algorithm. In addition, static IPv6 mappings may be for mixed and overlapping prefixes, for example a single host (/128) could have its own dedicated port range, whilst other hosts in the same /64 could share a different range. This would be implemented as a longest-matching-prefix rule. ## Source port selection and NAT type Subject to any chosen port range, the translated source port is then entirely up to the stage 2 NAT64. The essential constraint is that each UDP or TCP session must have a unique tuple of (translated source address, translated source port, destination address, destination port). The NAT64 has full freedom in the choice of translated source port, but the other three values are fixed. If a translated source port is dedicated to a particular tuple of (original source address, original source port) then this makes a "cone NAT", and this is able to support certain direct peer-to-peer traffic patterns (e.g. STUN/ICE) which would be extremely helpful for certain applications. However it increases the pressure on the limited pool of available source ports. 
Alternatively, source ports can be reused when talking to a different
destination address and/or destination port. This gives a "symmetric NAT"
behaviour, which does not support these peer-to-peer applications.

Some possible compromises include:

* Use symmetric NAT behaviour for TCP
* Use cone NAT for UDP, except when available ports are exhausted
* Use symmetric NAT only in the shared port range, and only when there are
  multiple prefixes using that range

## Positioning of destination IPv4 address

The examples so far have shown the translator advertising a /96 prefix with
the destination IPv4 address in the last 32 bits. The NAT64 numbering scheme
in RFC6052 allows the target IPv4 address to be carried higher up in the
IPv6 address than the last 32 bits.

For this application there seems to be no particular reason to do this. If
it were done, it would require moving the destination IPv4 address down to
the last 32 bits anyway during the formation of the intermediate address.

## Use for NAT44

This architecture is also easily adapted for NAT44, by making the first
stage a stateless NAT46, using all 32 bits of the IPv4 source address for
the consistent hash, and transforming the source address to
`::ffff:x.x.x.x`. There is no need for the prefix selector in this case.

The stage 1 hosts will need both IPv4 and IPv6 interfaces. The stage 2 hosts
can dump the return traffic directly into IPv4, through their existing IPv4
interface.

# Scaling

As there is no shared state, additional stage 1 boxes can be added at will
(as either CPU or port bandwidth limits are approached). The only
requirement is that the network can distribute incoming traffic evenly
across them; this may be done by equal-cost multipath.

On the stage 2 boxes, there are two cases:

* If the sharing ratio on the IPv4 pools is becoming unacceptably high, then
  additional IPv4 prefixes can be added - either by adding new servers with
  these prefixes, or by breaking up the new prefixes and distributing them
  across existing servers. (This makes no difference to the source address
  selection algorithm, as it doesn't care which server manages which
  address)
* If the load on the boxes is becoming unacceptably high, then additional
  boxes can be added, either with their own new IPv4 prefixes, or by taking
  some IPv4 addresses away from existing boxes.

Any such changes would result in the consistent hashing algorithm
redistributing a proportion of users onto new IPv4 addresses or servers, and
for those users, any ongoing sessions would be interrupted. Therefore it is
desirable to make such changes only occasionally, perhaps at well-known
(weekly?) maintenance times.

The ideal size of IPv4 pool per server would most likely be learned through
experience, and hence scaling of stage 2 could be done by adding new servers
with new IPv4 pools of the correct size, leaving other servers unchanged.

It is not necessary for the stage 2 servers to be homogeneous:
higher-performance servers can be given larger IPv4 pools than the others.
It is desirable for the stage 1 servers to be similar, but only because it
may not be easy to configure multipath load-balancing to weight traffic to
different destinations.

## Sizing calculations

Suppose we take the following back-of-envelope parameters:

* 16:1 IP sharing ratio
* 0.1Mbps average translated traffic per user

Then a /23 IPv4 block (512 addresses) would amount to about 820Mbps of
traffic, suitable for a stage 2 server with 1Gbps ports.
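As a quick check of that figure (Python, using the assumed parameters
above):

~~~
# Back-of-envelope check of the /23 sizing above.
addresses = 2 ** (32 - 23)              # a /23 pool = 512 IPv4 addresses
users = addresses * 16                  # 16:1 sharing ratio
traffic_mbps = users * 0.1              # 0.1 Mbps average per user
print(addresses, users, traffic_mbps)   # 512 8192 819.2 -> roughly 820 Mbps
~~~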
If we decided that the IP sharing ratio should be only 8:1, then the same server would require a /22 IPv4 block (1024 addresses). A server with 10Gbps ports could be assigned ten times as many IPv4 addresses - unless it were CPU or RAM-bound, in which case the number of IPv4 addresses would be in proportion to its actual maximum processing capability. Note that Stage 1 boxes are handling traffic in one direction only, and so a single NIC would be equally utilised in both "in" and "out" directions. Stage 2 boxes are handling traffic in both directions; hence if the expected traffic is similar in both directions then separate ports for IPv4 and IPv6 would be beneficial. In practice, the traffic-per-user figure would have to be learned by experience, and may change over time as different types of users start using the NAT64 service. Having (say) 8,000 users (= 512 addresses x 16) going through one server would hopefully smooth out most peaks; however it's possible that if a handful of high-volume users suddenly make big demands this would result in a spike, and therefore it may be wise to aim for a lower steady-state peak. Servers do not have to be given equal-sized or contiguous pools. Indeed it is desirable to break addresses down into smaller chunks to give more fine-grained control. For example the 1G server could initially be given 32 x /28 blocks and the 10G server 320 x /28 blocks; later on the 1G server can have individual /28 blocks added or taken away (moved to other servers). This minimises the impact on end-users, as the consistent hashing change will only affect a small proportion of the users on that server. ## Consistent hash table implementation The size of the consistent hash data structure increases with the size of the IPv4 pool, and has to be available for each stage 1 server to consult. To take a moderately large scenario, let us consider: * A total IPv4 space of /12 equivalent (1M addresses; at 16:1 this is enough for 16M concurrent users) * M=256, i.e. each address appears 256 times on the CH ring * An efficient data structure taking an average of 16 bytes per entry. (Each entry maps a 64-bit hash key to a 32-bit IPv4 address, making 12 bytes, and there will be overhead in the data structure too. However the key space is well balanced, and an efficient implementation would be able to share prefixes of the key) This would require a total data structure with 256M entries and memory usage of 4GB, which is certainly feasible. Given such a large pool of address space, M=256 may not be necessary; M=64 may give sufficiently even balance. With a total IPv4 address space of /8 (surely more than a single cluster would ever have!) and M=64, the RAM requirement is still only 16GB. The data structure would have to be designed for efficient storage and lookup, but this is a well-explored area, and if the speed of lookup is the limiting factor then more stage 1 boxes can be added. The essential requirement is to be able to search for a particular key, and locate the key/value immediately preceding that key. Judy Arrays may be a suitable choice: `judyl` maps a 64-bit index to a 64-bit value, and the function `JLL()` will locate the last index equal to or less than the one given. (TODO: prototype judyl and measure its memory usage and lookup performance. 
[This article](http://preshing.com/20130107/this-hash-table-is-faster-than-a-judy-array/)
suggests average memory usage at or below 14 bytes per item, and lookup
times of around 500ns on a Core 2 Duo, as long as the cache is not under
heavy external pressure)

## Processing

At an average packet size of 512 bytes, 820Mbps of traffic is 200K packets
per second. Although it would be cost-effective if a single box could
achieve this, the horizontal scalability makes this moot.

Also, as the size of the user base increases, the rate at which sessions are
created and destroyed goes up. This is also divided across the available
stage 2 boxes and can be scaled accordingly.

## Equal-cost multipath

Since additional stage 1 and stage 2 boxes can be added as required, the
remaining scaling limitation is the ability of the network to distribute
incoming traffic amongst a large number of stage 1 boxes.

Existing network devices may have inherent limits as to the number of
destinations they may distribute between. This could be addressed by having
multiple tiers of routing: e.g. tier 1 distributes amongst N routers, each
of which in turn distributes amongst M destinations.

Note that only a single IPv6 prefix needs to be handled in this way: the
translator cluster's overall IPv6 prefix (e.g. NPFX::/96), or possibly a
small number of prefixes if the cluster supports multiple IPv6 prefixes.

## Interconnect

The interconnect (switched and/or routed) between stage 1 and stage 2 has to
be able to carry the entire traffic volume. Since this is unicast, and no
more than the total traffic entering the translator cluster, this is no
harder to build than delivering the required traffic to the translator
cluster in the first place.

For Internet-scale deployment, there would be multiple, independent
translator clusters dotted around the Internet. This is the subject of a
[separate paper](candler-interconnecting-the-internets.md.html).

# Management

## Failover

If a stage 1 box fails, traffic will simply be redistributed over the other
stage 1 boxes (as soon as the network load balancer detects this) and there
will be no impact.

If a stage 2 box responsible for a particular IPv4 range fails, then traffic
for those users will be redistributed across the remaining IPv4 address
space by the consistent hashing algorithm. This will keep the cluster
balanced, but will interrupt any ongoing sessions for those users.

Alternatively, it would be possible to run servers in pairs: one server is
primary for block A and backup for block B, and the other is primary for
block B and backup for block A. While they run they keep their state tables
in sync for both ranges, so that if one fails, the other can take over
immediately. The OpenBSD "pfsync" mechanism provides an example of how this
can be implemented.

This mechanism may make such failures less noticeable, at least at off-peak
times when servers are below 50% capacity; it could also be useful for
performing maintenance. In practice, unscheduled failures may be
sufficiently rare for this not to be a problem.

## Scheduled maintenance

For scheduled maintenance, all that is necessary is to have a few spare
stage 2 hosts, and to be able to sync the to-be-maintained host's NAT64
state with a spare host before failing over. Since the traffic cannot be
swung instantaneously, ideally the states should remain in sync
bi-directionally while the IPv6 traffic (stage 1 to stage 2) and external
IPv4 traffic (Internet to stage 2) are rerouted.
## Changes of IPv4 pools

When IPv4 pools change, some existing sessions will need to be interrupted.
It would be helpful if the stage 2 translator could send an RST for existing
TCP sessions when IPv4 pools are removed from it (and then remove its state
entries), unless it has a failover partner.

# Security considerations

## Direct use of intermediate addresses

If an end-user were able to send traffic to the stage 2 intermediate address
prefix, they would be able to select an arbitrary IPv4 source address
(and/or port) for their outgoing traffic.

Hence this should be blocked, for example by ACLs at the edge, or by making
the stage 1 to stage 2 interconnect a completely separate routing domain.
Using ULA addresses is also helpful for this reason.

Note that if a single AS contains multiple translator clusters, it would be
wise for each cluster to use a distinct intermediate prefix (especially if a
single iBGP mesh includes all translators).

## Denial of Service

Any NAT device is sensitive to DoS, particularly explosion of the state
table, and the stage 2 NAT64 in this design is no different.

### Spoofed token (lower 64 bits)

IPv6 allows the sender to choose any of 2^64 possible source addresses
within a prefix. This is a fundamental feature of the current IPv6
addressing architecture.

So whilst it would be desirable to keep statistics on utilisation for each
individual /128 address, if an attacker wants to hide her usage she can
simply continue to pick random source addresses until the NAT is no longer
able to keep track. She can also respond to traffic to all those addresses,
e.g. to complete a 3-way TCP handshake. The NAT therefore has no way to
distinguish between genuine and spoofed traffic.

To protect itself, the NAT will need to limit state generation at the level
of the /64 prefix, which means the attacker will be performing a DoS against
other users on her own network. This can only be traced by the local network
administrator, e.g. by looking at NDP tables.

### Spoofed prefix (upper 64 bits)

Unfortunately, many service providers do not have ingress filters to prevent
source address spoofing, and so the incoming source addresses arriving at
the translator may be completely arbitrary. The problems this can cause
include:

* Creation of many useless translation state entries
* Exhaustion of source ports
* False logging of IPv6 prefixes as "active", and thus junk mappings of IPv4
  address/port to spoofed IPv6 addresses

To avoid false logging and allocation of a port range to a spoofed prefix,
IPv6 prefixes should only be marked "active" after at least one successful
three-way TCP exchange.

To avoid the useless state entries and source port exhaustion, the stage 2
NAT may need to engage some mechanism similar to "SYN cookies" so that
long-lived NAT state is not created until after a successful three-way TCP
exchange.

UDP traffic cannot be protected in this way, as we have no way of knowing
whether return UDP traffic was successfully delivered or not. More heuristic
methods may be required.

It could be said that few devices would use UDP without any TCP at all;
therefore the successful establishment of TCP from a given IPv6 address
could whitelist that address for UDP as well. However if an attacker obtains
or guesses a valid source IPv6 address then they can spoof traffic which is
indistinguishable from genuine traffic from that address. It may therefore
also be necessary to limit the rate of UDP state creation or the total
number of UDP states per source.

Some devices (e.g.
SIP phones) may use UDP exclusively - although SIP is unlikely to work well with NAT64 anyway. If we allow the successful establishment of TCP from anywhere in a /64 prefix to whitelist the whole prefix for UDP, this is unlikely to be a problem. ### State and port exhaustion Even without address spoofing, a client can create a large number of TCP sockets and a large number of UDP sockets, and consume resources on the translator. If a cap is set at the limit of the /64 prefix, then the user will be able to perform a DoS against other users in their own network. If a cap is set at the limit of the /128 address then this can be avoided, however the attacker can easily circumvent this by choosing different source addresses as described above. ### Long-lived TCP sessions TCP sessions can hold state for an extended period of time, especially if the client or server vanish, and may increase utilisation towards the cap. Hence stale sessions must be pruned, at least in times of high demand. (TODO: can the NAT64 inject TCP keepalives even if the endpoints themselves are not using it?) ### UDP sessions If a client binds to one socket and sends to many destinations, we SHOULD use the same translated source port, so that STUN/ICE can work. However if there is much churn of client sockets, there could be much pressure on the available port space, and may have to fall back to shared port use (symmetric NAT). It is probably realistic to time out UDP translations after 30-60 seconds of inactivity. Clients have an expectation of having to refresh NAT state - although if they are on an IPv6-only network they may not realise that some of their traffic is going via a NAT64. Ports could be re-used in a LRU order, but this would make problems harder to debug - it is probably better to have a fixed UDP timeout. ### DNS A common source of UDP state is DNS. There is no good reason for anyone to use NAT64 to translate DNS queries. An IPv6-only user should be talking to a DNS(64) cache over IPv6, and that cache should be dual-stack. Anything else is misconfiguration. Therefore it is perfectly reasonable to block UDP and TCP port 53 entirely at the translator - or to return a canned DNS response with the fixed IP address of a webserver which explains the problem, essentially a captive portal. ### ICMP It is helpful for the NAT64 to work with ICMP echo request. This would mean that an end-user with a CLAT would be able to do "ping 8.8.8.8" and get a response - this means "The Internet Is Working [TM]". Such state can be very short-lived (of order of 5 seconds) and the number of concurrent states from a given prefix can be limited, and/or traffic heavily rate limited. ### Small packets A system tuned to handle a certain traffic volume under the assumption of an average packet size of (say) 512 bytes per packet may become overwhelmed given a stream of small packets of 64 bytes, as this will demand 8 times the processing. ### Statistics There should be statistical monitoring of both traffic (bps and pps) and state generation from active prefixes (aggregated per /64, per /56, per /48, and per /32), and the ability to apply temporary blocks if required. ## Issues with dynamic A+P A static A+P deployment will explicitly tie each address/port combination back to one source, but this may not always be true for dynamic A+P. 
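To illustrate what this means for an investigator, here is a minimal Python
sketch of the reverse lookup over the daily records described under "Logging
and traceability"; the record layout and names are assumptions, consistent
with the 32-byte records suggested there.

~~~
# Hypothetical daily record layout and reverse lookup; in the dynamic A+P
# case the result is a short list of candidate /64s rather than exactly one.
from dataclasses import dataclass
from ipaddress import IPv4Address, IPv6Network
from typing import List


@dataclass
class PrefixRecord:                  # one record per active /64 per day
    prefix: IPv6Network              # the client /64
    ipv4: IPv4Address                # its translated IPv4 address
    blocks: int                      # bitmap: bit b set if port block b was used


def candidates(records: List[PrefixRecord], ipv4: IPv4Address, port: int):
    """Return the /64 prefixes which could have sourced (ipv4, port)."""
    block = port // 1024             # the 1024-port block containing this port
    return [r.prefix for r in records
            if r.ipv4 == ipv4 and (r.blocks >> block) & 1]
~~~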
### Selection of source IPv4 address An attacker with a /48 route from their ISP can choose whichever public IPv4 address from the pool they want (or at least, any one of 65,536 choices), simply by rotating through their 2^16 available prefixes. If the hash algorithm and IPv4 pool ranges are public they can even do this off-line. This could be used to make an attack appear like it is coming from multiple sources, when they are in fact the same source; it can divide a large volume of traffic into 2^16 smaller streams. If the target queries each of the IPv4 addresses from the translator logs, they will find that all those addresses include mappings to prefixes within the same /48 range, and may be able to infer the true source of the attack. Prefix selection can also be used to purposely make the attacker's traffic come from the same IPv4 address used by a different, trustworthy network. To some extent this is inherent to the concept of address sharing, but in this case the attacker is allowed to select their sharing partner. If the translator cluster has less than a /16 of address space in total then the attacker will be able to find multiple prefixes which map to the same IPv4 address and consume an unfair share of dedicated ports on that address. To address this issue, we could consider using only the first 48 bits of the source IPv6 address in the consistent hash algorithm. The problem this would cause is when there is a genuine large site with a /48 block (say, a university): we do not want every single network in the university to map to the same IPv4 address, as this would create excessive load in a single stage 2 server (traffic load, state, and demand on available source ports). We could also consider using 56 bits, given that many ISPs are allocating /56's to end users. Such a user would have no choice over their translated IPv4 address, and a user with a /48 would only be able to choose between 256 of them. However a large network like a university might have to reorganise their prefixes to distribute load among those 256 available translated addresses. This point remains open to discussion, but from a basic engineering point of view it is still preferable to use 64 bits of the prefix to give an even distribution of addresses for larger client networks. In any case, users may easily obtain additional /48 blocks (e.g. from tunnel brokers) or even a /32 or more by joining a RIR. At worst there is always the option of blocking traffic from any ranges causing persistent abuse. If there is a need for them, RBL-style blacklists for IPv6 will spring up. ### Selection of source port The port selection algorithm is designed to allow "busy" networks to make use of a large number of ports in a shared range. An attacker can easily open multiple sockets bound to multiple addresses and create as many ports as they wish. This would be a denial-of-service against their own network. If the particular IPv4 address has one or more legitimate, "busy" networks on it, then the attacker may end up using some port ranges which are shared with those networks. This would be intended to mislead an investigation. However at worst it would only increase the number of leads which have to be followed - the information that the attacker was using a particular port range would not be lost, only that there are multiple possible users of that port range. Even if there is a 16:1 sharing ratio, then at worst 16 networks would be using the same extended port ranges; in practice it is likely to be far lower. 
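Returning to the off-line search described under "Selection of source IPv4
address": the following sketch shows how simple that search is, assuming the
pool list and hash construction are public and reusing the hypothetical
`select_ipv4()` helper from the consistent-hashing sketch earlier.

~~~
# Off-line search: which /64 inside my /48 maps to the IPv4 address I want?
# Reuses select_ipv4() from the earlier consistent-hashing sketch.
from ipaddress import IPv4Address, IPv6Address, IPv6Network


def find_prefix_for(target: IPv4Address, block48: IPv6Network, keys, addrs):
    base = int(block48.network_address)
    for subnet in range(2 ** 16):               # every /64 within the /48
        candidate = base | (subnet << 64)       # set the 16 subnet-id bits
        if select_ipv4(keys, addrs, IPv6Address(candidate)) == target:
            return IPv6Network((candidate, 64))
    return None    # with a large pool, not every address may be reachable
~~~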
### Dense networks

In some cases a single /64 prefix may be supporting many more than 250
devices (e.g. a large hotel, or a conference wireless network). In this
translator design, they will all be mapped to the same IPv4 address and so
will be sharing a single port range, which will suffer severe pressure - as
indeed happens today if the hotel built their network with NAT44 and a
single public IPv4 address. It would be better engineering if the hotel were
to divide their network into subnets, which would spread the load across
multiple IPv4 addresses, or even route a separate /64 to each room.

The mitigating factor here is that if the hotel has built a pure IPv6-only
network, then at least connections to dual-stack destinations will continue
to work just fine, even if IPv4 ports are exhausted in the translator,
rather than suffering a total network collapse.

# References

TODO: Use the proper NAT64 terminology throughout

* RFC1071/1141/1624: Checksum algorithm
* RFC5245: ICE
* RFC5389: STUN
* RFC6052: IPv6 Addressing of IPv4/IPv6 Translators
* RFC6144: Framework for IPv4/IPv6 Translation
* RFC6145: IP/ICMP Translation Algorithm
* RFC6146: Stateful NAT64
* RFC6147: DNS64
* RFC6296: IPv6 NPT (NAT66)
* RFC6877: 464XLAT
* RFC7225: NAT64 prefix discovery with PCP
* RFC7269: NAT64 deployment options and experience
* Judy arrays:
* NAT types: