1 Abstract
2 Motivation
3 High-level design
3.1 Two-layer architecture
3.2 Format of intermediate addresses
3.3 Example of operation
3.4 Checksum compensation
3.5 Multiple translator prefixes
4 Static A+P configuration
5 Dynamic A+P operation
5.1 Consistent hashing
5.2 Masked IPv6 source bits
5.3 Distribution of source networks
5.4 Source port requirements
5.5 Source port selection algorithm
5.5.1 UDP and TCP
5.6 Logging and traceability
5.7 Configuration
6 Additional implementation options
6.1 Mixed static and dynamic operation
6.2 Source port selection and NAT type
6.3 Positioning of destination IPv4 address
6.4 Use for NAT44
7 Scaling
7.1 Sizing calculations
7.2 Consistent hash table implementation
7.3 Processing
7.4 Equal-cost multipath
7.5 Interconnect
8 Management
8.1 Failover
8.2 Scheduled maintenance
8.3 Changes of IPv4 pools
9 Security considerations
9.1 Direct use of intermediate addresses
9.2 Denial of Service
9.2.1 Spoofed token (lower 64 bits)
9.2.2 Spoofed prefix (upper 64 bits)
9.2.3 State and port exhaustion
9.2.4 Long-lived TCP sessions
9.2.5 UDP sessions
9.2.6 DNS
9.2.7 ICMP
9.2.8 Small packets
9.2.9 Statistics
9.3 Issues with dynamic A+P
9.3.1 Selection of source IPv4 address
9.3.2 Selection of source port
9.3.3 Dense networks
10 References
This document outlines a design for a NAT64 translator cluster which can scale to arbitrary traffic volumes, as would be required for translating a non-trivial proportion of Internet traffic.
Commercial large-scale or “carrier-grade” NAT64 implementations are available, but they are proprietary, expensive, and of uncertain scalability.
The design outlined here has the following characteristics:
A collection of hosts logically forms a single “translator cluster”, which translates IPv6 traffic sent to destination addresses of the form NPFX::z.z.z.z into IPv4 traffic to address z.z.z.z (where NPFX is the chosen IPv6 prefix for the cluster)
Internally, the cluster comprises a number of hosts grouped in two stages: a NAT66 stage and a NAT64 stage. (Combining both functions into a single host is perfectly possible but not considered here)
 s USER:a
 d NPFX::z            s USER:a
                      d STG2:x:y:z                s x (IPv4)
               +---+     ,------------------>+---+  d z (IPv4)
 ------------->|   | --.,------------------->|   | <---------->
               +---+    \/\                  +---+
               +---+    /\ `---------------->+---+
 ------------->|   | --/  \----------------->|   | <---------->
               +---+       \                 +---+
              STAGE 1       `--------------->+---+
                                             |   | <---------->
                                             +---+
                                            STAGE 2
The first stage is a stateless, deterministic selection of IPv4 source address as a function of the IPv6 source address (USER:a). All stage 1 hosts are identical. The destination IPv6 address is rewritten to an intermediate form which contains both the selected IPv4 source address (x) and the target destination IPv4 address (z). These intermediate addresses lie within an internal prefix STG2::/32.
IPv6 routing then naturally delivers the packet to the correct stage 2 box, which announces the prefix covering STG2:x::/64.
The second stage is a traditional stateful NAT64, except that the source IPv4 address (x) has already been selected and is extracted from the IPv6 destination address.
Each stage 2 box “owns” its own range or ranges of IPv4 addresses, and regular IPv4 routing is used to ensure the return traffic hits the right box. After conversion back to IPv6, the return traffic can be dumped straight back into the IPv6 network, and need not touch the Stage 1 boxes again.
Scalability is therefore achieved by:

- adding stage 1 boxes at will, since they are stateless and identical
- adding stage 2 boxes, each bringing its own IPv4 pool, with ordinary routing steering traffic between the stages
- keeping return traffic off the stage 1 boxes entirely
The intermediate addresses are rewritten inside a /32 IPv6 prefix. This need not be globally routable: ULA addresses are fine. The example here uses fd00::/32
These addresses are structured as follows:
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|  STG2 prefix  |  IPv4 source  |chkcomp|portsel|   IPv4 dest   |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
  fd  00  00  00
  (example)
The “chkcomp” field compensates for the change this rewrite would otherwise force in the upper-layer (TCP/UDP) checksum, using the algorithm in RFC6296.
The field “portsel” may be used to influence stage 2 translator operation, and is described later but is not required for basic NAT64 operation.
Both fields are shown as zero in the following exposition.
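As a sketch of the packing (Python; field layout as in the diagram above, chkcomp and portsel left at zero, names of my choosing):

    # Sketch: packing the intermediate IPv6 address (chkcomp and portsel zero).
    import ipaddress

    def make_intermediate(ipv4_src, ipv4_dst, chkcomp=0, portsel=0, stg2="fd00::"):
        b = (ipaddress.IPv6Address(stg2).packed[:4]        # STG2 prefix, bits 0-31
             + ipaddress.IPv4Address(ipv4_src).packed      # IPv4 source, bits 32-63
             + chkcomp.to_bytes(2, "big")                  # chkcomp,     bits 64-79
             + portsel.to_bytes(2, "big")                  # portsel,     bits 80-95
             + ipaddress.IPv4Address(ipv4_dst).packed)     # IPv4 dest,   bits 96-127
        return ipaddress.IPv6Address(b)

    # make_intermediate("192.0.2.100", "203.0.113.99")
    #   -> fd00:0:c000:264::cb00:7163, i.e. fd00:0000:c000:0264::203.0.113.99
    # (these values reappear in the example of operation below)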
Let's say we have three stage 2 servers, owning the following IPv4 pools:

- S2-a: 192.0.2.0/26
- S2-b: 192.0.2.64/26
- S2-c: 192.0.2.128/26 and 192.0.2.192/28
Each S2 server announces an IPv6 prefix or prefixes with the IPv4 prefix(es) it owns shifted left 64 bits (i.e. in bits 32-63 of the IPv6 address). These announcements are made towards the S1 servers.
S2-a announces fd00:0000:c000:0200::/58
S2-b announces fd00:0000:c000:0240::/58
S2-c announces fd00:0000:c000:0280::/58
and
fd00:0000:c000:02c0::/60
fd00:0000 : c0 00 : 02 40 : 0000:0000:0000:0000
            11000000 00000000 00000010 01000000
--------------------------------------->|         /58

is equivalent to

fd00:0000 : 192 . 0 . 2 . 64 : 0000:0000:0.0.0.0
            11000000 00000000 00000010 01000000
            --------------------------->|         /26
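This correspondence is mechanical to compute; a minimal sketch in Python (the function name is mine):

    # Sketch: deriving a stage 2 server's IPv6 announcement from an IPv4 pool
    # prefix, by placing the IPv4 bits at bits 32-63 of the STG2 prefix.
    import ipaddress

    def s2_announcement(ipv4_pool, stg2="fd00::/32"):
        pool = ipaddress.IPv4Network(ipv4_pool)
        base = ipaddress.IPv6Network(stg2)
        addr = int(base.network_address) | (int(pool.network_address) << 64)
        return ipaddress.IPv6Network((addr, pool.prefixlen + 32))

    # s2_announcement("192.0.2.64/26") -> fd00:0:c000:240::/58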
Let's assume the IPv6 translator prefix announced to the public IPv6 Internet is 2001:db8:6464::/96.
Now consider what happens when an IPv6 packet arrives with source 2001:db8:1234::5 and destination 2001:db8:6464::203.0.113.99.

1. The network delivers the packet to one of the (identical) stage 1 boxes.

2. The stage 1 box hashes the source prefix 2001:db8:1234::/64 and deterministically selects an IPv4 source address; say it selects 192.0.2.100.

3. It rewrites the destination address to fd00:0000:192.0.2.100::203.0.113.99, more correctly represented as fd00:0000:c000:0264::203.0.113.99, and sends it out.

4. IPv6 routing carries the packet to the stage 2 box owning 192.0.2.100 (fd00:0000:c000:0264::203.0.113.99 is within fd00:0000:c000:0240::/58, announced by S2-b).

5. That box extracts the IPv4 source 192.0.2.100 and destination 203.0.113.99 from the intermediate address, selects a source port, records the session state, and emits the IPv4 packet.

6. Return IPv4 traffic for 192.0.2.100 is routed back to the same stage 2 box, which uses its state to translate it back to IPv6, with source 2001:db8:6464::203.0.113.99 and destination 2001:db8:1234::5, and delivers it straight into the IPv6 network, bypassing stage 1.
The above example is only slightly simplified: it ignores the checksum compensation (in step 3); it assumes that source port selection is entirely at the discretion of the stage 2 box (in step 5); and that the cluster has only one public IPv6 translator prefix (in step 6). Those details will be presented next.
Bits 64-79 of the translated destination address (the “chkcomp” field) are set using the algorithm in sections 3.2 and 3.5 of RFC6296, to make the overall stage 1 translation checksum-neutral. This is a simple calculation which avoids any further need for the stage 1 translator to inspect or update the upper-level protocol checksum (e.g. TCP/UDP).
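A minimal sketch of that calculation in Python (one's-complement arithmetic in the spirit of RFC6296; helper names are mine, and corner cases such as an all-ones result are ignored here):

    # Sketch: choosing the chkcomp word so that the destination-address
    # rewrite is checksum-neutral for the upper-layer protocol.
    import ipaddress

    def ones_complement_sum(words):
        total = 0
        for w in words:
            total += w
            total = (total & 0xffff) + (total >> 16)   # end-around carry
        return total

    def address_words(addr):
        p = ipaddress.IPv6Address(addr).packed
        return [int.from_bytes(p[i:i+2], "big") for i in range(0, 16, 2)]

    def chkcomp_for(original_dst, intermediate_dst_with_zero_chkcomp):
        s_orig = ones_complement_sum(address_words(original_dst))
        s_new = ones_complement_sum(address_words(intermediate_dst_with_zero_chkcomp))
        # chkcomp = s_orig plus the one's-complement negation of s_new, so that
        # adding it into the rewritten address restores the original sum.
        return ones_complement_sum([s_orig, (~s_new) & 0xffff])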
It is possible the translator cluster operator will want to serve multiple IPv6 NAT64 prefixes using the same cluster. Examples are:
2001:db8:6464::/96
and an anycast block like 64:ff9b::/96
The mechanism proposed is to use the lower two bits of the “portsel” field for this (the remaining 16-bit field of the intermediate address, occupying bits 80-95).
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
|                                                       | P | P |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
  80                                                      94  95
Both the stage 1 and stage 2 translators are configured with a static mapping table of up to four public IPv6 prefixes.
The stage 1 translator inserts the P bits in the intermediate address, dependent on the original destination IPv6 address seen (before it is rewritten into the intermediate form). The stage 2 translator records them along with the session state; and when incoming IPv4 packets are received, uses this state to select the appropriate IPv6 source prefix for return packets.
(An alternative would be to use high bits from the STG2 prefix, e.g. bits 16-31)
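A sketch of the mechanism in Python, using the two example prefixes above (the table structure itself is hypothetical):

    # Sketch: the P bits select one of up to four public translator
    # prefixes, identically configured on stage 1 and stage 2.
    import ipaddress

    PREFIXES = {
        0: ipaddress.IPv6Network("2001:db8:6464::/96"),
        1: ipaddress.IPv6Network("64:ff9b::/96"),
    }

    def p_bits_for(original_dst):
        # Stage 1: record which public prefix the client used.
        for p, net in PREFIXES.items():
            if ipaddress.IPv6Address(original_dst) in net:
                return p
        raise ValueError("destination is not a translator prefix")

    def return_source(p_bits, remote_ipv4):
        # Stage 2: rebuild the IPv6 source for return packets from saved P bits.
        net = PREFIXES[p_bits]
        return ipaddress.IPv6Address(
            int(net.network_address) | int(ipaddress.IPv4Address(remote_ipv4)))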
Some ISP/carrier configurations may wish to use an explicit static mapping from IPv6 prefixes to IPv4 source address and port range. This can be configured simply using a static prefix lookup table, distributed to each of the stage 1 servers.
The stage 1 server selects both an IPv4 address and a port range for each IPv6 prefix. The remaining “portsel” bits in the intermediate IPv6 address are used to signal the selected port range to the stage 2 host.
The exact use of these bits is up to the design of the translator cluster, but here is a simple approach:
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| B B B B B B | 0 0 | N N N N N N | P P |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
A value of all zeros in all of the first 14 bits (B=0, N=0) means that the stage 2 translator is permitted to use the whole range of source ports (1024-65535). Otherwise it breaks down as:

- B (6 bits): the number of the first 1024-port block of the range (1-63; block 0, ports 0-1023, is never allocated)
- N (6 bits): the number of consecutive 1024-port blocks in the range
- P (2 bits): the public prefix selector described in section 3.5

This allows port range sizes of anywhere between 1024 ports and the full range.
Example:
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 1   0   0   1   1   0 | 0   0 | 0   0   0   0   1   0 | 0   0 |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+

i.e. B=38, N=2, P=0: source ports 38912-40959 (blocks 38 and 39), with return traffic using the first public prefix.
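Under the B/N interpretation given above (which is inferred rather than spelled out), a minimal encode/decode sketch in Python:

    def encode_portsel(first_block, n_blocks, p_bits=0):
        # B (6 bits) | 00 | N (6 bits) | P (2 bits), per the layout above
        assert 1 <= first_block <= 63 and 1 <= n_blocks <= 63 and 0 <= p_bits <= 3
        return (first_block << 10) | (n_blocks << 2) | p_bits

    def decode_port_range(portsel):
        b, n = portsel >> 10, (portsel >> 2) & 0x3f
        if b == 0 and n == 0:
            return (1024, 65535)                  # whole usable range
        return (b * 1024, (b + n) * 1024 - 1)

    # encode_portsel(38, 2) == 0x9808, matching the example above;
    # decode_port_range(0x9808) == (38912, 40959)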
For Internet-scale use it is not practical to configure explicit per-client mappings. A zero-configuration approach is described here.
The approach by which a stage 1 server selects an IPv4 address to use is called consistent hashing. It works as follows: each IPv4 address in the pool is hashed M times (once per index 0..M-1), placing M points for that address around a circular hash space. To select the IPv4 address for a given IPv6 source, the source prefix is hashed onto the same circle, and the point at or immediately preceding that value (wrapping around) determines the address used. When an IPv4 address is added to or removed from the pool, only sources whose hashes fall next to that address's M points are remapped.
Crucially however, the majority of users continue to use their existing IPv4 address unchanged, and hence their NAT64 sessions are unaffected.
The following example shows prefix 192.0.2.0/30 (4 IPv4 addresses), M=4, and a 16-bit hash space, drawn as a circle running clockwise from the top:

                  *0fee [192.0.2.0]
[192.0.2.3] ff7c*       *3559 [192.0.2.1]
[192.0.2.2] f9ee*       *3d95 [192.0.2.1]
[192.0.2.3] f763*       *403e [192.0.2.3]
[192.0.2.0] e482*       *4ace [192.0.2.1]
[192.0.2.2] e1c5*       *4cdd [192.0.2.3]
[192.0.2.0] e0c2*       *5e23 [192.0.2.1]
[192.0.2.2] e0bc*       *bd91 [192.0.2.0]
                  *d154 [192.0.2.2]
These values were obtained by taking the first 16 bits of the MD5 hash of a string formed from each IPv4 address and an index 0-3 (sixteen strings in all).
The actual hash doesn't really matter, as long as it gives a reasonable random spread. The example above turns out to be quite badly distributed: some arcs are very large (5e23...bd91) and some tiny (e0bc...e0c2), and the overall distribution of IPv4 usage (the fraction of the hash space mapping to each address) is:

192.0.2.0   30%
192.0.2.1   42%
192.0.2.2    9%
192.0.2.3   18%
In practice, a higher-performing algorithm than MD5 would be used, at least when processing the incoming IPv6 source addresses; and all the processing would be in binary not ASCII representation.
A suitable hash algorithm would give good random spread and yet be efficient to implement on 64-bit processors and/or direct hardware implementations.
(TODO: assess suitability of algorithms. MurmurHash3-128? CityHash64? Older functions like CRC-64-ECMA, Fletcher-64? Note: Nehalem/i7/SSE4.2 has a CRC32 primitive on-chip!)
(TODO: if two addresses hash to exactly the same value, need to define which takes precedence)
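As an illustrative sketch (in Python, using MD5 and string-form point names as in the example above; the collision rule from the TODO is resolved here arbitrarily):

    # A minimal consistent-hash sketch. MD5 and the textual point names are
    # illustrative only; a production version would use a faster hash over
    # binary keys, as noted above.
    import hashlib
    from bisect import bisect_right

    class ConsistentHash:
        def __init__(self, ipv4_addresses, m=4):
            points = {}
            for addr in ipv4_addresses:
                for i in range(m):
                    h = self._hash16(f"{addr}/{i}")     # hypothetical point name
                    # Collision precedence (the TODO above): lowest address wins.
                    if h not in points or addr < points[h]:
                        points[h] = addr
            self._keys = sorted(points)
            self._addrs = [points[k] for k in self._keys]

        @staticmethod
        def _hash16(s):
            return int.from_bytes(hashlib.md5(s.encode()).digest()[:2], "big")

        def lookup(self, source_prefix):
            h = self._hash16(source_prefix)
            i = bisect_right(self._keys, h) - 1    # point at or preceding h
            return self._addrs[i]                  # i == -1 wraps to the last point

    # ch = ConsistentHash(["192.0.2.%d" % i for i in range(4)], m=4)
    # ch.lookup("2001:db8:1234::/64") -> one of the four pool addresses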
It is not desirable to hash the whole 128 bits of source IPv6 address when selecting the IPv4 address to use.
If we used all 128 bits, then the users in any one network would be spread evenly over all the available IPv4 addresses; worse, the use of periodically-changing privacy addresses means that a single network would eventually make use of every available IPv4 address.
We can improve this by hashing only a portion of the source IPv6 address.
If we take a hash of the first 64 bits only, then this means that all the users in one particular network will map to the same public IPv4 address. This is in any case what users expect when sitting behind a traditional NAT. Such mappings would rarely change (only when IPv4 prefixes are added to or removed from the translator cluster).
Whatever the pattern of usage, the consistent hash will aim to spread it evenly over the available IPv4 addresses. For example, if there are 4 million users and 1 million IPv4 addresses, on average each address will be in use by 4 users.
Some addresses will be used by fewer, and some by more, although the probability of an individual address being used by (say) 6 or more users will be low.
TODO: Do the math
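As a first pass at that math (a sketch, assuming /64 prefixes land on addresses independently and uniformly, which the consistent hash only approximates): the number of prefixes sharing one address is then roughly Poisson-distributed,

    P(k users) = e^(-λ) λ^k / k!,   λ = 4

giving P(6 or more) ≈ 0.21 and P(10 or more) ≈ 0.008. Moderate overloading of an individual address is therefore not unusual, but heavy overloading is rare.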
Empirically, we know that a typical office or school network with NAT44 normally has a single public IPv4 address, and it works fine. If we take it as good practice that a layer 2 broadcast domain (subnet) has up to 250 devices on it, then we believe those 250 devices happily share a range of around 64,000 ports. If all were active at the same time, they would be using 256 ports each on average; if only a quarter were active at the same time then they would be happy with an average of 1024 ports each. This also ties up with our experience of client devices: if you type “netstat” on a client device it would be rare to see many hundreds of open sockets.
This means that in principle, an individual end user or small home network might be happy with an allocation of maybe as little as 1024 ports. However a larger office or school network (also a /64) may require much more.
Our port allocation strategy has to allow for this, whilst ideally maintaining separate port ranges for each user. Here is a proposed approach.
The port space is divided into 64 blocks of 1024 ports. Block 0 is reserved. The remainder are split into two ranges, “dedicated” and “shared”.
+-----------------+
| 63 |
| ... SHARED |
| D+1 |
+-----------------+
| D |
| ... DEDICATED |
| 1 |
+-----------------+
| 0 Reserved |
+-----------------+
This is a static, system-wide split. For example, D=31 gives ports 1024-32767 in the dedicated range, and 32768-65535 in the shared range.
Dynamic port allocation is signalled using the following layout in portsel; the values would likely be fixed in the stage 1 servers.
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
| 0 0 0 0 0 0 L L | N N N N N N | P P |
+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+---+
Now let's look at the characteristics of this algorithm when given an aggressive sharing ratio of 16:1 for /64 prefixes to IPv4 addresses, and D=31, N=2, L=3.
The upshot of this is that activity from IPv4 address X port Y can be mapped with a good degree of certainty to a small number of IPv6 source /64 prefixes, usually one.
This gives a “pseudo A+P” architecture, where we have not statically allocated exactly one customer to each port range, but the number of users per port range is small.
Separate port range assignments could be maintained for UDP and TCP for each /64 prefix, so that heavy port pressure from one protocol does not unnecessarily consume port ranges for the other protocol.
If desired, the stage 1 translator could even pass different parameters in the portsel field for UDP and TCP, for different port allocation strategies. The dedicated/shared split could also be different per protocol.
On the other hand, it would be simpler to use the same set of port ranges for UDP and TCP; it halves the size of the data structures, and avoids confusion in cases of querying “who was using port X?” when it's not clear which protocol was involved.
For abuse tracking and law enforcement purposes, it is necessary to be able to trace activity back to the source IPv6 address. It is assumed this can be weakened to “the source /64 network”, since if it were an IPv4 network it would probably be sharing a single IPv4 address with NAT44 anyway.
The algorithms outlined above make it unnecessary to have copious per-session logs.
The only logging necessary is to note each active IPv6 /64 prefix seen, along with the translated IPv4 address and port range(s) used by that prefix.
This could be done separately for each period of (say) 24 hours. For that period you would record, for each active source /64 prefix:

- the translated IPv4 address selected for it
- the port range(s) allocated to it during the period (per protocol, if UDP and TCP are tracked separately)
A query regarding suspicious activity from source address X and source port Y (at time T) could then be mapped back: often to a single source, sometimes 2, very rarely 3 or more, without expensive logs.
The fact that there may be more than one upstream match is hoped not to be a problem in practice. If there are two possible sources, then the investigation needs to try them both. Often the context will quickly make it clear which is the right one. The culprit may claim “reasonable doubt” but it's only necessary to exclude the other one.
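A minimal sketch of such a per-period record and the reverse query it supports (Python; the exact data structure is an assumption, not prescribed above):

    # (translated IPv4 address, 1024-port block) -> source /64 prefixes seen
    from collections import defaultdict

    period_log = defaultdict(set)

    def record_allocation(src_prefix64, ipv4_addr, blocks):
        for b in blocks:
            period_log[(ipv4_addr, b)].add(src_prefix64)

    def who_used(ipv4_addr, port):
        # Map the port back to its block; usually one prefix, occasionally more.
        return period_log.get((ipv4_addr, port // 1024), set())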
The IPv4 pools could simply be configured statically on every server, but each change would then have to be synchronised across the whole cluster. Better would be to statically configure IPv4 prefixes on the stage 2 servers only, have them announce both the IPv4 and mapped IPv6 versions to the network; the stage 1 servers could then learn the full set of IPv4 prefixes from the mapped IPv6 route announcements.
Yet another way would be to have full central control, assign each server just a loopback address, and distribute all the routes in iBGP from a central control node (in the same way that some ISPs distribute customer static routes).
          STAGE 1               STAGE 2
           +---+                 +---+
 RTRS .....|   |+ ...............|   |+ ..... RTRS
   ^       +---+|+               +---+|+        ^
   |        +---+|                +---+|        |
   |         +---+                 +---+        |
    \         ^^^                              /
     \        \\\                             /
      \    +----------+                      /
       `---|  route   |---------------------'
           |reflectors|
           +----------+
                ^
                | iBGP
             control
               node
Routes distributed via iBGP:

- the public translator prefix(es) (e.g. 2001:db8:6464::/96), with next hops on the stage 1 servers
- the intermediate STG2 prefixes (e.g. fd00:0000:c000:0240::/58), each with a next hop on the stage 2 server owning the corresponding pool
- the matching public IPv4 pool prefixes (e.g. 192.0.2.64/26), likewise pointing at the owning stage 2 server
It's perfectly reasonable to use a static mapping for certain specified IPv6 prefixes, and use dynamic mapping for everything else.
All that is needed is that any IPv4 addresses which are used for static mapping are excluded from the consistent hashing algorithm.
In addition, static IPv6 mappings may be for mixed and overlapping prefixes, for example a single host (/128) could have its own dedicated port range, whilst other hosts in the same /64 could share a different range. This would be implemented as a longest-matching-prefix rule.
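A minimal sketch of the longest-matching-prefix lookup (Python; the table contents are hypothetical):

    # Static A+P entries; no match means "fall back to the consistent hash".
    import ipaddress

    STATIC_MAP = {
        ipaddress.ip_network("2001:db8:1234::/64"):   ("198.51.100.1", (32768, 36863)),
        ipaddress.ip_network("2001:db8:1234::5/128"): ("198.51.100.1", (4096, 8191)),
    }

    def static_lookup(src_addr):
        a = ipaddress.ip_address(src_addr)
        best = None
        for net, mapping in STATIC_MAP.items():
            if a in net and (best is None or net.prefixlen > best[0].prefixlen):
                best = (net, mapping)
        return best[1] if best else None    # None -> use dynamic mapping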
Subject to any chosen port range, the translated source port is then entirely up to the stage 2 NAT64. The essential constraint is that each UDP or TCP session must have a unique tuple of (translated source address, translated source port, destination address, destination port). The NAT64 has full freedom in the choice of translated source port, but the other three values are fixed.
If a translated source port is dedicated to a particular tuple of (original source address, original source port) then this makes a “cone NAT”, and this is able to support certain direct peer-to-peer traffic patterns (e.g. STUN/ICE) which would be extremely helpful for certain applications. However it increases the pressure on the limited pool of available source ports.
Alternatively, source ports can be reused when talking to a different destination address and/or destination port. This gives a “symmetric NAT” behaviour, which does not support these peer-to-peer applications.
Some possible compromises include:

- behaving as a cone NAT for UDP (where STUN/ICE matters most) but as a symmetric NAT for TCP
- starting with cone behaviour, and falling back to symmetric-style port reuse only when port pressure demands it (as discussed later under UDP sessions)
The examples so far have shown the translator advertising a /96 prefix with the destination IPv4 address in the last 32 bits.
The NAT64 numbering scheme in RFC6052 allows the target IPv4 address to be carried higher up in the IPv6 address than the last 32 bits. For this application there seems no particular reason to do this. If it were done, it would require moving the destination IPv4 address down to the last 32 bits anyway during the formation of the intermediate address.
This architecture is also easily adapted for NAT44, by making the first stage a stateless NAT46, using all 32 bits of the IPv4 source address for the consistent hash, and transforming the source address to ::ffff:x.x.x.x. There is no need for the prefix selector in this case.
The stage 1 hosts will need both IPv4 and IPv6 interfaces. The stage 2 hosts can dump the return traffic directly into IPv4, through their existing IPv4 interface.
As there is no shared state, additional stage 1 boxes can be added at will (as either CPU or port bandwidth limits are approached). The only requirement is that the network can distribute incoming traffic evenly across them; this may be done by equal-cost multipath.
On the stage 2 boxes, there are two cases:

- a box limited by port bandwidth, whose IPv4 pool should be sized to the traffic its ports can carry
- a box limited by CPU or RAM, whose pool should be sized in proportion to its actual maximum processing capability
The ideal size of IPv4 pool per server would most likely be learned through experience, and hence scaling of stage 2 could be done by adding new servers with new IPv4 pools of the correct size, leaving other servers unchanged.
It is not necessary for the stage 2 servers to be homogeneous: higher-performance servers can be given larger IPv4 pools than the others.
It is desirable for the stage 1 servers to be similar, but only because it may not be easy to configure multipath load-balancing to weight traffic differently to different destinations.
Suppose we take the following back-of-envelope parameters:

- average traffic per user at peak: 100kbps
- IP sharing ratio: 16 users per IPv4 address
- a server with 1Gbps ports

Such a server can carry roughly 8,192 users' traffic (about 820Mbps), and at 16:1 would therefore be assigned a /23 IPv4 block (512 addresses).
If we decided that the IP sharing ratio should be only 8:1, then the same server would require a /22 IPv4 block (1024 addresses).
A server with 10Gbps ports could be assigned ten times as many IPv4 addresses - unless it were CPU or RAM-bound, in which case the number of IPv4 addresses would be in proportion to its actual maximum processing capability.
Note that Stage 1 boxes are handling traffic in one direction only, and so a single NIC would be equally utilised in both “in” and “out” directions. Stage 2 boxes are handling traffic in both directions; hence if the expected traffic is similar in both directions then separate ports for IPv4 and IPv6 would be beneficial.
In practice, the traffic-per-user figure would have to be learned by experience, and may change over time as different types of users start using the NAT64 service. Having (say) 8,000 users (= 512 addresses x 16) going through one server would hopefully smooth out most peaks; however it's possible that if a handful of high-volume users suddenly make big demands this would result in a spike, and therefore it may be wise to aim for a lower steady-state peak.
Servers do not have to be given equal-sized or contiguous pools. Indeed it is desirable to break addresses down into smaller chunks to give more fine-grained control. For example the 1G server could initially be given 32 x /28 blocks and the 10G server 320 x /28 blocks; later on the 1G server can have individual /28 blocks added or taken away (moved to other servers). This minimises the impact on end-users, as the consistent hashing change will only affect a small proportion of the users on that server.
The size of the consistent hash data structure increases with the size of the IPv4 pool, and has to be available for each stage 1 server to consult.
To take a moderately large scenario, let us consider:

- an IPv4 pool of 1 million addresses (roughly a /12)
- M=256 hash points per address
- 16 bytes of storage per point (a 64-bit key and a 64-bit value)

That is 256 million points, or around 4GB of RAM.
Given such a large pool of address space, M=256 may not be necessary; M=64 may give sufficiently even balance.
With a total IPv4 address space of /8 (surely more than a single cluster would ever have!) and M=64, the RAM requirement is still only 16GB.
The data structure would have to be designed for efficient storage and lookup, but this is a well-explored area, and if the speed of lookup is the limiting factor then more stage 1 boxes can be added.
The essential requirement is to be able to search for a particular key, and locate the key/value immediately preceding that key. Judy Arrays may be a suitable choice: judyl maps a 64-bit index to a 64-bit value, and the function JLL() will locate the last index equal to or less than the one given.
(TODO: prototype judyl and measure its memory usage and lookup performance. This article suggests average memory usage at or below 14 bytes per item, and lookup times of around 500ns on a Core 2 Duo, as long as the cache is not under heavy external pressure)
At an average packet size of 512 bytes, 820Mbps of traffic is 200K packets per second. Although it would be cost-effective if a single box could achieve this, the horizontal scalability makes this moot.
Also, as the size of the user base increases, the rate at which sessions are created and destroyed goes up. This is also divided across the available stage 2 boxes and can be scaled accordingly.
Since additional stage 1 and stage 2 boxes can be added as required, the remaining scaling limitation is the ability of the network to distribute incoming traffic amongst a large number of stage 1 boxes. Existing network devices may have inherent limits as to the number of destinations they may distribute between.
This could be addressed by having multiple tiers of routing: e.g. tier 1 distributes amongst N routers, each of which in turn distributes amongst M destinations.
Note that only a single IPv6 prefix needs to be handled in this way: the translator cluster's overall IPv6 prefix (e.g. NPFX::/96), or possibly a small number of prefixes if the cluster supports multiple IPv6 prefixes.
The interconnect (switched and/or routed) between stage 1 and stage 2 has to be able to carry the entire traffic volume. Since this is unicast, and no more than the total traffic entering the translator cluster, this is no harder to build than delivering the required traffic to the translator cluster in the first place.
For Internet-scale deployment, there would be multiple, independent translator clusters dotted around the Internet. This is the subject of a separate paper.
If a stage 1 box fails, traffic will simply be redistributed over the other stage 1 boxes (as soon as the network load balancer detects this) and there will be no impact.
If a stage 2 box responsible for a particular IPv4 range fails, then traffic for those users will be redistributed across the remaining IPv4 address space by the consistent hashing algorithm. This will keep the cluster balanced, but will interrupt any ongoing sessions for those users.
Alternatively, it would be possible to run servers in pairs: one server is primary for block A and backup for block B, and the other is primary for block B and backup for block A. While they run they keep their state tables in sync for both ranges, so that if one fails, the other can take over immediately. The OpenBSD “pfsync” mechanism provides an example of how this can be implemented.
This mechanism may make such failures less noticeable, at least at off-peak times when servers are below 50% capacity; it could also be useful for performing maintenance. In practice, unscheduled failures may be sufficiently rare for this not to be a problem.
For scheduled maintenance, all that is necessary is to have a few spare stage 2 hosts, and to be able to sync the to-be-maintained host's NAT64 state with a spare host, before failing over.
Since the traffic cannot be swung instantaneously, ideally the states should remain in sync bi-directionally while the IPv6 traffic (stage 1 to stage 2) and the external IPv4 traffic (Internet to stage 2) are rerouted.
When IPv4 pools change, some existing sessions will need to be interrupted. It would be helpful if the stage 2 translator could send a RST for existing TCP sessions when IPv4 pools are removed from it (and then remove its state entries), unless it has a failover partner.
If an end-user were able to send traffic to the stage 2 intermediate address prefix, they would be able to select an arbitrary IPv4 source address (and/or port) for their outgoing traffic.
Hence this should be blocked, for example by ACLs at the edge, or by making the stage 1 to stage 2 interconnect a completely separate routing domain. Using ULA addresses is also helpful for this reason.
Note that if a single AS contains multiple translator clusters, it would be wise for each cluster to use a distinct intermediate prefix (especially if a single iBGP mesh includes all translators)
Any NAT device is sensitive to DoS, particularly explosion of the state table, and the stage 2 NAT64 in this design is no different.
IPv6 allows the sender to choose any of 2^64 possible source addresses within a prefix. This is a fundamental feature of the current IPv6 addressing architecture.
So whilst it would be desirable to keep statistics on utilisation for each individual /128 address, if an attacker wants to hide her usage she can simply continue to pick random source addresses until the NAT is no longer able to keep track.
She can also respond to traffic to all those addresses, e.g. to complete a 3-way TCP handshake. The NAT therefore has no way to distinguish between genuine and spoofed traffic.
To protect itself, the NAT will need to limit state generation at the level of the /64 prefix, which means the attacker will be performing a DoS against other users on her own network. This can only be traced by the local network administrator, e.g. by looking at NDP tables.
Unfortunately, many service providers do not have ingress filters to prevent source address spoofing, and so the incoming source addresses arriving at the translator may be completely arbitrary.
The problems this can cause include:

- useless state entries, consuming space in the stage 2 state tables
- exhaustion of source ports on the translated IPv4 addresses, to the detriment of genuine users sharing them
To avoid the useless state entries and source port exhaustion, the stage 2 NAT may need to engage some mechanism similar to “SYN cookies” so that long-lived NAT state is not created until after a successful three-way TCP exchange.
UDP traffic cannot be protected in this way, as we have no way of knowing whether return UDP traffic was successfully delivered or not. More heuristic methods may be required.
It could be said that few devices would use UDP without any TCP at all; therefore the successful establishment of TCP from a given IPv6 address could whitelist that address for UDP as well. However if an attacker obtains or guesses a valid source IPv6 address then they can spoof traffic which is indistinguishable from genuine traffic from that address. It may therefore also be necessary to limit the rate of UDP state creation or the total number of UDP states per source.
Some devices (e.g. SIP phones) may use UDP exclusively - although SIP is unlikely to work well with NAT64 anyway. If we allow the successful establishment of TCP from anywhere in a /64 prefix to whitelist the whole prefix for UDP, this is unlikely to be a problem.
Even without address spoofing, a client can create a large number of TCP sockets and a large number of UDP sockets, and consume resources on the translator.
If a cap is set at the limit of the /64 prefix, then the user will be able to perform a DoS against other users in their own network.
If a cap is set at the limit of the /128 address then this can be avoided, however the attacker can easily circumvent this by choosing different source addresses as described above.
TCP sessions can hold state for an extended period of time, especially if the client or server vanishes, and may push utilisation towards the cap. Hence stale sessions must be pruned, at least in times of high demand.
(TODO: can the NAT64 inject TCP keepalives even if the endpoints themselves are not using it?)
If a client binds to one socket and sends to many destinations, we SHOULD use the same translated source port, so that STUN/ICE can work. However if there is much churn of client sockets, there could be much pressure on the available port space, and the translator may have to fall back to shared port use (symmetric NAT).
It is probably realistic to time out UDP translations after 30-60 seconds of inactivity. Clients have an expectation of having to refresh NAT state - although if they are on an IPv6-only network they may not realise that some of their traffic is going via a NAT64.
Ports could be re-used in LRU order, but this would make problems harder to debug - it is probably better to have a fixed UDP timeout.
A common source of UDP state is DNS. There is no good reason for anyone to use NAT64 to translate DNS queries. An IPv6-only user should be talking to a DNS(64) cache over IPv6, and that cache should be dual-stack. Anything else is misconfiguration.
Therefore it is perfectly reasonable to block UDP and TCP port 53 entirely at the translator - or to return a canned DNS response with the fixed IP address of a webserver which explains the problem, essentially a captive portal.
It is helpful for the NAT64 to work with ICMP echo request. This would mean that an end-user with a CLAT would be able to do “ping 8.8.8.8” and get a response - this means “The Internet Is Working [TM]”.
Such state can be very short-lived (of the order of 5 seconds), and the number of concurrent states from a given prefix can be limited, and/or the traffic heavily rate-limited.
A system tuned to handle a certain traffic volume under the assumption of an average packet size of (say) 512 bytes per packet may become overwhelmed given a stream of small packets of 64 bytes, as this will demand 8 times the processing.
There should be statistical monitoring of both traffic (bps and pps) and state generation from active prefixes (aggregated per /64, per /56, per /48, and per /32), and the ability to apply temporary blocks if required.
A static A+P deployment will explicitly tie each address/port combination back to one source, but this may not always be true for dynamic A+P.
An attacker with a /48 route from their ISP can choose whichever public IPv4 address from the pool they want (or at least, any one of 65,536 choices), simply by rotating through their 2^16 available prefixes. If the hash algorithm and IPv4 pool ranges are public they can even do this off-line. This could be used to make an attack appear like it is coming from multiple sources, when they are in fact the same source; it can divide a large volume of traffic into 2^16 smaller streams.
If the target queries the translator logs for each of those IPv4 addresses, they will find that all of them include mappings to prefixes within the same /48 range, and may be able to infer the true source of the attack.
Prefix selection can also be used to purposely make the attacker's traffic come from the same IPv4 address used by a different, trustworthy network. To some extent this is inherent to the concept of address sharing, but in this case the attacker is allowed to select their sharing partner.
If the translator cluster has less than a /16 of address space in total then the attacker will be able to find multiple prefixes which map to the same IPv4 address and consume an unfair share of dedicated ports on that address.
To address this issue, we could consider using only the first 48 bits of the source IPv6 address in the consistent hash algorithm. The problem this would cause is with a genuine large site holding a /48 block (say, a university): we do not want every single network in the university to map to the same IPv4 address, as this would concentrate excessive load on a single stage 2 server (traffic load, state, and demand on available source ports).
We could also consider using 56 bits, given that many ISPs are allocating /56's to end users. Such a user would have no choice over their translated IPv4 address, and a user with a /48 would only be able to choose between 256 of them. However a large network like a university might have to reorganise their prefixes to distribute load among those 256 available translated addresses.
This point remains open to discussion, but from a basic engineering point of view it is still preferable to use 64 bits of the prefix to give an even distribution of addresses for larger client networks.
In any case, users may easily obtain additional /48 blocks (e.g. from tunnel brokers) or even a /32 or more by joining an RIR. At worst there is always the option of blocking traffic from any ranges causing persistent abuse. If there is a need for them, RBL-style blacklists for IPv6 will spring up.
The port selection algorithm is designed to allow “busy” networks to make use of a large number of ports in a shared range.
An attacker can easily open multiple sockets bound to multiple addresses and create as many ports as they wish. This would be a denial-of-service against their own network.
If the particular IPv4 address has one or more legitimate, “busy” networks on it, then the attacker may end up using some port ranges which are shared with those networks. This would be intended to mislead an investigation.
However at worst it would only increase the number of leads which have to be followed - the information that the attacker was using a particular port range would not be lost, only that there are multiple possible users of that port range.
Even with a 16:1 sharing ratio, at worst 16 networks would be using the same extended port ranges; in practice the number is likely to be far lower.
In some cases a single /64 prefix may be supporting many more than 250 devices (e.g. a large hotel, or a conference wireless network). In this translator design, they will all be mapped to the same IPv4 address and so will be sharing a single port range, which will suffer severe pressure - as indeed happens today if the hotel built their network with NAT44 and a single public IPv4 address.
It would be better engineering if the hotel were to divide their network into subnets, which would spread the load across multiple IPv4 addresses, or even route a separate /64 to each room.
The mitigating factor here is that if the hotel has built a pure IPv6-only network, then at least connections to dual-stack destinations will continue to work just fine, even if IPv4 ports are exhausted in the translator, rather than suffering a total network collapse.
TODO: Use the proper NAT64 terminology throughout