# Ten64 Network Performance Benchmark
This is a brief sampling of the Ten64's performance under ideal conditions.
We are deliberately modest with the performance numbers we advertise, as there are some caveats today:
- The Linux network stack limits performance somewhat; we see XDP as the solution in the short term and hardware offload (AIOP) in the long term. For example, 3 Gbit/s single-flow performance is easily achieved, but further traffic flows may be inconsistent, as it depends on how the flows are distributed to each core (via DPIO portals) by flow hashing (for example, on the port numbers of each connection).
- In the default configuration we ship the units with (all Ethernet ports enabled), performance is limited because the LS1088 doesn't have enough resources (buffers, flow tables etc.) to give optimal resources to all ports - see the network configuration page for more information.

For these tests, we booted the system with the "eth0 only" DPL and added the two 10G ports dynamically via the `ls-addni` command (a sketch of this step is shown below) - this ensures they have the most optimal settings in terms of queues, flow tables and inbound packet hashing.
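As a rough sketch, the dynamic port setup might look like the following. The dpmac numbers are placeholders - the actual MAC/port mapping depends on the board's DPC/DPL and can be inspected with restool - so treat this as illustrative rather than exact commands.

```
# Placeholder dpmac indices - check the actual MAC/port mapping first,
# e.g. with "restool dprc show dprc.1"
ls-addni dpmac.1    # create a DPNI and connect it to the first 10G MAC
ls-addni dpmac.2    # ... and the second 10G MAC

# list the resulting network interfaces
ls-listni
```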
## iperf3 testing
These tests are done between a Linux client and a Windows 10 workstation.
The Linux client has an Intel X520-T2 (dual SFP+) card, while the Windows machine has an X540-T2 (dual 10GBase-T). A Mikrotik S+RJ10 10GBase-T SFP+ module is used to provide a 10GBase-T connection on the Ten64.
### Bridge mode
Command: `iperf3 -P (num threads) -R -c (server) -t 60`
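For reference, a minimal bridge setup for this test could look like the sketch below. The interface names eth1 and eth2 are assumptions for the two dynamically added 10G ports and may differ on your system.

```
# Assumed names for the two 10G interfaces created by ls-addni
ip link add name br0 type bridge
ip link set eth1 master br0
ip link set eth2 master br0
ip link set eth1 up
ip link set eth2 up
ip link set br0 up
```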
| Number of threads (-P) | Thread 0 | Thread 1 | Thread 2 | Thread 3 | Thread 4 | Total (Gbit/s) |
|---|---|---|---|---|---|---|
| 1 | 3.08 | | | | | 3.08 |
| 2 | 3 | 2.93 | | | | 5.93 |
| 3 | 3.04 | 2.88 | 2.56 | | | 8.48 |
| 4 | 1.87 | 2.69 | 1.75 | 2.73 | | 9.04 |
| 5 | 1.91 | 2.03 | 1.84 | 1.68 | 1.84 | 9.3 |
As discussed above, per-flow performance varies more as flows are added - this can be due to different packet flows being processed on the same core, while the fastest flow gets a core to itself.
### Routed mode
Note: this test is done with static routing rules (no NAT); a sketch of the routing setup is shown below.
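The routing setup here is a sketch only; the interface names and subnets are made up for illustration and are not the addresses used in the actual test.

```
# On the Ten64: one subnet per 10G interface, forwarding enabled
ip addr add 10.0.1.1/24 dev eth1
ip addr add 10.0.2.1/24 dev eth2
sysctl -w net.ipv4.ip_forward=1

# On the client and server, add a static route towards the other subnet, e.g.
#   client: ip route add 10.0.2.0/24 via 10.0.1.1
#   server: ip route add 10.0.1.0/24 via 10.0.2.1
```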
| Number of threads (-P) | Thread 0 | Thread 1 | Thread 2 | Thread 3 | Thread 4 | Total (Gbit/s) |
|---|---|---|---|---|---|---|
| 1 | 2.91 | | | | | 2.91 |
| 2 | 2.81 | 2.6 | | | | 5.41 |
| 3 | 2.64 | 2.52 | 2.15 | | | 7.3 |
| 4 | 2.33 | 2.36 | 1.97 | 2.39 | | 8.3 |
| 5 | 2.17 | 2.07 | 2.07 | 1.46 | 1.37 | 9.3 |
## DPDK
DPDK is one way to extract higher performance from the LS1088: it bypasses the kernel and does all packet operations in userspace, using a poll-mode driver (PMD) rather than the kernel's traditional interrupt-driven mechanism.
Note that these tests are done with a single core only.
For this test, we simply launch DPDK's `testpmd`, which establishes a Layer 2 bridge across a pair of ports.
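A rough example invocation is sketched below. The DPRC container name, core numbers and binary name (`testpmd` vs `dpdk-testpmd`, depending on DPDK version) are assumptions and will vary with how the DPAA2 resources were carved out on your system.

```
# Assumed resource container holding the two 10G ports for the DPAA2 PMD
export DPRC=dprc.2

# One forwarding core, "io" mode: frames received on one port are
# transmitted out the other, acting like a simple Layer 2 bridge
dpdk-testpmd -l 0-1 -n 1 -- -i --nb-cores=1 --forward-mode=io

# then, at the testpmd> prompt:
#   start
```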
| Number of threads (-P) | Thread 0 | Thread 1 | Thread 2 | Thread 3 | Thread 4 | Total (Gbit/s) | Improvement (relative to Linux bridge) |
|---|---|---|---|---|---|---|---|
| 1 | 5.02 | | | | | 5.02 | 1.63 |
| 2 | 3.99 | 3.93 | | | | 7.92 | 1.34 |
| 3 | 3.09 | 2.96 | 3.16 | | | 9.21 | 1.09 |
| 4 | 2.56 | 2.26 | 2.44 | 2.17 | | 9.43 | 1.04 |
| 5 | 2.62 | 1.73 | 1.69 | 1.76 | 1.69 | 9.49 | 1.02 |
As we can see, even with just one core doing the work, DPDK improves single-thread performance to 5 Gbit/s, a 1.6x improvement, and it still holds an advantage as more flows are added.
The disadvantage of DPDK is that you need to port your entire network stack to DPDK, which runs in userspace. Some options exist, such as DANOS and TNSR, as well as Open vSwitch with DPDK (OVS-DPDK) - but deploying and validating these solutions requires considerably more time and effort than a traditional kernel network stack.