This project was performed at Intel in support of helping Telcos transition from older, highly specialized network communications equipment to a more computer-centric internet communications network.

Although the focus here is network IP packet processing, the same methodology can be used to support other high throughput, high performance processing requirements, such as realtime signal processing.

Note: The performance numbers and characteristics are from my memory from some time back. I am trying to be accurate, but I no longer work at Intel and do not have the data to verify accuracy, nor does this represent operation for every possible system setup. I believe the results are repeatable with a correct system and realtime VNF setup. However, all comments are my opinion and not representative of anyone else. The goals of this work were external to Intel, in support of the top level expert Telco developers using IA servers, who need to accurately understand the actual performance and characteristics. This also contains some general real world discussions, both positive and negative, that typically occur during such projects, to give a more rounded picture rather than just boring facts and figures.

VNF Realtime IP Flow Logger

The realtime flow logger is a high speed realtime NFV (Network Function Virtualization) packet processing program, sometimes also called a VNF (Virtual Network Function), that logs communication flows through an Ethernet interface by putting a logging task in the middle of a high speed 10 GbE (Gigabit Ethernet) communications path.

Goals

  • Demonstrate a baseline high throughput realtime packet processing NFV, with the basic needs covered, that can be used as a starting point for a more complex NFV program. It has both realtime and non-realtime component execution, as well as a control and status TCP/IP interface that does not interfere with the realtime component's performance.
  • Provide information for the Telco expert on how well the IA server handles a large number of realtime NFV programs, run either as VMs (Virtual Machines) and/or as host tasks, and what the general performance impacts are on one another.
  • Provide information for the Telco expert on how well the IA server handles a large number of 10 GbE interfaces used by high throughput realtime NFV programs, up to 16 10 GbE interfaces connected to a single Xeon CPU, and 24 total in a system for this example case.

Overview of Realtime Flow Logger

Figure 1: Single Flow Logger Realtime Packet Processing NFV shows the NFV inserted into a 10 GbE path either by changing the physical cabling or by some form of Ethernet switching or routing. The Windows TCP/IP client display program runs on a PC (could be a laptop) and connects to the TCP/IP server (the server supports multiple clients) over an Ethernet network and requests current flow information for display, but it does not need to be connected for the flow logger realtime task to run. The client was set to request flow information every 1 second, which worked very well but was an arbitrary choice. The TCP/IP server was written in a way that it could step through the flow table and copy flow data without affecting the realtime performance.

Flow Logger

Figure 1: Single Flow Logger Realtime Packet Processing NFV

Each flow block has current packet and byte counts along with a time difference (ms) from the start of the flow. There is also one global flow block per flow block set that has overall interface packet and byte counts and some additional time information. Together these gave high accuracy flow information for display of the flow data and its derived data on the monitor. The base data rates achieved 5 digits of throughput accuracy at the 1 second logging rate. The individual flows were displayed in a table, but only a small scrollable table could be shown on the monitor.

The primary function of the flow logger was not logging flows but testing the processing performance of an Intel IA Xeon server doing bi-directional high throughput realtime IP network packet processing for an NFV type application. Although a system could be set up to run only one NFV, it is desirable to run as many NFVs as there are processing resources for, hopefully without any devastating performance impact on each other. An NFV could have one or more hyper-threaded and/or full CPU cores doing realtime processing. Hyper-threaded systems can still get full rate core performance by leaving one core of a hyper-threaded core pair idle, allowing flexibility in runtime operation when the IA server is configured for hyper-threaded operation.

This program came about from wanting more realistic processing loads for performance testing than simple realtime packet pass-through programs, which just get packets from each side and move them to the other side. I extensively used pass-through NFVs (generally in a VM) for earlier high speed vSwitch (Virtual Switch) optimization and testing efforts. The simple tasks did help keep the focus on the input and output communications throughput of the realtime vSwitch and the throughput of the realtime programs, without the possibility of the CPU cores being bogged down by intense packet data processing. Some of the test NFVs did execute code that helped measure both the input and output communications overhead on the same NUMA node and when moving packets across NUMA nodes, to determine the separate input and output CPU load cost of the interface. The interface used for the vSwitch test was DPDK Queues, compiled with the input and output variables on separate cache lines (a big improvement when multiple cores, such as simultaneous input flows from two 10 GbE interfaces processed by separate CPU cores, are writing into a single DPDK queue at high speed). The DPDK Queues and packet buffers were both in globally shared memory (IVSHMEM), and packets were passed between NFVs by just writing the packet pointers into the DPDK Queues.
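
To illustrate the cache line point, here is a minimal single-producer/single-consumer queue sketch in plain C (an assumption-laden simplification, not the actual DPDK ring code used in the project): keeping the producer and consumer indices on separate cache lines means the core writing packets in and the core taking packets out never write to the same cache line, avoiding cache line ping-pong at high packet rates.

    /* Minimal SPSC ring sketch; illustrative only, not the DPDK rte_ring implementation. */
    #include <stdint.h>

    #define CACHE_LINE_SIZE 64
    #define RING_SIZE       1024                 /* must be a power of two */
    #define RING_MASK       (RING_SIZE - 1)

    struct pkt_ring {
        /* producer (enqueue) index on its own cache line */
        volatile uint32_t prod __attribute__((aligned(CACHE_LINE_SIZE)));
        /* consumer (dequeue) index on a different cache line */
        volatile uint32_t cons __attribute__((aligned(CACHE_LINE_SIZE)));
        /* packet pointers; the packet buffers themselves live in shared hugepage memory */
        void *slots[RING_SIZE] __attribute__((aligned(CACHE_LINE_SIZE)));
    };

    /* single producer enqueue: returns 0 on success, -1 if the ring is full */
    static inline int ring_enqueue(struct pkt_ring *r, void *pkt)
    {
        uint32_t head = r->prod;
        if (head - r->cons >= RING_SIZE)
            return -1;                            /* full */
        r->slots[head & RING_MASK] = pkt;
        __sync_synchronize();                     /* publish the slot before the index */
        r->prod = head + 1;
        return 0;
    }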

Realtime Flow Logger as an IA System Performance Demonstration

Running a large number of high speed realtime NFVs in a single IA server makes a good demonstration of server performance. Figure 2: Full Server Running 12 Realtime Flow Loggers shows some arbitrary equipment, which could be any group of computers, a data center, a local area network, or even Telco communications equipment, communicating through a group of 10 Gigabit Ethernet interfaces. It does not matter what the destination IP or Ethernet address is, since the ports are operating in Promiscuous Mode, allowing all of the data to be received and sent without change. In this example, 12 two-port 10 GbE NFVs are logging the bi-directional data of 12 communications connections. One laptop is running 12 Flow Log Display programs and displaying information about the data going through the interfaces. A 50" 4K TV, with the same visual area, pixel count, and pixel size as four 25" 1920 x 1080 monitors, is used for display. (I use a 49" 4K TV on my development system. I wonder why computer retailers are trying to sell 27" size range 4K monitors and not larger, like a 50" range size, for computer use. I see little value in the small pixel screens, but I often see the world a little differently than many people.)

Flow Logger Demo

Figure 2: Full Server Running 12 Realtime Flow Loggers

This particular configuration was never used, but I did have 6 log display programs running on my laptop at the same time over the corporate network without any communication rate issues. For the performance testing, a Spirent network test system was used, with all of the 10 GbE interfaces daisy chained for each test. The tests started with 1 NFV, and an NFV was added for each test until all the interfaces were used. The tests ran the NFVs as VMs, as host NFV tasks, using pairs of full (non-hyperthreaded) cores, and using hyperthreaded core pairs where the pair comes from the same physical core, for a total of 4 data sets. Of course the NFVs were run with 1 Gigabyte hugepages for both the host and the VMs, and with all the management that allows proper realtime operation, but I am not going to cover correct realtime operation details in this document.

The base server was a 2 socket Supermicro server with 2 Haswell 14 core CPUs and an 8 Gigabyte memory module in each of the 8 total memory channels. Due to the system configuration, there were 2 Fortville Ethernet cards connected to NUMA node 0 and 4 Fortvilles connected to NUMA node 1, with the latest i40e (Fortville) Ethernet device drivers installed in the host and VM Linux kernels. A better configuration would be 3 Fortvilles per NUMA node, but the server slot hardware was not set up that way; it did, however, allow testing of 16 10 GbE ports on a single NUMA node. Each Fortville was a 4 port 10 GbE card with the latest firmware installed and used SFP+ optical interconnects. (Note: The latest Fortville firmware and device drivers should be mainline by now.) The system ran Linux CentOS 7, with the boot line reserving the 1 Gigabyte pages and the CPU cores used for the realtime processing. Most servers and Linux OSes with recent kernels should work equally well, with some exceptions. Load balancers and other programs that do not follow the mainline Linux CPU and memory management rules can interfere with the performance, so some care must be exercised not to add, or to eliminate, these types of tools or programs.

Cross NUMA node communications need to be carefully controlled or the system easily thrashes, and thrashing is detrimental to realtime CPU performance. I have observed NFV cross NUMA node thrashing start at around a 50% packet rate (compared to not going across a NUMA node) and instantly drop down to somewhere around a 25% data rate (the effect of thrashing) and not budge, which probably has system wide ramifications on the socket to socket communications bus. Ideally the realtime packet processing NFVs should be run on a single socket server with 12 to 16 10 GbE ports. Port interfaces that take more than 1 optical pair are very expensive and have limited advantage, since speeds higher than 10 GbE may be faster than a core can effectively handle with small packets.

I suspect that, since thrashing is so detrimental to performance, dual socket systems have lost their advantage now that CPUs have lots of cores, where in the past dual processors with small core counts worked well. There is a good chance that a new high core count dual socket system running a moderate load of general processing will easily start thrashing from inter-processor communications, and that a single socket system with the same total memory will not thrash from socket to socket communications and might actually outperform the dual socket system. But I have not tested this theory.

The 10 GbE rate is a good match for current day IA core performance, even with a lot of small packets, based on the packet switching/handling rate, with additional processing spread across other cores using DPDK Queues for core to core communications. The upcoming single optical pair 25 Gbit Ethernet interfaces, with 2 ports per 8 lane PCIe slot, will probably be OK for connections with an average packet size of 256 bytes or larger, but would be a poor choice if there are a large number of small packets. SRIOV operation can spread communications across multiple cores; however, SRIOV operation would be very undesirable for Telco packet processing NFVs that require Promiscuous Mode, due to packet duplication across the multiple SRIOV interfaces.
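
As a rough illustration, using the same 20 bytes per frame of Ethernet preamble, start delimiter, and inter-frame gap overhead as the rate figures later in this document:

    25 GbE, 64 byte packets:   25e9 / ((64 + 20) * 8)  = 37.2 Mpps per direction
    25 GbE, 256 byte packets:  25e9 / ((256 + 20) * 8) = 11.3 Mpps per direction
    10 GbE, 64 byte packets:   10e9 / ((64 + 20) * 8)  = 14.9 Mpps per direction

A 25 GbE port full of small packets is therefore well beyond the roughly 15 to 29 MPPS per core rates seen later in this document, while around 11 Mpps at a 256 byte average is in a much more comfortable range.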

Realtime Flow Logger Task

Any NFV task is going to have a minimum amount of work to associate a packet with a flow. The flow logger is based on this minimum amount of work needed by most any NFV that needs to match a packet to a flow. The simplified flow logger task shown in Figure 3: Simplified Flow Logger Task only shows the main operations, leaving out initialization and some of the minor details, so the diagram shows a close but not exact processing pattern.

Flow Logger Task

Figure 3: Simplified Flow Logger Task

The flow logger packet processing begins by getting a list of packets from the input interface or queue. For each input packet, the 5 tuple communications data for the Source IP Address, Source IP Port, Destination IP Address, Destination IP Port, and packet type (TCP, UDP) is put into a packed structure. A hash is created by initializing a sum with an arbitrary but constant value and summing the 16 bit words of the structure. The hash is masked to match the 2^n hash table size, with a maximum of n = 16, where the size is set at initialization time. Starting at the hash table position given by the masked hash, the linked list of flow structures is searched for the target flow, comparing with the full hash value first, then comparing the full 5 tuple structure. If the target flow is not found, a new flow structure is obtained from the flow free list, initialized, and added to the flow hash table linked list. The flow structure is then updated with the packet count, byte count, and current time tag. The global packet count, global byte count, and global last time stamp are also updated, and processing continues to the next packet.
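
A minimal sketch of that lookup, in plain C with illustrative names and field layout rather than the original source, looks roughly like this:

    /* Hedged sketch of the flow lookup described above; the names, seed value,
     * and field layout are illustrative, not the original code. */
    #include <stdint.h>
    #include <string.h>

    struct five_tuple {              /* packed so the 16 bit word sum is well defined */
        uint32_t src_ip;
        uint32_t dst_ip;
        uint16_t src_port;
        uint16_t dst_port;
        uint16_t proto;              /* TCP or UDP */
    } __attribute__((packed));

    struct flow {
        struct flow      *next;      /* hash bucket linked list */
        uint32_t          hash;      /* full (unmasked) hash for a cheap first compare */
        struct five_tuple key;
        uint64_t          pkt_count;
        uint64_t          byte_count;
        uint64_t          last_ms;   /* time difference (ms) from the start of the flow */
    };

    #define HASH_BITS 16             /* table size is 2^n, n <= 16; fixed here, set at init in the real code */
    #define HASH_MASK ((1u << HASH_BITS) - 1)
    static struct flow *flow_table[1u << HASH_BITS];

    /* initialize a sum with an arbitrary but constant value, then sum the 16 bit words */
    static uint32_t tuple_hash(const struct five_tuple *t)
    {
        const uint16_t *w = (const uint16_t *)t;
        uint32_t sum = 0x9e37u;      /* illustrative seed */
        for (unsigned i = 0; i < sizeof(*t) / 2; i++)
            sum += w[i];
        return sum;
    }

    static struct flow *flow_lookup(const struct five_tuple *key)
    {
        uint32_t hash = tuple_hash(key);
        struct flow *f = flow_table[hash & HASH_MASK];
        for (; f != NULL; f = f->next)
            if (f->hash == hash && memcmp(&f->key, key, sizeof(*key)) == 0)
                return f;            /* existing flow */
        return NULL;                 /* caller pulls a new flow from the free list and links it in */
    }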

Once the packet list is done, it is put into the output interface DPDK queue of the other interface, or just into the output DPDK Queue if a physical interface is not being used. A vSwitch is just a little more complex: it generally includes more packet variables in its tuple, has an output destination port in its flow structure which is set at flow structure initialization time, and sorts the output packet list and puts the packets into their associated output destination interface DPDK queues or DPDK output queues. Back to the flow logger: if the output is a DPDK Queue, the output is complete since the Queue is the destination. For the 10 GbE interface, a list of packets is de-queued from the output DPDK queue for the associated output interface and moved to the physical output interface, possibly delaying output for a couple of I/O loop cycles as discussed next.

I have gone around several times with various people about correct packet processing at this stage. A good method was determined from the vSwitch code optimization efforts, though at the time I didn't know exactly what the TX problem (covered later) was or why the method worked. People tend to copy the DPDK L3 processing example, which has a delay to wait for packets to accumulate for output when fewer than 32 packets are available. Earlier examples would not output until 32 packets were available, but this often caused failures of RFC 2544 zero loss tests at any packet rate, due to packet loss at the end of the test, so it was changed. This delay causes an unstable and substantially large maximum latency, often upwards of 75 us (depending on the delay setting), for almost all data rates. A simple, efficient algorithm: if there is a very small number of packets to output, say fewer than 4, wait 2 or 3 loops for any more packets to accumulate and then output the packets (a minimal sketch of this heuristic follows the latency list below). For all except the highest throughput processing speeds, where the latency increases substantially, the maximum Fortville 10 GbE interface latency stayed solidly below 10 us for medium size packets, with a small increase for larger packets. This Ethernet communications latency time includes:

  • Ethernet serialized bit transfer time
  • Ethernet physical layer 1 hardware receive management overhead
  • NIC interface controller RX management processing
  • Getting PCIe bus ownership wait time
  • PCIe transfer time writing the packet to the system's cache/memory
  • RX descriptor write for the packet into system memory, by the controller doing a PCIe write
  • Realtime Task processing, including TX descriptor update and NIC register write
  • TX descriptor read for the packet from system memory, by the controller doing a PCIe read
  • Get PCIe bus ownership wait time
  • PCIe transfer time reading the packet data from the system's cache/memory
  • NIC interface controller TX management processing
  • Ethernet physical layer 1 transmit slot and preamble start wait
  • Ethernet serialized bit transmit time
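
Here is a minimal sketch of the small-batch heuristic mentioned above (the thresholds, names, and the 256 entry holding buffer are illustrative; tx_burst() stands in for the driver's burst transmit call, which does a single NIC tail register write per call):

    /* Hedged sketch of the small-batch TX heuristic; not the original code. */
    struct pkt;                                      /* opaque packet handle */
    void tx_burst(struct pkt **pkts, unsigned n);    /* send n packets, one NIC tail write */

    #define TX_TINY_BURST 4      /* "very small number of packets" */
    #define TX_MAX_HOLD   3      /* hold for at most this many I/O loops */

    static struct pkt *held[256];    /* packets accumulated but not yet sent */
    static unsigned    held_cnt;
    static unsigned    held_loops;   /* loops the current packets have been held */

    static void tx_maybe_flush(struct pkt **newpkts, unsigned nb_new)
    {
        for (unsigned i = 0; i < nb_new && held_cnt < 256; i++)
            held[held_cnt++] = newpkts[i];

        if (held_cnt == 0)
            return;                                   /* nothing to send */

        if (held_cnt < TX_TINY_BURST && held_loops < TX_MAX_HOLD) {
            held_loops++;                             /* let a few more packets accumulate */
            return;
        }

        tx_burst(held, held_cnt);                     /* one burst, one tail register write */
        held_cnt   = 0;
        held_loops = 0;
    }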

Other NIC controller work occurs but might not affect the latency, and some items are left out, such as the architecture specific internal L2 switch needed for SRIOV, which is always in the path even though it is not exploited by this configuration. The latency time is reported by a Spirent Test System, which uses hardware in its interface to put a time stamp into the packet data portion as it is being transmitted and then, using hardware, reads the packet time stamp when receiving the packet, for a high accuracy packet latency measurement. There are several methods of calculating packet latency that give different results, but I will not cover that here. The minimum latencies were just below 5 us, which is the combined total cost and overhead from 10 GbE RX through TX operation, including being processed by the CPU flow logging task.

The simple task of the Flow Logger is used to remove flows that time out from not having received a packet for some time, and it is only executed if 2 or fewer packets were received in the loop or the simple task has not executed for several input/output loops. This essentially checks for flows that have finished communicating by not having any packets transferred for a long time period, usually on the order of 30 to 45 seconds. These flows are unlinked from the flow list; however, if the TCP/IP task is accessing the flow list, the next pointer remains unchanged to allow the TCP/IP task to continue down its linked list, and the flow structure is put on a waiting list to be freed when the TCP/IP task is not accessing the flow list. Later, when the TCP/IP task is not accessing the flow list, the flow structure is moved to the flow free list. When a flow times out and no TCP/IP access activity is occurring, the flow structure is moved directly to the flow free list.
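
A hedged sketch of that deferred freeing (the flag, the fixed size waiting list, and the trimmed down flow structure are all illustrative stand-ins for the real code):

    /* Hedged sketch of timeout removal with a TCP/IP reader active; illustrative only. */
    #include <stdbool.h>
    #include <stddef.h>

    struct flow {
        struct flow *next;            /* hash bucket linked list */
        /* ...5 tuple key, hash, packet/byte counts, last time tag... */
    };

    extern volatile bool tcpip_reading;   /* true while the TCP/IP task walks the flow lists */

    #define MAX_PENDING 1024
    static struct flow *pending_free[MAX_PENDING];  /* unlinked, waiting for the reader to finish */
    static unsigned     n_pending;
    static struct flow *free_list;

    static void retire_flow(struct flow **bucket, struct flow *prev, struct flow *f)
    {
        /* unlink f from its hash bucket, but leave f->next untouched so a TCP/IP
         * reader already positioned on f can keep following its linked list */
        if (prev != NULL)
            prev->next = f->next;
        else
            *bucket = f->next;

        if (!tcpip_reading) {
            f->next   = free_list;                  /* no reader active: free immediately */
            free_list = f;
        } else if (n_pending < MAX_PENDING) {
            pending_free[n_pending++] = f;          /* park it until the reader is done */
        }
        /* (a full waiting list is not handled in this sketch) */
    }

    /* called from the simple task when the TCP/IP task is not accessing the flows */
    static void drain_pending(void)
    {
        while (n_pending > 0) {
            struct flow *f = pending_free[--n_pending];
            f->next   = free_list;
            free_list = f;
        }
    }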

A killer of vSwitch performance is the vSwitch copying the whole output packet from its current buffer to an output destination buffer, although how much this matters depends on the maximum packet rate and minimum latency that need to be achieved. Any additional packet modification work, such as GRE routing encapsulation or de-encapsulation, is processing intensive work that can greatly affect vSwitch maximum throughput and should probably be offloaded to other cores, unless the vSwitch performance is still acceptable at the lower packet rates. Processing is not free, in both CPU processing load and power usage. For production Telco communications, a minimalistic approach that meets the needs should always be a high priority.

For Telco packet processing, where communications equipment cost and power usage need to be controlled, the processing throughput needs to be maximized to reduce the dollar cost per packet. From this I believe the simplified statement that for Telcos, "Performance is Everything", and without performance you have nothing of interest. Of course accuracy (no errors) is just a plain requirement.

The original flow logger code was carefully written for performance from the start. An optimization carried out early was to process 2 packets at the same time: when accessing the hash table or flow structure, a cache level 0 prefetch was issued and processing switched to the other packet, going back and forth at each prefetch.
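
A much simplified sketch of the idea follows; bucket_head_of() and find_and_update() are hypothetical helpers (the real code alternated between the two packets at each prefetch point, for both the hash bucket and the flow structure), and rte_prefetch0() is the DPDK level 0 prefetch hint:

    /* Hedged sketch of processing two packets in flight with prefetch; illustrative only. */
    #include <rte_prefetch.h>

    struct pkt;
    struct flow;
    struct flow *bucket_head_of(const struct pkt *p);                      /* hash the 5 tuple, return bucket head */
    void         find_and_update(struct flow *head, const struct pkt *p);  /* walk the list, update counters */

    static void process_pair_at_a_time(struct pkt *pkts[], unsigned n)
    {
        unsigned i;
        for (i = 0; i + 1 < n; i += 2) {
            struct flow *fa = bucket_head_of(pkts[i]);       /* bucket for packet A */
            struct flow *fb = bucket_head_of(pkts[i + 1]);   /* bucket for packet B */

            rte_prefetch0(fa);             /* start pulling A's data toward the L1 cache */
            rte_prefetch0(fb);             /* ...and B's, then go do useful work on A */

            find_and_update(fa, pkts[i]);      /* A's cache line has had time to arrive */
            find_and_update(fb, pkts[i + 1]);  /* B's line had even longer */
        }
        if (i < n)                             /* odd trailing packet, no partner */
            find_and_update(bucket_head_of(pkts[i]), pkts[i]);
    }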

For top performing code, where data hits in the cache hierarchy (or misses it entirely) can substantially impact processing performance and needs to be carefully considered. An access that misses the caches and has to go to memory stalls the CPU until the memory access completes, which can be on the order of 100 ns, roughly 300 instruction cycles for a 3.0 Gigahertz CPU. Since the memory is shared, other memory traffic such as cache write-backs and other memory reads can make the stalls much longer.

IPv4 and IPv6

The flow logger was programmed to only do IPv4, which supports the highest packet rates, since the minimum size IPv6 packet is larger than the minimum size IPv4 packet and the bit rate of an interface is constant. Handling the larger IPv6 5 tuple would take more CPU clock cycles, but the maximum packet rate for IPv6 is also lower. The flow logger could be changed to support IPv4 and IPv6, mixed and at the same time, which would have a small additional overhead and a modestly lower maximum packet throughput rate.
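
As a rough illustration, assuming minimum size UDP test frames and the usual 20 bytes per frame of preamble, start delimiter, and inter-frame gap:

    IPv4 minimum frame:     64 bytes                    -> 10e9 / ((64 + 20) * 8) = 14.881 Mpps per direction
    IPv6/UDP minimum frame: 14 + 40 + 8 + 4 = 66 bytes  -> 10e9 / ((66 + 20) * 8) = 14.535 Mpps per direction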

Initial Single Task Performance

The flow logger NFV was running at upwards of 15 MPPS (Million Packets Per Second) when running as an NFV on a single non-hyperthreaded realtime core, receiving and sending data through a single bidirectional DPDK queue interface. In this case, the vSwitch was switching packets from the inputs of two 10 GbE NICs (for bidirectional flow) into the NFV RX DPDK Queue, and the vSwitch picked up the packets from the NFV TX DPDK Queue and switched them to the outputs of the two 10 GbE NICs.

In a meeting with AT&T and Spirent, I mentioned the test logging NFV I was using for testing, and the AT&T individual hinted that it was something that would be desirable in their NFV test bed. Afterwards, to support the possible use, the NFV was reconfigured to use 2 Fortville 10 GbE ports instead of the DPDK Queue, and I ran a test. I found I was getting somewhere around a 22.5 MPPS RFC 2544 zero loss rate. This was lower than my vSwitch rate, but the flow logger was also updating the per flow statistics data in the flow structures, located in a separate cache line for each flow, which was not being done in the vSwitch. It is usually more costly to write into a cache line, especially a partial cache line write, than to read cache line data. Some of the loss in the measured zero loss rate could also possibly be due to quickly creating and adding flow table entries on the fly. This was not included in the vSwitch tests, since it took a long time to establish flow entries, so the vSwitch flow timeout was set large enough to bridge the Spirent test-to-test times, and an extra test was put at the front of the run just to establish the flow entries before any of the full high speed tests were performed.

Data movement in and out of the 10 GbE device driver interfaces uses considerably more processing power than in and out of a DPDK Queue. Using a DPDK Queue would have put the dual core rate up over 30 MPPS, which is above the maximum 10 GbE bi-directional rate of 29.762 MPPS for 64 byte packets. I had carefully written the NFV, and although I had done the simultaneous 2 packet optimization mentioned earlier just in the course of programming, I had not gone through the optimization rounds of measure, change, and measure again that were standard realtime Telco practice for trying to improve realtime performance, a technique that had been used on the vSwitch. I was used to the higher speed vSwitch numbers, didn't like the measly 22.5 MPPS rate, and wanted to approach AT&T with a little better rate to discuss.
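
For reference, that 29.762 MPPS figure falls straight out of the Ethernet framing overhead (8 bytes of preamble and start delimiter plus a 12 byte minimum inter-frame gap per 64 byte frame):

    per direction:  10e9 bits/s / ((64 + 20) bytes * 8 bits/byte) = 10e9 / 672 = 14.881 Mpps
    bidirectional:  2 * 14.881 Mpps = 29.762 MPPS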

The device driver was using a considerable portion of the clock cycle count, and improvements there could apply to more than just the test NFV. I decided to try the iterative realtime optimization method on the i40e 10 GbE Fortville device driver.

Device Driver Optimization

I began optimization tests on the RX side of the Fortville i40e DPDK device driver. Nothing I tried changed the processing throughput rate, either up or down. A dozen or so changes were tried with no progress. This was very discouraging.

I finally moved to the TX side and started making progress. Often a change would not move the RFC 2544 zero loss measurement rate, but sometimes the average latency changed or improved. Many times it took 2 code changes to make a 2544 zero loss measurement rate improvement (a test data rate granularity factor from the test equipment may have influenced this effect, but the latency results gave a hint).

What was going on, that the RX rate did not change with a change in the RX device driver code? The device driver receive side only does a PCIe device register write when at least 32 RX interface descriptors are empty; the code loads the packets from the local buffer cache and does a PCIe register write to advance the device's pointer to the new empty buffer position. This occurs at most once for every 32 packets processed. However, at certain processing rates the task is picking up one packet at a time, processing it, and sending it to the 10 GbE TX interface. My earlier optimization tests and improvements changed the output algorithm to only output after collecting several packets or after waiting several processing rounds before moving packets to the TX device driver (the optimization tests had shown that this improved throughput, but I did not know why it worked). The TX processing sequence starts with the TX descriptors getting cleared for the completed packets, then the new packets get loaded into the TX descriptor table, and a PCIe write to a NIC register tells the NIC where the new end of the packet table is. However, that PCIe register write is only started; the PCIe register write time is quite large, and the actual access to the NIC has to compete for the PCIe bus with the packets being transferred to and from the NIC and with the NIC updating the RX and TX descriptor tables, so the NIC register write can only complete when it gets use of the PCIe bus. During this wait for completion, the next group of packets is being processed. If the TX processing completes and goes to write the new TX descriptor table position before the previous NIC register write has completed, the processor has to stall and wait for the previous write to complete before starting the new register write, since only 1 write to the same register location can be pending. While the processor is stalled it does not continue processing, so processing clock cycles are lost and throughput performance drops. The effect of the NIC TX register write stalling was controlling the packet throughput of the process, so changing the performance of the RX side didn't change the throughput.

Attempting to write to the TX side for every RX group of packets received and processed gets poor performance because of the TX end position NIC register write stalls, which has resulted in other methods, such as a long delay waiting for 32 packets to collect before being output. The DPDK L3 sample code uses this method, which results in large maximum latencies (around 70 us at the time I was doing these tests) for all packet rates. This poor solution to the NIC write stall has spread through many code samples, including the OVDK source code (I had changed it several times, but other developers kept changing it back).

This misunderstanding still persists in the DPDK documentation as I am writing this, for example in the PMD Driver section of the document Writing Efficient Code.

After this NIC write stall was accounted for, along with some TX improvements (which included changing the code to do only 1 NIC register write even when more than 32 packets were being output), I was able to make optimization progress on the RX side. The number of optimization combinations quickly became large and unmanageable in the git repository, so I started copying the Spirent test results and code changes together into a Word document (easy to copy back into the source code). Overall I was probably getting about 6 tests a day, where a code change took maybe 15 minutes, then build, realtime task start, and Spirent test start. I checked back later and stopped the Spirent when it was past the 64 byte test, saved and recorded the results and code changes, and went on to another improvement attempt.

Overall I was able to get the packet rate from around 22.5 MPPS to just breaking 27 MPPS. However, one day shortly after the end of the optimization cycle, the data rate was suddenly up to just over 29 MPPS, at 97 percent of full bandwidth. I was often updating the DPDK package, and someone had been writing new optional i40e device driver subroutines that used vector instructions, and a bug accidentally caused those driver TX and RX subroutines to be selected (the next DPDK update had them deselected). The 29 MPPS beat my 27 MPPS; I was happy with the 29 MPPS rate and didn't think I could squeeze any more performance out of my version, so I changed my work focus to the next stage and put the i40e optimization to the side.

Open Source Optimization Code Issue

Optimization is a tough issue for Open Source, since it is not actually changing the basic algorithm, so people have trouble getting past "why change it if it works". Open Source code management people want easy to follow and understand patches, which means simple patches. Often the whole processing structure of the code base needs to be changed before effective optimization can be applied and performance gains can be seen (that is what happened when I was working on the OVDK vSwitch, which is why I had my own version in the end). This means that many of the small, easy to understand patches need to be applied before any performance gains show up, and some of these changes actually lower the throughput performance until the additional patches are applied.

After figuring out the optimized code for the i40e driver, I had to totally rewrite the code changes as a series of simple, easy to understand patches, which took much longer than writing the code in the first place. Besides that, it took on average a day to prepare each simple patch: retest the patch to make sure it executed as expected, run it through a code review (required), run it through a legal code check against other open source code to verify that it was probably not copied from somewhere, and submit the patch. Very, very costly just to get a single simple patch submitted.

I was trying to get the patches through submission. My best patch got over a 1,000,000 packet per second rate improvement, but that is only (1.0 + 22.5)/22.5 = 1.044, about a 4% improvement. However, the OVDK Open Source manager was testing it using the DPDK L3 sample code (the test series was part of a nightly build and test), which didn't see any improvement. The DPDK L3 code is simplified code with its own characteristics, and being smaller, might have executed totally out of cache level 0, which affected my measured improvement. It was also already running at the maximum rate, which means you cannot improve the performance. So I had to spend some days running tests with the DPDK L3 sample code and trying to be convincing about the improvements. In the end, the DPDK Open Source person declared that he was not going to take any patches unless they doubled the performance. My measly improvement for the total patch set, 22.5 MPPS to 27 MPPS, was far from a 100% improvement.

I brought up with one of the important DPDK leaders at Intel that there was a problem with the acceptance of patches, but I never really got to explain in detail what the real problem was. Nothing was resolved, and I think this might even have given that person a more negative impression of me.

In the end, after spending a considerable amount of time on this, I had found the better driver solution at about this time, so I had to just move on and forget about submittal to the DPDK code base. It left me feeling like I had wasted a bunch of time, even though I had indirectly arrived at a high throughput solution and had gotten some good optimization experience. I also concluded that the optimization effort, even though such work is very important to the Telco industry, was a very poor showing for my professional career, since a lot of time was spent with nothing substantial to show for the accumulated 2 to 3 man-months put into this driver optimization effort.

Overall, unless you are the primary contributor, with large patches of new code sections being readily accepted without question, and without extensive internal code reviews and corporate submittal rules and hindrances, Open Source programming is really not a very productive programming environment. Doing a large internal project and later releasing it to Open Source is probably a much more productive method.

Passing of a Single 10 GbE to a VM Problem

In past attempts, I had needed to pass 1 or 2 of the 10 GbE interfaces from the 4 port Fortville card to a VM. However, I was only able to pass the whole physical PCIe interface to the VM, which gave all 4 ports to one VM. The Fortville has only one controller, which manages all 4 physical ports rather than presenting 4 individual and separate devices. I had looked into the specifications for PCI pass-through, and in theory I should have been able to pass the individual ports to separate VMs. I had previously gone to the Fortville test team and discussed this issue. Having not heard anything, I went back to the Fortville test team, and they informed me that the problem had been solved and all I needed was to use the latest Linux i40e driver and update the Fortville firmware to the latest version, and they provided version information for the driver and firmware.

I updated the firmware on one Fortville card, verified the ability to pass the individual 10 GbE ports to the various VMs, confirmed full throughput operation with the Spirent, and updated my Intel internal wiki page for Fortville with the method to confirm the right firmware version, how to get the firmware, programming information, etc. Then I updated the firmware on the other 5 Fortville cards to be used in the test, and also on a bunch of other Fortville cards that had not been installed and were located in some of the other test systems, a somewhat time consuming task.

Preparation of the Test VM and Test System

I had a working system, but the Linux OS was last year's, so I installed the then new CentOS 7 on the server: a moderate job with the development tools and DPDK libraries, running an OS update, installing the new Linux i40e device driver, carefully documenting the procedure, and creating a CentOS 7 wiki page to go along with the other wiki pages for several other Linux operating systems and versions. I then repeated the procedure for a CentOS 7 VM.

The server had 8 10 GbE Fortville ports on NUMA node 0 and 16 10 GbE Fortville ports on NUMA node 1, and since each NFV was using 2 ports, up to 4 NFVs would run on NUMA node 0 and up to 8 NFVs could run on NUMA node 1.

Realtime resources must be allocated on the system at boot up. In order for a VM to have one 1 Gigabyte hugepage, it must be started with 3 1 Gigabyte pages. NUMA node 0 was set with 12 1-Gigabyte hugepages and NUMA node 1 had 24 1-Gigabyte hugepages reserved. Up to 8 realtime cores were needed for NUMA node 0, and ideally 16 cores for NUMA node 1; however, the CPUs being used only had 14 cores, and some cores must remain as OS cores as well as be used as the OS operation cores of the VMs. NUMA node 0 cores 4 to 13, and NUMA node 1 cores 16 to 27, along with their associated hyper-threaded sibling cores, were reserved on the boot line. Running a core at full performance just requires that its associated hyper-threaded sibling not be affinitized to any task and be left idle. This allows whatever combination of non-hyper-threaded or hyper-threaded realtime NFV tasks to be run as convenient. Since only 12 cores were available for the 16 10 GbE interfaces on NUMA node 1, only 6 NFVs could be run with full cores, but 8 NFVs could be run using hyper-threaded core pairs.
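
For reference, this kind of reservation is normally done with kernel boot parameters roughly like the following. This is a hedged example only: the per NUMA node split of the hugepages and the hyper-threaded sibling numbering depend on the kernel and the CPU topology, so the numbers are illustrative rather than the original boot line.

    default_hugepagesz=1G hugepagesz=1G hugepages=36 \
    isolcpus=4-13,16-27,32-41,44-55 \
    nohz_full=4-13,16-27,32-41,44-55 rcu_nocbs=4-13,16-27,32-41,44-55

(The 12/24 per node split of the 36 pages is not shown, and nohz_full/rcu_nocbs require kernel support; the isolated cores plus explicit task affinity are the essential part.)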

Overheating of the CPU cores

I was testing with 3 VMs running and started getting really bad results. It took me a day or so to figure out what was wrong. The CPU cores being used were running at a temperature of 85C to 90C, and some kind of failsafe was kicking in that caused the CPU to stall for short periods of time before resuming, presumably to let the CPU cool down some and not get hotter, which caused the program to not maintain realtime throughput rates. This test was using the i40e 10 GbE device driver with the vector CPU instructions, which perform parallel operations that take more power (watts) than most instructions.

I noticed during the process that the system fans were not running hard, like they do for a brief time during system startup. I called around and finally contacted Supermicro, who told me of an obscure BIOS setting that controlled how the fan speed responds to changes in CPU temperature. I had to go into the lab, shut down, reboot into the BIOS, and adjust it. I set the fan speed control up to a much higher setting. When the system was rebooted, the fans were continually screaming at a noisy high speed level, but I didn't need to be in the lab constantly. The fans were running a little fast, but I didn't know what the cooling effect would be. Checking the system's power usage sensors, I found that the new fan settings were causing the fans to use about 100 watts. I restarted the testing and logged the power and CPU core temperatures at the start and end of each test (1 1/2+ hours for each full Spirent 2544 test), and the CPU core temperature was staying in a reasonable range. I completed the first test, which took several days (4 tests/day), with the fans running hard. It was a gamble to lower the speed of the fans, but several time consuming BIOS settings and reboots got the speed down below an ear piercing level. During the tests, the maximum CPU core temperatures stayed at 65C and below, which is higher than ideal, but research indicated that it was an acceptable operating temperature.

This overheating explains what happened to another system on which I had left the realtime NFVs running when I went home for Christmas vacation; it was dead when I returned after New Year's. I messed with it for a day or two to try to get it working and finally had to send it in to get fixed.

To be Continued

I still have some good information and data to add, but I am out of time and need to finish some other work before I can continue.