I am trying to achieve 40 Gbps of throughput over a single TCP flow on a 40 Gbps NIC. To rule out disk latency, I created an 80 GB RAM disk on both the server and the client, and placed a 40 GB file in the server's RAM disk, which also serves as the root of my FTP server. When I download this 40 GB file on the client, the speed never goes above roughly 12 Gbps, whereas I would expect to saturate the NIC at 40 Gbps.

Digging further, I copied the file within the RAM disk itself, only changing the name. That copy takes almost as long as sending the file to the client over the 40 Gbps NIC, so my conclusion was that the memory subsystem cannot write at that speed. However, if I start a second download at the same time, pinned to a different CPU, the total speed simply doubles; a third download triples it.
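For reference, here is a rough sketch of the kind of memory-copy test I am describing (not my exact commands; the buffer size, iteration count, and core numbering are arbitrary choices). It pins each worker thread to its own core and reports per-thread memcpy bandwidth, so running it with 1, 2, or 3 threads should reproduce the "doubles/triples" behaviour I see with parallel downloads:

```c
/*
 * Sketch: measure memcpy bandwidth with N threads, each pinned to its own
 * core, to compare single-core copy speed with the aggregate.
 *
 * Build: gcc -O2 -pthread membw.c -o membw
 * Run:   ./membw 1      (one core)
 *        ./membw 4      (four cores; aggregate = sum of the printed numbers)
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE   (1UL << 30)   /* 1 GiB per thread, assumed to fit in RAM */
#define ITERATIONS 8

static void *copy_worker(void *arg)
{
    long core = (long)arg;

    /* Pin this thread to one core so it cannot migrate mid-test. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    char *src = malloc(BUF_SIZE);
    char *dst = malloc(BUF_SIZE);
    if (!src || !dst) {
        fprintf(stderr, "core %ld: allocation failed\n", core);
        return NULL;
    }
    memset(src, 1, BUF_SIZE);    /* touch pages so they are really allocated */
    memset(dst, 0, BUF_SIZE);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERATIONS; i++)
        memcpy(dst, src, BUF_SIZE);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("core %ld: %.2f GB/s\n", core,
           (double)BUF_SIZE * ITERATIONS / secs / 1e9);

    free(src);
    free(dst);
    return NULL;
}

int main(int argc, char **argv)
{
    int nthreads = argc > 1 ? atoi(argv[1]) : 1;
    if (nthreads < 1)  nthreads = 1;
    if (nthreads > 64) nthreads = 64;

    pthread_t tid[64];
    for (long i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, copy_worker, (void *)i);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```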
Based on this observation, I believe the problem is that every core on a CPU is given a defined memory access time slot, both to prevent starvation of memory access and to keep a single core from blocking the rest of the system. However, that time slot is not long enough to sustain the 40 Gbps I want on a single stream. To test this theory, I removed the second CPU (the second socket) and disabled all cores except two. Repeating the same test, the in-memory file copy was over 40% faster.
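As a software-only variant of the "remove the second socket" test, I assume a similar effect could be had by binding the process and its memory to a single NUMA node instead of physically pulling the CPU. A rough sketch, assuming libnuma is installed (link with -lnuma) and that node 0 is the node owning the NIC and the tmpfs pages (check with numactl --hardware):

```c
/*
 * Sketch: bind the current process and its memory allocations to NUMA
 * node 0 before running the copy/transfer test. Node 0 is an assumption.
 */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    /* Run only on the CPUs of node 0 and prefer memory from node 0. */
    if (numa_run_on_node(0) != 0) {
        perror("numa_run_on_node");
        return 1;
    }
    numa_set_preferred(0);

    printf("bound to NUMA node 0; start the copy or FTP test from here\n");
    /* ... exec the benchmark or the ftp client as a child process ... */
    return 0;
}
```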
All this said, I think the best way to achieve very high single-stream throughput in this scenario would be to disable as many cores as possible, so that the remaining ones get a longer time slot for writing to RAM.
My questions: Am I missing something, or is this simply the way it is? I believe the time-slot arbitration for memory bus access is strictly controlled by the Intel CPU. If it is not, can you give me some hints on how to tune it?
Thanks in advance for your feedback.
Moacir