Add NVMe SGL support for FEMU #129
base: master
Conversation
Could you share your setup and results? Under multi-poller mode, multiple small `memcpy()` operations can actually saturate the memory bandwidth.
I first conducted performance tests on `memcpy()` by allocating a memory pool and randomly performing `memcpy()` operations with sizes of 4KB or 1024KB:

```c
// memcpy() performance testing
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define SIZE  (1024 * 1024)   /* 1024KB; use (4 * 1024) for the 4KB run */
#define COUNT (100000)
#define mb()  asm volatile("mfence" ::: "memory")

int main(void)
{
    int Pool_Size = 1024 * 1024 * 1024;
    char *src = malloc(Pool_Size);
    char *dst = malloc(SIZE);
    if (src == NULL || dst == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        return 1;
    }

    struct timeval start, end;
    gettimeofday(&start, NULL);
    for (int i = 0; i < COUNT; ++i) {
        /* pick a random offset in the src memory pool */
        long long offset = (random() % (Pool_Size / SIZE)) * SIZE / 8;
        memcpy(dst, src + offset, SIZE);
    }
    gettimeofday(&end, NULL);

    long long total_time = (end.tv_sec - start.tv_sec) * 1000000LL
                           + (end.tv_usec - start.tv_usec);
    double average_time = (double)total_time / COUNT;

    free(src);
    free(dst);
    printf("Total time: %.2lf seconds\n", (double)total_time / 1000000);
    printf("Average time per memcpy: %lf us\n", average_time);
    return 0;
}
```

The purpose of this program is to simulate the environment in which FEMU operates. The test results are presented in the table below. They indicate that repeatedly executing `memcpy()` at a 4KB granularity 256 times is less efficient than performing a single 1024KB `memcpy()`.
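For reference, here is a hypothetical sketch of the two copy patterns being compared; it is separate from the test program above, omits the timing code, and exists only to make the comparison concrete:

```c
/* Hypothetical sketch (not part of the test program above): copying the same
 * 1024KB payload either as one memcpy() call, similar to what a large SGL
 * segment allows, or as 256 separate 4KB memcpy() calls, one per PRP entry. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TOTAL (1024 * 1024)  /* 1024KB payload */
#define ENTRY (4 * 1024)     /* 4KB, the PRP entry granularity */

int main(void)
{
    char *src = malloc(TOTAL);
    char *dst = malloc(TOTAL);
    if (src == NULL || dst == NULL) {
        fprintf(stderr, "Memory allocation failed\n");
        return 1;
    }

    /* SGL-like pattern: one large copy of the whole payload. */
    memcpy(dst, src, TOTAL);

    /* PRP-like pattern: 256 page-sized copies covering the same range. */
    for (size_t off = 0; off < TOTAL; off += ENTRY) {
        memcpy(dst + off, src + off, ENTRY);
    }

    free(src);
    free(dst);
    return 0;
}
```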
I also conducted tests in FEMU, and the system configuration is as follows:
I separately measured the execution time of the
In the test results, there is a negligible difference for 128KB I/O. For larger I/O sizes, a slight performance gap is observed between PRP and SGL, unlike the earlier standalone simulation, which showed a significant disparity. While investigating performance bottlenecks I had identified this issue, but it was not prominent; after modifying and re-testing, I confirmed that the gap was mainly influenced by other factors (PRP was not the direct cause).

The experiments indicate that in FEMU, with multiple pollers enabled, the performance difference between SGL and PRP is not very pronounced: in a multi-threaded scenario, the latency timing model and the queuing of I/O requests in the FTL can overshadow the minor performance gap in DMA (`memcpy()`). However, adding SGL, or other NVMe features, may still be worthwhile. If you're interested in this pull request, I'd be happy to validate it further and make modifications :)
I noticed that the NVMe module of QEMU didn't support NVMe SGL officially when FEMU was first introduced. Now the latest QEMU has added this feature.
Currently FEMU uses NVMe PRP to split a large I/O (128KB, 512KB, 1024KB) into many 4KB PRP entries (aligned with the OS physical memory page size), and the DRAM backend repeats a 4KB DMA (`memcpy()`, actually) for each entry. In my test results, doing a single 1024KB `memcpy()` is more efficient than repeating a 4KB DMA 256 times, and SGL performs larger `memcpy()` operations with fewer calls, so the PRP-only path may result in a loss of performance.

The code modification is based on hw/nvme/ctrl.c, with no change to the current code structure. The current code in FEMU has many incompatibilities with the latest QEMU NVMe module, which I have adjusted appropriately.
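To make the difference concrete, below is a simplified sketch of the two data paths. This is not the actual hw/nvme/ctrl.c or FEMU backend code; the names `backend_rw_prp()`, `backend_rw_sgl()`, and `sgl_desc_t` are illustrative assumptions only.

```c
/* Simplified sketch of why SGL reduces the number of backend memcpy() calls
 * compared with PRP. NOT the actual QEMU/FEMU code; names are illustrative. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PRP_ENTRY_SIZE 4096   /* PRP entries are page-sized (4KB) */

/* PRP: the host buffer is described as a list of page-sized entries, so the
 * DRAM backend issues one 4KB memcpy() per entry. */
static void backend_rw_prp(char *dram, const uint64_t *prp_list,
                           size_t nents, const char *host_mem)
{
    for (size_t i = 0; i < nents; i++) {
        memcpy(dram + i * PRP_ENTRY_SIZE,
               host_mem + prp_list[i], PRP_ENTRY_SIZE);
    }
}

/* SGL: each descriptor carries an (address, length) pair, so a contiguous
 * region can be transferred with a single, larger memcpy(). */
typedef struct {
    uint64_t addr;
    uint32_t len;
} sgl_desc_t;

static void backend_rw_sgl(char *dram, const sgl_desc_t *sgl,
                           size_t ndesc, const char *host_mem)
{
    size_t dram_off = 0;
    for (size_t i = 0; i < ndesc; i++) {
        memcpy(dram + dram_off, host_mem + sgl[i].addr, sgl[i].len);
        dram_off += sgl[i].len;
    }
}

int main(void)
{
    enum { IO_SIZE = 128 * 1024 };              /* a 128KB I/O, as in the tests */
    static char host_mem[IO_SIZE], dram[IO_SIZE];

    /* PRP path: 32 page-sized entries -> 32 separate memcpy() calls. */
    uint64_t prp_list[IO_SIZE / PRP_ENTRY_SIZE];
    for (size_t i = 0; i < IO_SIZE / PRP_ENTRY_SIZE; i++) {
        prp_list[i] = i * PRP_ENTRY_SIZE;       /* host pages contiguous here */
    }
    backend_rw_prp(dram, prp_list, IO_SIZE / PRP_ENTRY_SIZE, host_mem);

    /* SGL path: the same buffer described by one descriptor -> one 128KB memcpy(). */
    sgl_desc_t sgl[1] = { { .addr = 0, .len = IO_SIZE } };
    backend_rw_sgl(dram, sgl, 1, host_mem);

    return 0;
}
```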