A modified driver for Connect-IB and ConnectX-4, instrumented to measure PCIe
bandwidth use. This driver is based on libmlx5-1.0.2mlnx1 from Mellanox OFED 3.2.
Warning: This replaces the existing driver.
./autogen.sh
./configure
make
sudo cp src/.libs/*-rdmav2.so /usr/lib/libibverbs/
- Base commit = f6dfece0176149bfd6c414475e75bcaedb6d830b
- Initialization for stats is done during driver initialization in
  `mlx5_driver_init()`. This function gets called once per multi-threaded process.
- To avoid modifying the `verbs.h` header file, we use a rarely-used verbs
  function to print stats. The user calls `ibv_detach_mcast(*qp, *gid, lid)`
  with a valid queue pair pointer, `gid = NULL`, and `lid = 0xffff`. Using a
  valid GID and LID leads to regular behavior.
- Assumptions:
  - RDMA reads are not issued.
  - RDMA write and SEND payloads are inlined.
  - PCIe MaxPayloadSize and MaxReadReq are ignored. (For payloads that can be inlined, these parameters should not matter.)
  - Inline RECV and CQE compression are disabled.
- The measured PCIe bandwidth use is fairly, but not completely, accurate.
  Using hardware PCIe counters can give more confidence in a measurement.
  Two factors that are impossible to account for at the driver, and lead to
  measurement inaccuracy, are:
  - RECV descriptor fetch: The number of RECV descriptors fetched by the
    NIC in one DMA read depends on the NIC firmware and workload. We assume
    that all the RECV descriptors posted in one `post_recv()` call will be
    fetched in one DMA read.
  - WQE re-fetch: When BlueFlame is enabled and postlist is used, the driver
    writes the last WQE in the postlist to the NIC using MMIO. We assume that
    the NIC re-reads this WQE later. This is not always true, and can be
    checked using PCIe counters.
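
The stats-printing convention described above (a NULL GID together with LID 0xffff) can be illustrated with a small standalone sketch. This is not the driver's actual code: `sketch_gid`, `STATS_REQUEST_LID`, and `is_stats_request()` are hypothetical names, and the sketch assumes that both sentinel values must be present for a call to count as a stats request.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for union ibv_gid from verbs.h, so the sketch compiles alone. */
union sketch_gid {
    uint8_t raw[16];
};

/* Sentinel LID that, together with a NULL GID, requests a stats dump. */
#define STATS_REQUEST_LID 0xffff

/*
 * Hypothetical helper: returns 1 when the arguments match the
 * stats-request convention (gid == NULL and lid == 0xffff), and 0 for
 * any other combination, which would take the regular detach path.
 */
static int is_stats_request(const union sketch_gid *gid, uint16_t lid)
{
    return gid == NULL && lid == STATS_REQUEST_LID;
}
```

From an application, the corresponding call would be `ibv_detach_mcast(qp, NULL, 0xffff)` on a valid queue pair, as described in the list above.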
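
The two accounting assumptions above (one DMA read per `post_recv()` call, and a re-fetch of the MMIO-written WQE) can be sketched as simple counters. This is a minimal illustration, not the instrumented driver's code: the struct, the function names, and the caller-supplied descriptor and WQE sizes are all hypothetical, PCIe header overhead is ignored, and the sketch assumes that every WQE in a postlist, including the last one, is read by the NIC via DMA.

```c
#include <stdint.h>

/* Hypothetical counters; the real driver's stats layout may differ. */
struct pcie_sketch {
    uint64_t recv_dma_reads;   /* DMA reads issued to fetch RECV descriptors */
    uint64_t recv_dma_bytes;   /* bytes of RECV descriptors fetched */
    uint64_t mmio_write_bytes; /* bytes written to the NIC via MMIO */
    uint64_t wqe_dma_bytes;    /* bytes of WQEs fetched by the NIC via DMA */
};

/* Assumption: all descriptors from one post_recv() call are fetched in
 * a single DMA read, regardless of how many were posted. */
static void account_post_recv(struct pcie_sketch *s, uint32_t num_descs,
                              uint32_t desc_size)
{
    if (num_descs == 0)
        return;
    s->recv_dma_reads += 1;
    s->recv_dma_bytes += (uint64_t)num_descs * desc_size;
}

/* Assumption: with BlueFlame and a postlist, the last WQE is written via
 * MMIO and later re-read by the NIC, so it is counted in both totals. */
static void account_bf_postlist(struct pcie_sketch *s, uint32_t num_wqes,
                                uint32_t wqe_size)
{
    if (num_wqes == 0)
        return;
    s->mmio_write_bytes += wqe_size;                    /* last WQE via MMIO */
    s->wqe_dma_bytes += (uint64_t)num_wqes * wqe_size;  /* all WQEs via DMA */
}
```

Comparing totals like these against hardware PCIe counters shows how far the two assumptions hold for a given firmware and workload.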