Figs. 12(b) and 12(c) show the core reason for this degradation: memory alloc/dealloc becoming more prominent in CPU consumption at both sender and receiver (now consuming 30% of CPU cycles at the receiver). This is because of two additional per-page operations required by IOMMU: (1) when the NIC driver allocates new pages for DMA, it has to also insert these pages into the device’s pagetable (domain) on the IOMMU; (2) once DMA is done, the driver has to unmap those pages. These two additional per-page operations result in increased overheads.
This looks like an optimization opportunity: allocate once, leave the mapping in place, and DMA into the same pages every time. The scatter part of sg (scatter-gather) would presumably be lost, but judging by this description it was not being used here anyway.
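To put a rough number on the "allocate once" idea, here is a toy model that only counts IOMMU pagetable operations. It is a sketch, not driver code: `ToyIommu`, `per_page_mapping`, `persistent_pool`, the 10,000 DMAs and the pool size of 256 are all made up for illustration and are not taken from the paper or from any kernel API.

```python
class ToyIommu:
    """Hypothetical stand-in for the device's IOMMU pagetable; it only counts operations."""

    def __init__(self):
        self.mapped = set()
        self.map_ops = 0
        self.unmap_ops = 0

    def map(self, page):
        self.mapped.add(page)
        self.map_ops += 1

    def unmap(self, page):
        self.mapped.remove(page)
        self.unmap_ops += 1


def per_page_mapping(iommu, num_dmas):
    """Baseline from the quoted text: map on allocation, unmap once DMA is done."""
    for page in range(num_dmas):
        iommu.map(page)    # (1) insert the fresh page into the device pagetable
        # ... the device would DMA into the page here ...
        iommu.unmap(page)  # (2) remove the page after DMA completes


def persistent_pool(iommu, num_dmas, pool_size):
    """Idea from the note: map a fixed pool once, then keep reusing it."""
    pool = list(range(pool_size))
    for page in pool:
        iommu.map(page)    # one-time setup cost
    for i in range(num_dmas):
        page = pool[i % pool_size]  # already mapped, so no IOMMU work per DMA
        # ... the device would DMA into the page here ...
    return pool


baseline = ToyIommu()
per_page_mapping(baseline, 10_000)

pooled = ToyIommu()
persistent_pool(pooled, 10_000, pool_size=256)

print(baseline.map_ops + baseline.unmap_ops)  # 20000
print(pooled.map_ops + pooled.unmap_ops)      # 256
```

The model ignores everything that makes this hard in practice. A page can only be reused after the stack has consumed or copied its data, and pages that stay mapped remain writable by the device, which gives up part of the isolation the IOMMU is there to provide.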