On any given day, in a stock 68K system (24-bit), the DMA controllers will out-perform the CPU-driven I/O cards (SCSI or IDE). The HC8, with its interleaved CPU/DMA 14MHz on-card memory, is the best multitasker, leaving the CPU free to run while DMA happens for I/O. With synchronous SCSI and a willing device, both DMA controllers can effectively saturate the Z2 bus in I/O throughput; the CPU-driven interfaces cannot. For the record, in a best-case speed race for pure I/O, with no care for how little the CPU gets to run, the A2091 is a hair better than the HC8 in this classic system configuration: its 33C93A-to-DMAC buffering runs more optimally than the 33C93A-to-DPRC path does. The 33C93A technically DMAs into the receiving port on those DMA controller chips, and the 16-bit DMA transfer then happens to/from the memory destination.
Move these two DMA cards up to the accelerator-driven systems with both DMA-capable Z2 AutoConfig 32-bit RAM and non-DMA-capable high RAM, and you incur all of the CPU-lockout issues inherent in Z2 bus mastering, plus the need at times to copy that data up to the upper address ranges. That is where your I/O speed limit comes from: the data has to cross the Z2 bus twice, halving its effective throughput. The argument for moving to CPU-driven I/O gains good traction here, since the CPU has to be involved anyway. If the 32-bit memory performance is good and the instruction cache is running the copy-up code, you can get some good numbers off the Z2 bus (for IDE). This is where the Buddha has a suitable place in the field. There's still the 3.4-3.5MB/sec speed limit in Z2, though, and one can't hit all of the optimal access windows all the time. Sync-clocked CPU-slot cards will of course do better there, but they are rare in the A2000 (the A2620 being one, the old RONIN 030 cards being another). In an odd case, the DPRC's 16-bit FastRAM (when none is low-mapped on the 32-bit accessible accelerator cards - n/a to this A2630 memory configuration) doesn't block the CPU during DMA, but the CPU copy loop is still tapping the slower 7MHz 16-bit RAM used for DMA buffering. It just costs a bit more CPU, while staying multitasking-friendly.
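To make the copy-up cost concrete, here is a minimal, platform-neutral C sketch of the bounce-buffer pattern the above describes. The names (`dma_read_sector`, `read_sector_to_high_ram`) and the fake device data are mine, purely illustrative; on a real Amiga the copy would be exec's CopyMem and the bounce buffer would sit in DMA-visible 24-bit RAM:

```c
#include <string.h>
#include <stdint.h>

#define SECTOR 512

/* Stand-in for the Z2 DMA engine depositing a sector into
 * DMA-capable (24-bit addressable) memory. Hypothetical: a real
 * controller would raise an interrupt when the transfer lands. */
static void dma_read_sector(uint8_t *dma_visible_buf)
{
    for (int i = 0; i < SECTOR; i++)
        dma_visible_buf[i] = (uint8_t)i;   /* pretend device data */
}

/* The copy-up: the data crosses the bus once as the DMA write into
 * the bounce buffer, then a second time as the CPU copies it to
 * high (non-DMA-reachable) RAM. Two crossings per byte is why the
 * effective Z2 throughput halves. */
void read_sector_to_high_ram(uint8_t *high_ram_dest)
{
    static uint8_t bounce[SECTOR];         /* would be 24-bit Z2 RAM */
    dma_read_sector(bounce);               /* crossing #1: DMA write */
    memcpy(high_ram_dest, bounce, SECTOR); /* crossing #2: CPU copy  */
}
```

The copy loop is also where the instruction cache matters, as noted above: if the copy code stays cached, the CPU's memory cycles go to the data, not to refetching the loop.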
Then we get into the A2000 accelerator combos with onboard SCSI (or IDE) and DMA options. The GVP 030/040 cards with the DPRC on them are admittedly more like the A2091 without the nice interleaved memory access design, but they pull 2-3MB/sec in optimal cases, and there's a hack on the GF040 where, in a special case, I've seen over 4.5MB/sec. Move up to the mid-1990s 040/060-class accelerators with the 53C7xx SCSI/DMA chips on them, and you are back in favor of the DMA cards, in both I/O performance and CPU-friendly multitasking. The Blizzard and TekMagic cards are among this group, and are lightning-fast in both CPU power and I/O. I've actually been working on an A2000 system with the Buddha and the Blizzard 060 (to copy off data as I look to upgrade it from 3.1 w/some 3.5/3.9-era additions to 3.2.1 ROM/Workbench), and it's no slouch in I/O with that kind of CPU power. It's just not the most optimal. The Buddha is mine in this equation, and so is just a temporary backup point, but if the guy didn't have tape, removable media, and several drives in an external tower, the Buddha might be a modern answer, too.
I left out the 68000-680x0 accelerators now showing up with IDE interfaces (and RAM) on them, but must acknowledge them. They still use the CPU to effect transfers. My brother wrote the RSCP benchmark back in the early 1990s, and I updated it with a partner more recently. It helps show what impact I/O activity (DMA or CPU-driven transfers) can have on multitasking performance (CPU free time) when a system is 'under load'.
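The core idea of that kind of benchmark can be sketched in a few lines of portable C. This is my illustration of the technique, not RSCP's actual code; the function name and the sample counts are hypothetical. A lowest-priority null loop increments a counter for a fixed wall-clock window, once on an idle system and once while I/O runs; the ratio of the two counts is the CPU time left over for everything else:

```c
/* RSCP-style CPU free-time metric (name and figures are
 * illustrative, not taken from the real benchmark).
 * count_idle:       null-loop iterations with no I/O running
 * count_under_load: iterations over the same window while a
 *                   transfer (DMA or CPU-driven) is in flight */
double cpu_free_pct(unsigned long count_under_load,
                    unsigned long count_idle)
{
    if (count_idle == 0)
        return 0.0;   /* guard against a divide-by-zero */
    return 100.0 * (double)count_under_load / (double)count_idle;
}
```

With hypothetical counts, a CPU-driven transfer that leaves the null loop only 180,000 iterations out of an idle baseline of 1,000,000 scores 18% CPU free, while a DMA card that leaves 910,000 scores 91% - the multitasking difference described above.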
The answers, as I noted, are never straightforward.