NVMe (Non-Volatile Memory Express) is a host-to-device protocol for PCIe-attached storage. The driver talks to the controller through paired ring buffers in host memory:
- SQ (Submission Queue): host writes commands here.
- CQ (Completion Queue): controller writes completion records here.
- SQE (Submission Queue Entry): one 64-byte command in the SQ. Each SQE is 16 dwords, labeled cdw0 through cdw15. cdw0 carries a 1-byte opcode and a 2-byte CID (Command Identifier) chosen by the host so the host can match a completion back to the request; the remaining dwords hold the namespace ID, the data pointers, and the per-opcode parameters. cdw11 is "Command Dword 11", a per-opcode parameter slot.
- CQE (Completion Queue Entry): one 16-byte completion record in the CQ. Each CQE carries the CID it completes, a status code, and a phase bit that flips each time the controller wraps the ring, so the host can tell fresh entries from stale ones without polling a separate index.
- Doorbell: an MMIO register the host writes to tell the controller "new SQ tail" or "new CQ head".
- PRP (Physical Region Page): a 64-bit physical address slot in the SQE that points at the data buffer for the command. DMA (Direct Memory Access) is how the controller reads or writes that buffer without CPU help.
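As a rough illustration of the two entry layouts described above, the C sketch below spells them out field by field. The struct and helper names are hypothetical and are not taken from the driver (which is written in C#); the field positions follow the NVMe base specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Submission Queue Entry: 16 dwords, 64 bytes (names are illustrative). */
struct nvme_sqe {
    uint32_t cdw0;   /* bits 7:0 = opcode, bits 31:16 = CID */
    uint32_t nsid;   /* namespace ID */
    uint32_t cdw2;
    uint32_t cdw3;
    uint64_t mptr;   /* metadata pointer */
    uint64_t prp1;   /* first PRP entry: physical address of the data buffer */
    uint64_t prp2;   /* second PRP entry (or PRP list pointer) */
    uint32_t cdw10;
    uint32_t cdw11;
    uint32_t cdw12;
    uint32_t cdw13;
    uint32_t cdw14;
    uint32_t cdw15;
};

/* Completion Queue Entry: 4 dwords, 16 bytes. */
struct nvme_cqe {
    uint32_t dw0;      /* command-specific result */
    uint32_t dw1;      /* reserved */
    uint16_t sq_head;  /* how far the controller has consumed the SQ */
    uint16_t sq_id;
    uint16_t cid;      /* CID of the command this entry completes */
    uint16_t status;   /* bit 0 = phase bit, bits 15:1 = status field */
};

static_assert(sizeof(struct nvme_sqe) == 64, "SQE must be 64 bytes");
static_assert(sizeof(struct nvme_cqe) == 16, "CQE must be 16 bytes");

/* Pack the opcode and the host-chosen CID into cdw0. */
static inline uint32_t nvme_make_cdw0(uint8_t opcode, uint16_t cid)
{
    return (uint32_t)opcode | ((uint32_t)cid << 16);
}

/* Extract the phase bit and the status field from a CQE. */
static inline unsigned nvme_cqe_phase(const struct nvme_cqe *c)
{
    assert(c != NULL);
    return c->status & 1u;
}

static inline unsigned nvme_cqe_status(const struct nvme_cqe *c)
{
    assert(c != NULL);
    return (unsigned)(c->status >> 1);
}
```

The two static assertions make the compiler reject any layout change that would break the 64-byte and 16-byte sizes the controller expects.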
There are two queue pairs in this driver:
- Admin queue (qid=0): used once at init to identify the controller, list namespaces, and create the I/O queues. Polled, because it runs at boot.
- I/O queue (qid=1): used for every Read/Write/Flush after init. Interrupt-driven (with a polled fallback).
The interrupt mechanism is MSI-X (Message Signaled Interrupts, eXtended): a PCIe device raises an interrupt by writing a configured value to a configured memory address. Each device gets a small table of vectors; entry 0 in that table is what the I/O CQ is wired to here. An ISR (Interrupt Service Routine) is the function that runs when that interrupt fires.
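To make "a configured value to a configured memory address" concrete, here is a minimal C sketch of one MSI-X table entry and how it might be programmed. The 16-byte entry layout comes from the PCI specification; the address and data encoding shown is the x86 local-APIC convention and is only an assumption about the target platform. The names msix_entry and msix_program_x86 are hypothetical and do not correspond to the driver's MsiX class.

```c
#include <assert.h>
#include <stdint.h>

/* One MSI-X table entry (16 bytes), living in a device BAR. */
struct msix_entry {
    uint32_t addr_lo;         /* message address, low 32 bits */
    uint32_t addr_hi;         /* message address, high 32 bits */
    uint32_t data;            /* value the device writes to that address */
    uint32_t vector_control;  /* bit 0 = mask */
};

static_assert(sizeof(struct msix_entry) == 16, "MSI-X entry must be 16 bytes");

/*
 * Program one entry for x86 (assumption): the address selects the target
 * local APIC, the data selects the CPU interrupt vector.
 */
static inline void msix_program_x86(volatile struct msix_entry *e,
                                    uint8_t apic_id, uint8_t vector)
{
    assert(e != 0);
    e->vector_control |= 1u;                        /* mask while updating */
    e->addr_lo = 0xFEE00000u | ((uint32_t)apic_id << 12);
    e->addr_hi = 0;
    e->data = vector;                               /* fixed delivery mode */
    e->vector_control &= ~1u;                       /* unmask */
}
```

Masking the entry while its address and data are being rewritten keeps the device from sending a half-updated message.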
SetupIoCompletionPlumbing (NVMeController.cs:149):
- Allocates a per-slot InterruptEvent (NVMeController.cs:51). An InterruptEvent is a one-shot binary event that an ISR can signal and a normal thread can wait on without spinning.
- Enables MSI-X on the device and binds OnIoCompletion to MSI-X entry 0 (NVMeController.cs:170).
- When the I/O CQ is created via the Create I/O Completion Queue admin command, cdw11 sets IEN=1 (Interrupt Enable) with interrupt vector 0, so the controller raises an MSI-X message on entry 0 for every completion it posts (NVMeController.cs:624).
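The cdw11 encoding mentioned in the last step can be sketched in C as follows. The bit positions are those of the Create I/O Completion Queue command in the NVMe base specification; the helper names are hypothetical, and whether the driver sets the PC (physically contiguous) bit or uses an 8-entry ring is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Admin opcode for Create I/O Completion Queue. */
#define NVME_ADMIN_CREATE_IOCQ 0x05u

/* cdw10: bits 15:0 = queue ID, bits 31:16 = queue size minus one. */
static inline uint32_t create_iocq_cdw10(uint16_t qid, uint16_t entries)
{
    assert(entries >= 2);  /* the size field is zero-based; minimum is 2 entries */
    return (uint32_t)qid | ((uint32_t)(entries - 1u) << 16);
}

/* cdw11: bit 0 = PC, bit 1 = IEN, bits 31:16 = interrupt vector. */
static inline uint32_t create_iocq_cdw11(int physically_contiguous,
                                         int interrupts_enabled,
                                         uint16_t vector)
{
    return (physically_contiguous ? 1u : 0u)
         | (interrupts_enabled ? 2u : 0u)
         | ((uint32_t)vector << 16);
}
```

With PC=1, IEN=1, and vector 0, create_iocq_cdw11 yields 0x3. In the polled fallback the same helper would be called with interrupts_enabled set to 0.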
SubmitOnSlot (NVMeController.cs:376) is called by Read / Write / Flush:
- Writes a 64-byte SQE into the I/O SQ and rings the SQ doorbell. Both steps happen under _submitSqLock so concurrent callers do not clobber the shared SQ tail index.
- Releases the lock and calls slot.Done.Wait(). Wait blocks the thread through SchedulerManager.BlockThread + InternalCpu.Halt, so the CPU yields to other threads instead of spinning on the CQ (InterruptEvent.cs:40).
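The first step (copy the SQE, advance the tail, ring the doorbell) can be sketched in C like this. It is a simplified model, not the driver's code: the ring size, the type names, and sq_submit are hypothetical, locking is left to the caller, and the doorbell is an ordinary pointer so the sketch can run outside a kernel. The doorbell offsets follow the NVMe register map, where DSTRD is the doorbell stride from the controller's CAP register.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SQ_ENTRIES 8u  /* assumed ring size for the example */

struct sqe { uint8_t bytes[64]; };

struct sq_ring {
    struct sqe entries[SQ_ENTRIES];
    uint16_t tail;                 /* next free entry, shared by all submitters */
    volatile uint32_t *doorbell;   /* SQ tail doorbell (MMIO in a real driver) */
};

/* Byte offsets of the doorbell registers from the start of BAR0. */
static inline uint32_t sq_doorbell_offset(uint16_t qid, uint32_t dstrd)
{
    return 0x1000u + (2u * qid) * (4u << dstrd);
}

static inline uint32_t cq_doorbell_offset(uint16_t qid, uint32_t dstrd)
{
    return 0x1000u + (2u * qid + 1u) * (4u << dstrd);
}

/*
 * Copy one command into the ring and publish the new tail.
 * The caller must hold the submit lock for the whole call.
 * Returns the ring index the command was written to.
 */
static inline uint16_t sq_submit(struct sq_ring *sq, const struct sqe *cmd)
{
    assert(sq != NULL && cmd != NULL && sq->doorbell != NULL);
    uint16_t index = sq->tail;
    memcpy(&sq->entries[index], cmd, sizeof *cmd);
    sq->tail = (uint16_t)((index + 1u) % SQ_ENTRIES);
    /* A real driver needs a write barrier here so the SQE is visible
       to the device before the doorbell write. */
    *sq->doorbell = sq->tail;
    return index;
}
```

Keeping the copy and the doorbell write inside one critical section is what prevents two submitters from claiming the same tail index; the wait happens afterwards, outside the lock.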
OnIoCompletion (NVMeController.cs:443) is the ISR for MSI-X entry 0:
- Walks every CQE whose phase bit matches the expected phase (those are the entries the controller has freshly written), advances the CQ head, and rings the CQ doorbell to tell the controller those slots are free.
- For each completed CQE, looks up the slot by CID and calls slot.Done.Signal(), which wakes the parked submitter through SchedulerManager.ReadyThread (InterruptEvent.cs:83). InterruptEvent acquires its internal spinlock with AcquireIrqSafe on both the Wait and Signal sides, so the ISR does not deadlock against a same-CPU Wait that already holds the lock when the interrupt fires.
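The drain loop described above can be sketched in C as follows. This is a simplified model under stated assumptions: the ring and slot sizes are 8, a plain done flag stands in for InterruptEvent.Signal, the CQ lives in ordinary memory, and all names (cq_ring, slot, cq_drain) are hypothetical. A real driver would read the CQ through volatile or barrier-protected accesses because the controller writes it by DMA.

```c
#include <assert.h>
#include <stdint.h>

#define CQ_ENTRIES 8u  /* assumed ring size */
#define MAX_SLOTS  8u  /* assumed number of in-flight slots */

struct cqe {
    uint32_t dw0, dw1;
    uint16_t sq_head, sq_id, cid;
    uint16_t status;               /* bit 0 = phase bit, bits 15:1 = status */
};

struct cq_ring {
    struct cqe entries[CQ_ENTRIES];
    uint16_t head;                 /* next entry the host will examine */
    uint8_t phase;                 /* phase value that marks a fresh entry */
    volatile uint32_t *doorbell;   /* CQ head doorbell */
};

struct slot {
    int done;                      /* stand-in for signaling the slot's event */
    uint16_t status;
};

/* Consume every fresh CQE, mark its slot done, and return the count. */
static inline unsigned cq_drain(struct cq_ring *cq, struct slot *slots)
{
    assert(cq != 0 && slots != 0);
    unsigned consumed = 0;
    while ((cq->entries[cq->head].status & 1u) == cq->phase) {
        const struct cqe *e = &cq->entries[cq->head];
        if (e->cid < MAX_SLOTS) {          /* ignore CIDs we never issued */
            slots[e->cid].status = (uint16_t)(e->status >> 1);
            slots[e->cid].done = 1;
        }
        if (++cq->head == CQ_ENTRIES) {    /* wrapped: expect the other phase */
            cq->head = 0;
            cq->phase ^= 1u;
        }
        consumed++;
    }
    if (consumed != 0)
        *cq->doorbell = cq->head;          /* tell the controller the entries are free */
    return consumed;
}
```

Because the expected phase flips on every wrap, the loop always stops at the first entry the controller has not rewritten since the previous pass, even when the ring is completely full.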
Every slot owns its own InterruptEvent and its own 4 KiB DMA buffer (NVMeController.cs:49). Up to IoQueueDepth (8) commands can be in flight at once, with each producer thread parking on its own event.
If MSI-X is not available (the device has no MSI-X capability, or the platform has no MSI routing backend, e.g. ARM64 today per the class doc comment at line 25), MsiX.Enable returns null and the driver falls back to polling. In that mode WaitCompletion (NVMeController.cs:483) spins on the CQE phase bit, and _polledIoMutex serializes the whole submit-and-wait sequence so concurrent callers do not race on _ioCqHead / _ioCqPhase.
The serial line "[NVMe] MSI-X unavailable, falling back to polled I/O" at boot indicates that this path is active.
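For comparison with the interrupt path, a polled wait on the phase bit might look like the C sketch below. It is not the driver's WaitCompletion: the spin budget, the return convention, and the names polled_cq and wait_completion are assumptions made for the example, and the CQ head doorbell write is left out for brevity.

```c
#include <assert.h>
#include <stdint.h>

#define CQ_ENTRIES 8u  /* assumed ring size */

struct cqe {
    uint32_t dw0, dw1;
    uint16_t sq_head, sq_id, cid;
    uint16_t status;               /* bit 0 = phase bit, bits 15:1 = status */
};

struct polled_cq {
    volatile struct cqe *entries;  /* CQ ring written by the controller */
    uint16_t head;
    uint8_t phase;                 /* phase value that marks a fresh entry */
};

/*
 * Spin until the entry at the head carries the expected phase.
 * Returns 0 and fills *cid / *status on completion, or -1 if the
 * (hypothetical) spin budget runs out first.
 * The caller must serialize submit-and-wait, since head and phase
 * are shared state.
 */
static inline int wait_completion(struct polled_cq *cq, uint32_t max_spins,
                                  uint16_t *cid, uint16_t *status)
{
    assert(cq != 0 && cid != 0 && status != 0);
    for (uint32_t i = 0; i < max_spins; i++) {
        volatile struct cqe *e = &cq->entries[cq->head];
        uint16_t raw = e->status;
        if ((raw & 1u) != cq->phase)
            continue;                      /* still stale: keep spinning */
        *cid = e->cid;
        *status = (uint16_t)(raw >> 1);
        if (++cq->head == CQ_ENTRIES) {    /* wrapped: expect the other phase */
            cq->head = 0;
            cq->phase ^= 1u;
        }
        return 0;
    }
    return -1;
}
```

The sketch mutates head and phase without any locking of its own, which is why the whole submit-and-wait sequence has to run under a single mutex in polled mode.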