Hazard3 Closely-coupled Accelerators

This document describes the Hazard3 closely-coupled accelerator (CCA) interface and the associated Xh3cca RISC-V extension. The goals of the CCA interface are:

Higher write throughput into core-local accelerators
Support for I/O stalls that do not block debug and IRQs (impossible on AHB)
Access to accelerators without generating addresses first (reduced register pressure)
Access to accelerators without address-dependent protection checks (improved control-path timing)
Atomic (non-tearing) reads of 64-bit buses
Compatibility with vendor coprocessors designed for Cortex-M systems

Up to eight accelerators are connected to the core through a control bus and a pair of 2×XLEN-bit data buses (read and write).

The Xh3cca extension allocates the entirety of the custom-3 major opcode (instr[6:0] = 7'b1111011). It adds no new CSRs.

Throughput

Hazard3 supports the following CCA throughputs (XLEN=32):

32-bit or 64-bit write in one cycle
32-bit read in one cycle
64-bit read in two cycles (all inputs sampled on the same cycle)
64-bit write and a 32-bit read in one cycle

The bandwidth is limited by Hazard3's two-read one-write register file implementation. Support for 64-bit reads is necessary for compatibility with coprocessors designed for Cortex-M systems. The throughput of 64-bit reads may increase on future implementations.

Xh3cca Instructions

Closely-coupled accelerator instructions are allocated in the custom-3 opcode. For all instructions, funct3 (instr[14:12]) selects one of 8 accelerators.
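As a minimal sketch of this first decode step, the following Python models a 32-bit instruction word as an integer, checks for the custom-3 opcode, and extracts the accelerator select. The function and constant names are illustrative only and are not defined by Xh3cca.

```python
CUSTOM3_OPCODE = 0b1111011  # instr[6:0] for the custom-3 major opcode


def is_xh3cca(instr: int) -> bool:
    """True if the 32-bit instruction word lies in the custom-3 opcode."""
    return (instr & 0x7F) == CUSTOM3_OPCODE


def accelerator_select(instr: int) -> int:
    """funct3 (instr[14:12]) selects one of 8 accelerators."""
    return (instr >> 12) & 0x7


# Example word: custom-3 opcode, accelerator 5, all other bits zero.
example = (5 << 12) | CUSTOM3_OPCODE
```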

h3.cca.dp

opcode = custom-3, instr[31:30] = 2'b00

Data processing. Issue an instruction to the accelerator without performing any data transfer between the core and accelerator.

The following instruction bits are usable for accelerator decode: 29:15, 11:7 (total = 20)

h3.cca.w

opcode = custom-3, instr[31:30] = 2'b01

Write XLEN bits from rs1 into accelerator.

The following instruction bits are usable for accelerator decode: 29:20, 11:7 (total = 15)

h3.cca.r

opcode = custom-3, instr[31:30] = 2'b10

Read XLEN bits from accelerator into rd.

The following instruction bits are usable for accelerator decode: 29:15 (total = 15)

h3.cca.ww

opcode = custom-3, instr[31:29] = 3'b110

Write 2×XLEN bits from rs1, rs2 into accelerator.

The following instruction bits are usable for accelerator decode: 28:25, 11:7 (total = 9)

h3.cca.rr

opcode = custom-3, instr[31:28] = 4'b1110

Read 2×XLEN bits from accelerator into rd, rd + 1. rd[0] must be 0 (else reserved). Data bus is captured into stage-X and stage-M result registers simultaneously so an atomic 64-bit read can be committed to the register file in two cycles.
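The register-pair rule can be illustrated with a small Python model of the h3.cca.rr writeback for XLEN=32. Which half of the bus lands in rd and which in rd + 1 is not stated above, so the ordering below (low half to rd) is an assumption made for the example, and the function name is hypothetical.

```python
XLEN = 32
MASK = (1 << XLEN) - 1


def rr_writeback(rd: int, rdata: int) -> dict:
    """Split a 2*XLEN-bit read into the register pair (rd, rd + 1).

    Assumption: bus bits XLEN-1:0 go to rd and the upper half to rd + 1.
    """
    if rd & 1:
        # rd[0] must be 0; odd encodings are reserved.
        raise ValueError("h3.cca.rr with odd rd is reserved")
    return {rd: rdata & MASK, rd + 1: (rdata >> XLEN) & MASK}


pair = rr_writeback(10, 0x1122334455667788)
```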

The following instruction bits are usable for accelerator decode: 27:15 (total = 13)

h3.cca.rww

opcode = custom-3, instr[31:28] = 4'b1111

Write 2×XLEN bits from rs1, rs2 into accelerator. Simultaneously read XLEN bits from accelerator into rd. The accelerator read data may depend on the accelerator write data; equivalently the value returned in rd may be a function of rs1 and rs2, in addition to any internal accelerator state.

The following instruction bits are usable for accelerator decode: 27:25 (total = 3). The accelerator may also ignore the value of rs2 and decode bits 24:20 of the instruction, for an additional 5 bits.
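The instr[31:28] encodings given for the six instructions can be collected into one classifier. This Python sketch assumes the caller has already checked that the major opcode is custom-3; the function name is illustrative.

```python
def classify(instr: int) -> str:
    """Identify the Xh3cca instruction from instr[31:28]."""
    top = (instr >> 28) & 0xF
    if top >> 2 == 0b00:
        return "h3.cca.dp"   # instr[31:30] = 00
    if top >> 2 == 0b01:
        return "h3.cca.w"    # instr[31:30] = 01
    if top >> 2 == 0b10:
        return "h3.cca.r"    # instr[31:30] = 10
    if top >> 1 == 0b110:
        return "h3.cca.ww"   # instr[31:29] = 110
    if top == 0b1110:
        return "h3.cca.rr"   # instr[31:28] = 1110
    return "h3.cca.rww"      # instr[31:28] = 1111
```

Every value of instr[31:28] maps to exactly one instruction, so the six encodings together fill the custom-3 opcode.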

Opcode Bits for Accelerator Decode

The following table summarises which instruction bits are available for accelerator decode on which instructions:

Instruction	`29`	`28`	`27:25`	`24:20`	`19:15`	`11:7`	(Total)
`h3.cca.dp`	Y	Y	Y	Y	Y	Y	20
`h3.cca.r`	Y	Y	Y	Y	Y		15
`h3.cca.rr`			Y	Y	Y		13
`h3.cca.w`	Y	Y	Y	Y		Y	15
`h3.cca.ww`		Y	Y			Y	9
`h3.cca.rww`			Y				3
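The totals in this table can be cross-checked by writing each row as a bit mask over the instruction word and counting the set bits. This is a Python sketch; the helper name bits and the dictionary are illustrative.

```python
def bits(hi: int, lo: int) -> int:
    """Mask with instruction bits hi..lo (inclusive) set."""
    return ((1 << (hi - lo + 1)) - 1) << lo


# Instruction bits left undecoded by the core, per instruction.
DECODE_MASK = {
    "h3.cca.dp":  bits(29, 15) | bits(11, 7),
    "h3.cca.r":   bits(29, 15),
    "h3.cca.rr":  bits(27, 15),
    "h3.cca.w":   bits(29, 20) | bits(11, 7),
    "h3.cca.ww":  bits(28, 25) | bits(11, 7),
    "h3.cca.rww": bits(27, 25),
}

# Number of accelerator decode bits for each instruction.
totals = {name: bin(mask).count("1") for name, mask in DECODE_MASK.items()}
```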

CCA Interface Signals

Request phase, core -> accelerator:

cca_vld: core asserts to request access; may deassert while stalled by cca_rdy being low. All other request-phase outputs must be ignored when cca_vld is deasserted.
cca_priv[1:0]: current privilege level, 3=M-mode 1=S-mode 0=U-mode
cca_select[2:0]: accelerator selection (funct3)
cca_opcode[19:0]: the following instruction bits: 29:15, 11:7
cca_ren: request a transfer from accelerator to core
cca_rsize: read data size (0 = XLEN, 1 = 2×XLEN)
cca_wen: request a transfer from core to accelerator
cca_wsize: write data size (0 = XLEN, 1 = 2×XLEN)
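As a sketch of how cca_opcode[19:0] could be formed from the instruction word, the Python below concatenates instr[29:15] above instr[11:7]. The bit ordering within cca_opcode is an assumption read from the order in which the ranges are listed; it is not confirmed by the signal description.

```python
def cca_opcode(instr: int) -> int:
    """Pack instr[29:15] and instr[11:7] into a 20-bit value.

    Assumption: instr[29:15] occupies cca_opcode[19:5] and
    instr[11:7] occupies cca_opcode[4:0].
    """
    hi = (instr >> 15) & 0x7FFF  # instr[29:15], 15 bits
    lo = (instr >> 7) & 0x1F     # instr[11:7], 5 bits
    return (hi << 5) | lo
```

Note that funct3 (instr[14:12]) does not appear in cca_opcode, since it is carried separately on cca_select.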

Request phase, accelerator -> core:

cca_rdy: accelerator asserts to accept instruction. Instruction is accepted if cca_vld && cca_rdy.
cca_err: accelerator asserts to indicate invalid instruction. Valid whenever cca_rdy is asserted. The core takes an illegal instruction exception on cca_vld && cca_rdy && cca_err.

Data phase:

cca_wdata[63:0]: core -> accelerator, valid on the cycle after cca_vld && cca_rdy && !cca_err && cca_wen
cca_rdata[63:0]: accelerator -> core, valid on the cycle after cca_vld && cca_rdy && !cca_err && cca_ren
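The request and data phases can be modelled cycle by cycle. The following Python sketch takes a trace of per-cycle signal values and reports on which cycles cca_wdata and cca_rdata are valid. The function names and the trace format are assumptions for illustration, not part of the interface.

```python
def request_outcome(vld, rdy, err) -> str:
    """Classify one request-phase cycle."""
    if not vld:
        return "idle"
    if not rdy:
        return "stall"          # core waits; may also deassert cca_vld
    return "exception" if err else "accept"


def data_phase_cycles(trace):
    """trace: list of (vld, rdy, err, wen, ren) tuples, one per cycle.

    Returns (write_cycles, read_cycles): the cycles on which
    cca_wdata and cca_rdata respectively are valid.
    """
    write_cycles, read_cycles = [], []
    for cycle, (vld, rdy, err, wen, ren) in enumerate(trace):
        if request_outcome(vld, rdy, err) == "accept":
            # The data phase is the cycle after the accepted request.
            if wen:
                write_cycles.append(cycle + 1)
            if ren:
                read_cycles.append(cycle + 1)
    return write_cycles, read_cycles


trace = [
    (1, 0, 0, 1, 1),  # cycle 0: stalled by cca_rdy low
    (1, 1, 0, 1, 1),  # cycle 1: accepted, write and read requested
    (0, 0, 0, 0, 0),  # cycle 2: idle (data phase of cycle 1)
    (1, 1, 1, 1, 0),  # cycle 3: cca_err, no data phase
]
writes, reads = data_phase_cycles(trace)
```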

For h3.cca.dp instructions, neither wen nor ren is asserted. For h3.cca.rww instructions, both wen and ren are asserted. The control signals for each instruction are summarised by the following table:

Instruction	`wen`	`wsize`	`ren`	`rsize`
`h3.cca.dp`	0	0	0	0
`h3.cca.r`	0	0	1	0
`h3.cca.rr`	0	0	1	1
`h3.cca.w`	1	0	0	0
`h3.cca.ww`	1	1	0	0
`h3.cca.rww`	1	1	1	0
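One way to derive these four control signals directly from instr[31:28] is sketched below in Python, written as single-bit logic to resemble what a hardware decoder might do. This is an illustrative model, not the Hazard3 implementation.

```python
def control_signals(instr: int):
    """Return (wen, wsize, ren, rsize) for an Xh3cca instruction word."""
    b31, b30, b29, b28 = ((instr >> n) & 1 for n in (31, 30, 29, 28))
    two_word = b31 & b30               # ww, rr or rww
    ww = two_word & (b29 ^ 1)          # instr[31:29] = 110
    rr = two_word & b29 & (b28 ^ 1)    # instr[31:28] = 1110
    rww = two_word & b29 & b28         # instr[31:28] = 1111
    w = (b31 ^ 1) & b30                # instr[31:30] = 01
    r = b31 & (b30 ^ 1)                # instr[31:30] = 10
    wen = w | ww | rww
    ren = r | rr | rww
    wsize = ww | rww
    rsize = rr
    return wen, wsize, ren, rsize
```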

RV64I Support

Xh3cca is specified for both 32-bit (XLEN=32) and 64-bit (XLEN=64) processors. The width of the accelerator data buses is always twice XLEN.

A 32-bit accelerator can be attached to a 64-bit core by connecting bits (63:32, 31:0) of the accelerator data buses to bits (95:64, 31:0) of the core data buses, and tying unused inputs to zero. It is not possible in general to connect 64-bit accelerators to 32-bit cores.
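The lane mapping can be sketched in Python by treating the buses as integers. The example assumes the upper 32-bit slice of the 128-bit core bus is bits 95:64, so that each 32-bit half of the accelerator bus sits at the bottom of one 64-bit half of the core bus. Function names are illustrative.

```python
M32 = 0xFFFFFFFF


def acc32_to_core64(acc_bus: int) -> int:
    """Place a 64-bit accelerator bus value on the 128-bit core bus.

    Accelerator bits 31:0 map to core bits 31:0, accelerator bits 63:32
    map to core bits 95:64, and all other core bits are tied to zero.
    """
    lo = acc_bus & M32
    hi = (acc_bus >> 32) & M32
    return (hi << 64) | lo


def core64_to_acc32(core_bus: int) -> int:
    """Pick core bits 95:64 and 31:0 to drive a 64-bit accelerator bus."""
    lo = core_bus & M32
    hi = (core_bus >> 64) & M32
    return (hi << 32) | lo


core_value = acc32_to_core64(0x1122334455667788)
```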

A processor with 64-bit support (MXLEN=64) executing in 32-bit mode (xXLEN=32 for current mode x) uses bits 95:64 and 31:0 of the core data buses. Remaining core outputs are driven to zero, and remaining core inputs are ignored. This enables access to 32-bit accelerators when XLEN=32 but limits the use of 64-bit accelerators.

RV128I is not supported because the custom-3 opcode is reserved for standard use on this architecture.

Controlling Access to Accelerators

There is no core-side control of access to accelerators. Xh3cca considers accelerator access to be unprivileged.

Accelerators may implement access control by decoding the current core privilege level on cca_priv[1:0] and returning a decode error on cca_err if the privilege is insufficient for the requested operation.
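A minimal sketch of accelerator-side access control follows, in Python. The opcode values and their privilege requirements are hypothetical; only the cca_priv encodings (3 = M-mode, 1 = S-mode, 0 = U-mode) come from the signal description.

```python
PRIV_U, PRIV_S, PRIV_M = 0, 1, 3  # cca_priv[1:0] encodings

# Hypothetical accelerator: minimum privilege per 20-bit cca_opcode value.
REQUIRED_PRIV = {
    0x00000: PRIV_U,  # e.g. read a status value
    0x00001: PRIV_S,  # e.g. start an operation
    0x00002: PRIV_M,  # e.g. load a secret key
}


def cca_err(cca_priv: int, cca_opcode: int) -> bool:
    """Decode error if the opcode is unknown or the privilege is too low."""
    required = REQUIRED_PRIV.get(cca_opcode)
    if required is None:
        return True
    # The encodings are ordered (U < S < M), so an integer compare suffices.
    return cca_priv < required
```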

The interconnect between the core and the accelerators can also implement access control by banking the 3-bit accelerator select space (cca_select[2:0]) across different values of cca_priv[1:0], so that different privilege levels see different accelerators or different subsets of the same group of accelerators.
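Banking the select space might look like the following Python sketch. The accelerator names and the contents of each bank are invented for the example; a real interconnect could, for instance, respond with cca_err for selects that map to no accelerator.

```python
PRIV_U, PRIV_S, PRIV_M = 0, 1, 3  # cca_priv[1:0] encodings

# Hypothetical system: each privilege level has its own view of the
# 3-bit select space (up to 8 entries per bank).
BANKS = {
    PRIV_M: ["crypto", "dma", "fpu", "hash", "timer"],
    PRIV_S: ["fpu", "hash", "dma"],
    PRIV_U: ["fpu", "hash"],
}


def route(cca_priv: int, cca_select: int):
    """Return the accelerator visible at this select, or None if there is none."""
    bank = BANKS.get(cca_priv, [])
    return bank[cca_select] if cca_select < len(bank) else None
```

Here select 0 reaches a different accelerator in M-mode than in U-mode, and the M-mode-only accelerators are not addressable at all from lower privilege levels.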

Compatibility with Arm Coprocessors

The Xh3cca instructions h3.cca.r, h3.cca.rr, h3.cca.w, h3.cca.ww and h3.cca.dp are analogous to the Arm mrc, mrrc, mcr, mcrr and cdp instructions. These instructions were originally described in ARMv5TE and are still implemented today in Armv7-M, Armv8-M Main and Armv8.1-M Main.

The analogous Xh3cca instructions have (at least) the same number of undecoded opcode bits as their ARMv5TE counterparts, so coprocessor opcodes can be translated 1:1 at the boundary of the core with appropriate multiplexing of the opcode bits. Coprocessors designed for Cortex-M systems can be adapted for use with the Hazard3 CCA interface with minimal additional circuitry.
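To make the 1:1 translation concrete, the Python sketch below carries the 15 opcode bits of an Arm mcr in the 15 accelerator decode bits of h3.cca.w (instr[29:20] and instr[11:7]). The placement of individual fields within those bits is hypothetical, as is the mapping of Arm's 4-bit coprocessor number onto the 3-bit select, which is left to the caller here.

```python
CUSTOM3 = 0b1111011


def encode_w_from_mcr(select, two, opc1, crn, crm, opc2, rs1):
    """Build an h3.cca.w instruction word carrying the mcr opcode fields.

    Hypothetical placement: the 15 opcode bits are packed as
    {two, opc1, opc2, CRn, CRm}; the upper 10 bits go to instr[29:20]
    and the lower 5 bits to instr[11:7]. 'two' is 1 for mcr2.
    """
    field = (two << 14) | (opc1 << 11) | (opc2 << 8) | (crn << 4) | crm
    return ((0b01 << 30) | ((field >> 5) << 20) | (rs1 << 15)
            | (select << 12) | ((field & 0x1F) << 7) | CUSTOM3)


def decode_mcr_fields(instr):
    """Recover the mcr opcode fields on the accelerator side."""
    field = (((instr >> 20) & 0x3FF) << 5) | ((instr >> 7) & 0x1F)
    return {
        "two": field >> 14,
        "opc1": (field >> 11) & 0x7,
        "opc2": (field >> 8) & 0x7,
        "crn": (field >> 4) & 0xF,
        "crm": field & 0xF,
    }


word = encode_w_from_mcr(select=2, two=0, opc1=5, crn=9, crm=3, opc2=6, rs1=11)
```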

Accelerators designed specifically for Hazard3 CCA can take advantage of the higher throughput of h3.cca.rww (64-bit write and 32-bit read in one cycle), or use this instruction to implement custom register-register operations.
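As an example of a custom register-register operation built on h3.cca.rww, the Python sketch below models a hypothetical accelerator that returns the unsigned saturating sum of rs1 and rs2 for XLEN=32. The operation itself and the placement of rs1 in the low half of the write bus are assumptions made for illustration.

```python
M32 = 0xFFFFFFFF


def rww_saturating_add(rs1: int, rs2: int) -> int:
    """Model one h3.cca.rww: write {rs2, rs1}, read back a 32-bit result.

    Assumption: rs1 is carried on cca_wdata[31:0] and rs2 on
    cca_wdata[63:32]. The read data depends only on the write data,
    so the accelerator needs no internal state for this operation.
    """
    wdata = ((rs2 & M32) << 32) | (rs1 & M32)
    a = wdata & M32
    b = wdata >> 32
    return min(a + b, M32)  # value returned in rd
```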

Notes on Opcode Space Requirements

mcr:
- 4-bit coprocessor select
- 1 opcode bit for mcr/mcr2
- 4 opcode bits in CRn specifier
- 4 opcode bits in CRm specifier
- 3 opcode bits in opc1
- 3 opcode bits in opc2
- One core register specifier Rt (would be rs1 on RISC-V)
- Total: 15 opcode bits plus one register
mcrr:
- 1 opcode bit for mcrr/mcrr2
- 4 opcode bits in opc1
- 4 opcode bits in CRm specifier
- Two core register specifiers Rt, Rt2 (rs1, rs2 in RISC-V terms)
- Total: 9 opcode bits plus two registers
mrc:
- 1 opcode bit for mrc/mrc2
- 4 opcode bits in CRn specifier
- 4 opcode bits in CRm specifier
- 3 opcode bits in opc1
- 3 opcode bits in opc2
- One core register specifier Rt (rd in RISC-V terms)
- Total: 15 opcode bits plus one register (identical to mcr except the register direction has changed)
mrrc:
- Opcode space requirements identical to mcrr
cdp:
- 1 opcode bit for cdp/cdp2
- 4 opcode bits in opc1
- 4 opcode bits in CRn
- 4 opcode bits in CRd
- 3 opcode bits in opc2
- 4 opcode bits in CRm
- Total: 20 opcode bits

Wren6991/xh3cca.md