This document describes the Hazard3 closely-coupled accelerator (CCA) interface and the associated Xh3cca RISC-V extension. The purpose of the CCA interface is:
- Higher write throughput into core-local accelerators
- Support for I/O stalls that do not block debug and IRQs (impossible on AHB)
- Access to accelerators without generating addresses first (reduced register pressure)
- Access to accelerators without address-dependent protection checks (improved control-path timing)
- Atomic (non-tearing) reads of 64-bit buses
- Compatibility with vendor coprocessors designed for Cortex-M systems
Up to eight accelerators are connected to the core through a control bus and a pair of 2×XLEN-bit data buses (read and write).
The Xh3cca extension allocates the entirety of the custom-3 major opcode (instr[6:0] = 7'b1111011). It adds no new CSRs.
Hazard3 supports the following CCA throughputs (XLEN=32):
- 32-bit or 64-bit write in one cycle
- 32-bit read in one cycle
- 64-bit read in two cycles (all inputs sampled on the same cycle)
- 64-bit write and a 32-bit read in one cycle
The bandwidth is limited by Hazard3's two-read one-write register file implementation. Support for 64-bit reads is necessary for compatibility with coprocessors designed for Cortex-M systems. The throughput of 64-bit reads may increase on future implementations.
Closely-coupled accelerator instructions are allocated in the custom-3 opcode. For all instructions, funct3 (instr[14:12]) selects one of 8 accelerators.
opcode = custom-3, instr[31:30] = 2'b00
Data processing. Issue an instruction to the accelerator without performing any data transfer between the core and accelerator.
The following instruction bits are usable for accelerator decode: 29:15, 11:7 (total = 20)
opcode = custom-3, instr[31:30] = 2'b01
Write XLEN bits from rs1 into accelerator.
The following instruction bits are usable for accelerator decode: 29:20, 11:7 (total = 15)
opcode = custom-3, instr[31:30] = 2'b10
Read XLEN bits from accelerator into rd.
The following instruction bits are usable for accelerator decode: 29:15 (total = 15)
opcode = custom-3, instr[31:29] = 3'b110
Write 2×XLEN bits from rs1, rs2 into accelerator.
The following instruction bits are usable for accelerator decode: 28:25, 11:7 (total = 9)
opcode = custom-3, instr[31:28] = 4'b1110
Read 2×XLEN bits from accelerator into rd, rd + 1. rd[0] must be 0 (else reserved). Data bus is captured into stage-X and stage-M result registers simultaneously so an atomic 64-bit read can be committed to the register file in two cycles.
The following instruction bits are usable for accelerator decode: 27:15 (total = 13)
opcode = custom-3, instr[31:28] = 4'b1111
Write 2×XLEN bits from rs1, rs2 into accelerator. Simultaneously read XLEN bits from accelerator into rd. The accelerator read data may depend on the accelerator write data; equivalently the value returned in rd may be a function of rs1 and rs2, in addition to any internal accelerator state.
The following instruction bits are usable for accelerator decode: 27:25 (total = 3). The accelerator may also ignore the value of rs2 and decode bits 24:20 of the instruction, which gives an additional 5 bits.
The following table summarises which instruction bits are available for accelerator decode on which instructions:
| Instruction | 29 | 28 | 27:25 | 24:20 | 19:15 | 11:7 | (Total) |
|---|---|---|---|---|---|---|---|
| h3.cca.cdp | Y | Y | Y | Y | Y | Y | 20 |
| h3.cca.r | Y | Y | Y | Y | Y | | 15 |
| h3.cca.rr | | | Y | Y | Y | | 13 |
| h3.cca.w | Y | Y | Y | Y | | Y | 15 |
| h3.cca.ww | | Y | Y | | | Y | 9 |
| h3.cca.rww | | | Y | | | | 3 |
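The prefix encodings and decode-bit totals above can be cross-checked mechanically. The following Python sketch is illustrative only: the names `field`, `classify`, `free_bit_count` and `DECODE_BITS` are hypothetical helpers, not part of Hazard3. The prefixes and bit ranges are copied from the instruction descriptions above.

```python
CUSTOM3 = 0b1111011  # custom-3 major opcode in instr[6:0]

# Bit ranges (high, low) left for accelerator decode, per the descriptions above.
DECODE_BITS = {
    "h3.cca.cdp": [(29, 15), (11, 7)],
    "h3.cca.w":   [(29, 20), (11, 7)],
    "h3.cca.r":   [(29, 15)],
    "h3.cca.ww":  [(28, 25), (11, 7)],
    "h3.cca.rr":  [(27, 15)],
    "h3.cca.rww": [(27, 25)],
}


def field(instr, hi, lo):
    """Extract instr[hi:lo] as an unsigned integer."""
    return (instr >> lo) & ((1 << (hi - lo + 1)) - 1)


def classify(instr):
    """Return (mnemonic, accelerator select), or None if not custom-3.

    This sketch does not check the h3.cca.rr requirement that rd[0] is 0.
    """
    if field(instr, 6, 0) != CUSTOM3:
        return None
    select = field(instr, 14, 12)  # funct3 selects one of 8 accelerators
    if field(instr, 31, 30) == 0b00:
        name = "h3.cca.cdp"
    elif field(instr, 31, 30) == 0b01:
        name = "h3.cca.w"
    elif field(instr, 31, 30) == 0b10:
        name = "h3.cca.r"
    elif field(instr, 31, 29) == 0b110:
        name = "h3.cca.ww"
    elif field(instr, 31, 28) == 0b1110:
        name = "h3.cca.rr"
    else:  # instr[31:28] == 4'b1111
        name = "h3.cca.rww"
    return name, select


def free_bit_count(name):
    """Number of instruction bits available for accelerator decode."""
    return sum(hi - lo + 1 for hi, lo in DECODE_BITS[name])
```

Summing the ranges this way reproduces the totals in the table, which is a quick sanity check when allocating accelerator opcodes.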
Request phase, core -> accelerator:
- cca_vld: core asserts to request access; may deassert if stalled by cca_rdy low. All other request-phase outputs must be ignored when cca_vld is deasserted.
- cca_priv[1:0]: current privilege level, 3=M-mode 1=S-mode 0=U-mode
- cca_select[2:0]: accelerator selection (funct3)
- cca_opcode[19:0]: the following instruction bits: 29:15, 11:7
- cca_ren: request a transfer from accelerator to core
- cca_rsize: read data size (0 = XLEN, 1 = 2×XLEN)
- cca_wen: request a transfer from core to accelerator
- cca_wsize: write data size (0 = XLEN, 1 = 2×XLEN)
Request phase, accelerator -> core:
- cca_rdy: accelerator asserts to accept instruction. Instruction is accepted if cca_vld && cca_rdy.
- cca_err: accelerator asserts to indicate invalid instruction. Valid whenever cca_rdy is asserted. The core takes an illegal instruction exception on cca_vld && cca_rdy && cca_err.
Data phase:
- cca_wdata[63:0]: core -> accelerator, valid on the cycle after cca_vld && cca_rdy && !cca_err && cca_wen
- cca_rdata[63:0]: accelerator -> core, valid on the cycle after cca_vld && cca_rdy && !cca_err && cca_ren
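The handshake conditions above reduce to a small truth function. The sketch below is a behavioural illustration in Python, not RTL from Hazard3. The function names and the outcome strings are hypothetical. Only the boolean conditions come from the signal descriptions.

```python
def request_outcome(vld, rdy, err):
    """Classify one request-phase cycle from cca_vld, cca_rdy and cca_err."""
    if not vld:
        return "idle"                 # other request-phase outputs are ignored
    if not rdy:
        return "stalled"              # core may hold or withdraw the request
    if err:
        return "illegal_instruction"  # cca_vld && cca_rdy && cca_err
    return "accepted"                 # cca_vld && cca_rdy && !cca_err


def data_phase_valid(vld, rdy, err, wen, ren):
    """Return (wdata_valid, rdata_valid) for the cycle after this one."""
    accepted = bool(vld and rdy and not err)
    return (accepted and bool(wen), accepted and bool(ren))
```

For example, `data_phase_valid(1, 1, 0, 1, 1)` models an accepted instruction with both transfers requested, so both data buses are valid on the following cycle.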
For h3.cca.cdp instructions, neither wen nor ren is asserted. For h3.cca.rww instructions, both wen and ren are asserted. The control signals for each instruction are summarised by the following table:
| Instruction | wen | wsize | ren | rsize |
|---|---|---|---|---|
| h3.cca.cdp | 0 | 0 | 0 | 0 |
| h3.cca.r | 0 | 0 | 1 | 0 |
| h3.cca.rr | 0 | 0 | 1 | 1 |
| h3.cca.w | 1 | 0 | 0 | 0 |
| h3.cca.ww | 1 | 1 | 0 | 0 |
| h3.cca.rww | 1 | 1 | 1 | 0 |
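As a worked illustration of the control signals, the Python sketch below derives the transfer widths from the table and gathers the 20 bits of cca_opcode from an instruction word. The helper names are hypothetical. The ordering inside `cca_opcode` (instr[29:15] in the upper 15 bits, instr[11:7] in the lower 5) is an assumption based on the order in which the signal description lists the bit ranges.

```python
# (wen, wsize, ren, rsize) per instruction, copied from the table above.
CONTROL = {
    "h3.cca.cdp": (0, 0, 0, 0),
    "h3.cca.r":   (0, 0, 1, 0),
    "h3.cca.rr":  (0, 0, 1, 1),
    "h3.cca.w":   (1, 0, 0, 0),
    "h3.cca.ww":  (1, 1, 0, 0),
    "h3.cca.rww": (1, 1, 1, 0),
}


def transfer_bits(name, xlen=32):
    """Return (bits written to accelerator, bits read from accelerator)."""
    wen, wsize, ren, rsize = CONTROL[name]
    written = wen * xlen * (2 if wsize else 1)
    read = ren * xlen * (2 if rsize else 1)
    return written, read


def cca_opcode(instr):
    """Gather instr[29:15] and instr[11:7] into a 20-bit value.

    Assumed packing: instr[29:15] -> bits 19:5, instr[11:7] -> bits 4:0.
    """
    return (((instr >> 15) & 0x7FFF) << 5) | ((instr >> 7) & 0x1F)
```

With XLEN=32, `transfer_bits("h3.cca.rww")` gives a 64-bit write and a 32-bit read, matching the single-cycle throughput listed earlier.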
Xh3cca is specified for both 32-bit (XLEN=32) and 64-bit (XLEN=64) processors. The width of the accelerator data buses is always twice XLEN.
A 32-bit accelerator can be attached to a 64-bit core by connecting bits (63:32, 31:0) of the accelerator data buses to bits (95:64, 31:0) of the core data buses, and tying unused inputs to zero. It is not possible in general to connect 64-bit accelerators to 32-bit cores.
A processor with 64-bit support (MXLEN=64) executing in 32-bit mode (xXLEN=32 for current mode x) uses bits 95:64 and 31:0 of the core data buses. Remaining core outputs are driven to zero, and remaining core inputs are ignored. This enables access to 32-bit accelerators when XLEN=32 but limits the use of 64-bit accelerators.
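The bus mapping between a 32-bit accelerator and a 64-bit core can be sketched as two bit-shuffling functions. This is a Python illustration with hypothetical function names. It assumes each 32-bit accelerator word occupies the low 32 bits of the corresponding 64-bit core word, so the upper accelerator word lands at core bits 95:64.

```python
MASK32 = 0xFFFFFFFF


def accel_rdata_to_core(accel_rdata):
    """Map a 64-bit accelerator read bus onto the 128-bit core read bus."""
    lo = accel_rdata & MASK32          # accelerator bits 31:0
    hi = (accel_rdata >> 32) & MASK32  # accelerator bits 63:32
    # Core bits 31:0 and 95:64 carry data; all other core inputs are tied to zero.
    return (hi << 64) | lo


def core_wdata_to_accel(core_wdata):
    """Map the 128-bit core write bus onto a 64-bit accelerator write bus."""
    lo = core_wdata & MASK32           # core bits 31:0
    hi = (core_wdata >> 64) & MASK32   # core bits 95:64
    return (hi << 32) | lo
```

The two functions are inverses for any 64-bit accelerator value, which reflects that no accelerator data is lost when the unused core bits are zero.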
RV128I is not supported because the custom-3 opcode is reserved for standard use on this architecture.
There is no core-side control of access to accelerators. Xh3cca considers accelerator access to be unprivileged.
Accelerators may implement access control by decoding the current core privilege level on cca_priv[1:0] and returning a decode error on cca_err if the privilege is insufficient for the requested operation.
The interconnect between the core and the accelerators can also implement access control by banking the 3-bit accelerator select space (cca_select[2:0]) across different values of cca_priv[1:0], so that different privilege levels see different accelerators or different subsets of the same group of accelerators.
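One way to picture banking of the select space is a lookup keyed on privilege level and select value. The Python sketch below is purely illustrative. The table contents, the `route` function, and the choice to signal an error for unmapped combinations are all assumptions, not behaviour defined by Xh3cca.

```python
M_MODE, S_MODE, U_MODE = 3, 1, 0  # cca_priv[1:0] encodings

# Hypothetical banking table: BANKS[priv][select] -> physical accelerator index.
BANKS = {
    M_MODE: {0: 0, 1: 1, 2: 2},  # M-mode sees three accelerators
    S_MODE: {0: 1, 1: 2},        # S-mode sees two of them, renumbered
    U_MODE: {0: 2},              # U-mode sees only one
}


def route(priv, select):
    """Return (physical accelerator index, err) for a request.

    err=True models the interconnect answering with cca_err because
    nothing is mapped at this (priv, select) pair.
    """
    phys = BANKS.get(priv, {}).get(select)
    if phys is None:
        return None, True
    return phys, False
```

In this arrangement the same physical accelerator appears at different select values depending on privilege, so less privileged software cannot name accelerators outside its bank.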
The Xh3cca instructions h3.cca.r, h3.cca.rr, h3.cca.w, h3.cca.ww and h3.cca.cdp are analogous to the Arm mrc, mrrc, mcr, mcrr and cdp instructions. These instructions were originally described in ARMv5TE and are still implemented today in Armv7-M, Armv8-M Main and Armv8.1-M Main.
The analogous Xh3cca instructions have (at least) the same number of undecoded opcode bits as their ARMv5TE counterparts, so coprocessor opcodes can be translated 1:1 at the boundary of the core with appropriate multiplexing of the opcode bits. Coprocessors designed for Cortex-M systems can be adapted for use with the Hazard3 CCA interface with minimal additional circuitry.
Accelerators designed specifically for Hazard3 CCA can take advantage of the higher throughput of h3.cca.rww (64-bit write and 32-bit read in one cycle), or use this instruction to implement custom register-register operations.
- mcr:
  - 4-bit coprocessor select
  - 1 opcode bit for mcr/mcr2
  - 4 opcode bits in CRn specifier
  - 4 opcode bits in CRm specifier
  - 3 opcode bits in opc1
  - 3 opcode bits in opc2
  - One core register specifier Rt (would be rs1 on RISC-V)
  - Total: 15 opcode bits plus one register
- mcrr:
  - 1 opcode bit for mcrr/mcrr2
  - 4 opcode bits in opc1
  - 4 opcode bits in CRm specifier
  - Two core register specifiers Rt, Rt2 (rs1, rs2 in RISC-V terms)
  - Total: 9 opcode bits plus two registers
- mrc:
  - 1 opcode bit for mrc/mrc2
  - 4 opcode bits in CRn specifier
  - 4 opcode bits in CRm specifier
  - 3 opcode bits in opc1
  - 3 opcode bits in opc2
  - One core register specifier Rt (rd in RISC-V terms)
  - Total: 15 opcode bits plus one register (identical to mcr except the register direction has changed)
- mrrc:
  - Opcode space requirements identical to mcrr
- cdp:
  - 1 opcode bit for cdp/cdp2
  - 4 opcode bits in opc1
  - 4 opcode bits in CRn
  - 4 opcode bits in CRd
  - 3 opcode bits in opc2
  - 4 opcode bits in CRm
  - Total: 20 opcode bits
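To make the bit budget concrete, the following Python sketch packs the 15 mcr opcode bits into the 15 free bits of an h3.cca.w instruction (29:20 and 11:7). The field order and the function names are hypothetical: the text above only establishes that enough bits exist, not how they are arranged. The sketch also takes a 3-bit accelerator select directly and does not address how the 4-bit Arm coprocessor number would be mapped.

```python
CUSTOM3 = 0b1111011  # custom-3 major opcode in instr[6:0]


def pack_mcr_opcode(is_mcr2, crn, crm, opc1, opc2):
    """Pack the 15 mcr opcode bits into one integer (assumed field order)."""
    assert is_mcr2 in (0, 1) and 0 <= crn < 16 and 0 <= crm < 16
    assert 0 <= opc1 < 8 and 0 <= opc2 < 8
    return (is_mcr2 << 14) | (crn << 10) | (crm << 6) | (opc1 << 3) | opc2


def encode_h3_cca_w(select, rs1, opcode15):
    """Build an h3.cca.w instruction word.

    Assumed placement: opcode15[14:5] -> instr[29:20], opcode15[4:0] -> instr[11:7].
    """
    assert 0 <= select < 8 and 0 <= rs1 < 32 and 0 <= opcode15 < (1 << 15)
    return (
        (0b01 << 30)                  # instr[31:30] = 2'b01 selects XLEN write
        | ((opcode15 >> 5) << 20)     # upper 10 opcode bits
        | (rs1 << 15)                 # source register
        | (select << 12)              # funct3: accelerator select
        | ((opcode15 & 0x1F) << 7)    # lower 5 opcode bits
        | CUSTOM3
    )
```

The same approach applies to the other instruction pairs, since each Xh3cca instruction has at least as many free bits as its Arm counterpart.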