
@yupferris
Last active September 18, 2021 23:25
Scalar stream (barrel) processor for TnL acceleration in xenowing

Goals

  • Input vertex attrib stream(s), output transformed vertex data stream(s)
    • Which streams are used depends on the required capabilities - e.g. we don't always want/need lighting
  • Vertex transformation/deformation
    • Transform position vector into correct space, write out
    • Should support "spikeballs" somehow - e.g. scale object-space coords by a sinusoid whose phase depends on the vertex/thread index
  • Vertex lighting
    • Transform normal vector into correct space, dot product with light vector, scale rgba, write out
  • Relatively easy to program, probably via assembly (or similar)
    • VLIW can be neat, but latencies etc. become the burden of the programmer (or, typically, the compiler)
    • I still don't really want to dig too far into the compiler territory with this project, I do that enough elsewhere
    • A barrel processor helps trivialize this - the programmer only considers the kernel; there's no special-case fill/drain code (as with SIMD or VLIW), etc.
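
To make these goals concrete, here's a rough sketch (plain Rust, all names hypothetical) of the straight-line work a single thread would perform for the spikeball + lighting case. On the real unit this would be scalar instructions; stream loads/stores are modeled as arguments and return values here:

```rust
/// "Spikeball" deform: scale object-space coords by a sinusoid whose phase
/// depends on the vertex/thread index, then apply simple diffuse lighting
/// (n . l, clamped) to the vertex color. Constants here are illustrative.
fn kernel(
    thread_index: u32,
    pos: [f32; 3],
    normal: [f32; 3],
    light: [f32; 3],
    rgba: [f32; 4],
) -> ([f32; 3], [f32; 4]) {
    // Sinusoid phase from the thread index (amplitude/frequency made up).
    let scale = 1.0 + 0.25 * (thread_index as f32 * 0.1).sin();
    let out_pos = [pos[0] * scale, pos[1] * scale, pos[2] * scale];
    // Diffuse term: dot the normal with the light vector, clamp at zero.
    let n_dot_l =
        (normal[0] * light[0] + normal[1] * light[1] + normal[2] * light[2]).max(0.0);
    let out_rgba = [rgba[0] * n_dot_l, rgba[1] * n_dot_l, rgba[2] * n_dot_l, rgba[3]];
    (out_pos, out_rgba)
}
```

Note there's no branching at all - the kernel is the same straight-line scalar sequence for every vertex, which is exactly what makes the barrel scheduling below simple.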

Initial design summary

  • Scalar stream (barrel) processor
  • User's perspective: after programming/compiling kernel code (this can/should be done offline), upload instruction memory and "extra parameter" memory (matrices etc.), set up stream parameters (e.g. 2 read streams and 1 write stream, aligned to 32 bits; we need pointers and "thread strides" per stream, in addition to clearing caches), then dispatch N threads
  • M pipeline stages -> M threads in-flight (wide register files) on M hw contexts
  • Thread dispatch tries to dispatch a new thread each cycle onto a round-robin-selected hw context. If the hw context is already executing, issue the next instruction in that context instead. If N threads have already been dispatched, idle.
  • After N threads are issued and all hw contexts complete, flush write cache(s), signal completion
  • Note that stream loads/stores can potentially stall (cache miss); probably easiest to just stall the entire pipeline (all regs should have enable signals)
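
A minimal software model of the dispatch rule above (sizes and names are made up for illustration): each cycle visits the next hw context round-robin; an idle context gets a new thread if any remain, otherwise the context issues its thread's next instruction.

```rust
const M: usize = 4; // pipeline stages = hw contexts (value assumed)
const KERNEL_LEN: u32 = 3; // straight-line kernel length, in instructions

// Returns the (thread, pc) issue order for n_threads dispatched threads.
fn issue_order(n_threads: usize) -> Vec<(usize, u32)> {
    let mut contexts: [Option<(usize, u32)>; M] = [None; M];
    let mut next_thread = 0;
    let mut ctx = 0;
    let mut log = Vec::new();
    while next_thread < n_threads || contexts.iter().any(|c| c.is_some()) {
        if contexts[ctx].is_none() && next_thread < n_threads {
            contexts[ctx] = Some((next_thread, 0)); // dispatch a new thread
            next_thread += 1;
        }
        if let Some((tid, pc)) = contexts[ctx] {
            log.push((tid, pc)); // this context issues one instruction
            contexts[ctx] = if pc + 1 == KERNEL_LEN {
                None // thread finished; context is free next time around
            } else {
                Some((tid, pc + 1))
            };
        }
        ctx = (ctx + 1) % M; // round-robin
    }
    log
}
```

Running this shows the expected pattern: with M contexts, instructions from up to M threads interleave cycle-by-cycle, but each thread's own instructions still issue in program order.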

Instructions

  • Arithmetic:
    • Scalar add, sub, signed/unsigned multiply
    • Possibly approx. reciprocal, maybe sqrt and/or reciprocal sqrt (note that the "quake algorithm" relies on fp32 bit patterns, so it wouldn't carry over directly to fixed point)
    • If fixed point, a multiply will produce a 64-bit result. Instructions should also have a built-in (arithmetic or logical) shift-right, which implies we don't need dedicated shift-right instructions. (Left shifts might still be desirable - though perhaps not, as those can be calculated as multiplies by powers of two!)
  • L/S:
    • Load scalar from input stream (with stream id and target register)
    • Store scalar to output stream (with stream id and source register)
  • Control:
    • Finish current thread
      • Kind of annoying to add latency just to process some kind of "end" instruction - perhaps express this as an instruction bit (if we have enough free ones) instead? Alternatively, we could keep the program length in a separate register and automatically terminate a thread when its next PC matches that value.
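
If fixed point wins out, the multiply-with-fused-shift could look like this (a sketch assuming 32-bit Q16.16 operands; the actual format is an open question below):

```rust
// Fixed-point multiply with a built-in arithmetic shift-right, assuming
// 32-bit Q16.16 operands (the format is undecided in these notes).
fn mul_shr(a: i32, b: i32, shift: u32) -> i32 {
    // 32x32 -> 64-bit product; the fused shift selects which 32-bit
    // window of the full result to keep.
    ((a as i64 * b as i64) >> shift) as i32
}
```

With shift = 16 this is a plain Q16.16 multiply, and a left shift by k falls out for free as a multiply by 2^k (i.e. by `(1 << k) << 16` in Q16.16) - which is why dedicated shift instructions may be unnecessary.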

Compiler

  • It would be great to be able to specify programs at some sort of just-above-IR level in Rust code that could be compiled to instructions.
  • I don't think we need much more than liveness analysis/register allocation and some basic peephole optimizations (mainly to remove redundant moves after register allocation!). In fact, even structs can exist at the Rust "source level" and be "lowered" to scalar code without the compiler being aware of them.
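
As a sketch of the one peephole mentioned above (the IR and names here are invented): after register allocation, a move whose source and destination landed in the same physical register is a no-op and can simply be dropped.

```rust
// Tiny made-up post-allocation IR; only physical register numbers remain.
#[derive(Debug, Clone, PartialEq)]
enum Instr {
    Mov { dst: u8, src: u8 },
    Add { dst: u8, a: u8, b: u8 },
}

// Drop moves that became no-ops once both operands were allocated
// to the same physical register.
fn remove_redundant_moves(code: &[Instr]) -> Vec<Instr> {
    code.iter()
        .filter(|i| !matches!(i, Instr::Mov { dst, src } if dst == src))
        .cloned()
        .collect()
}
```

Because kernels are straight-line code, this pass needs no control-flow awareness at all - it's a single linear filter over the instruction list.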

Stream access ordering

  • When a stream is accessed more than once per thread, the accesses to the underlying stream (as seen from each thread) have a surprising order; that is, one with an interleave factor equal to the number of pipeline stages/execution contexts, rather than linear.
  • We'd like to maintain the illusion that each thread executes one-after-another, and that all accesses to the underlying streams are linear as if this were the case.
  • The best solution I've come up with so far is to statically determine a thread stride per stream. Each context has its own offset per stream which, on thread dispatch, is set to the previous thread's offset for that stream (or 0 for thread 0) plus the stream's thread stride. On each stream access within the execution of the thread, this offset is incremented as normal. This way the unit accesses memory in a strided fashion rather than linearly, but it looks linear from each thread's PoV. This puts greater importance on the cache structures used for each stream, but we don't need anything more than a single, basic read-only or write-only cache per stream (which I had planned on adding anyway).
    • The thread strides for all streams accessed by a kernel can be determined statically - since it's all straight-line code, we literally just need to count the number of loads/stores for each stream!
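
A sketch of that static count and the resulting offsets (the instruction encoding here is invented for illustration):

```rust
// Minimal stand-in for kernel instructions; only stream accesses matter here.
enum Instr {
    Load { stream: usize },
    Store { stream: usize },
    Other,
}

// The thread stride of a stream is just the number of accesses to it
// in the (straight-line) kernel.
fn thread_stride(kernel: &[Instr], stream: usize) -> usize {
    kernel
        .iter()
        .filter(|i| {
            matches!(i, Instr::Load { stream: s } | Instr::Store { stream: s } if *s == stream)
        })
        .count()
}

// The i-th access to a stream by thread t lands at this element offset,
// which is exactly where it would land if threads ran one-after-another.
fn access_offset(thread: usize, stride: usize, access_index: usize) -> usize {
    thread * stride + access_index
}
```

For example, with a kernel that reads stream 0 twice, thread 0 touches elements 0 and 1 and thread 1 touches elements 2 and 3 - strided in time from the unit's perspective, linear from each thread's.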

Open questions

  • Transforms (matrix/vector mul) can be broken down into dot products (row vectors by input vector). Does it make sense to add dedicated hw support for this?
  • Fixed or FP?
  • Can the compiler be written as a const fn?
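
On the first question, counting scalar ops shows what dedicated support would buy. A sketch (the `mac` helper models a hypothetical fused multiply-accumulate instruction, which is not currently in the instruction list above):

```rust
// A 4-element dot product is 4 multiplies + 3 adds as scalar code, so a
// full 4x4 matrix-by-vector transform is 16 multiplies + 12 adds. A fused
// multiply-accumulate (modeled by `mac` here) would cut that to 16 issues.
fn mac(acc: f32, a: f32, b: f32) -> f32 {
    acc + a * b
}

fn mat4_mul_vec4(m: &[[f32; 4]; 4], v: &[f32; 4]) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for r in 0..4 {
        // One dot product per output component (row vector . input vector).
        for c in 0..4 {
            out[r] = mac(out[r], m[r][c], v[c]);
        }
    }
    out
}
```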

Non-goals

  • branching/looping (covered at top-level)
    • This has nice implications - e.g. each vertex requires the same number of cycles (sans stalls) to complete, so if each is issued in order, all stream accesses are predictable/ordered as well
  • Primitive assembly/tile binning (these should be handled by CPU or some other unit(s))
  • Proper normal calculations for spikeballs :)
  • DSP cases with feedback